From leonida at voltaire.com Thu Jun 1 01:50:14 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Thu, 01 Jun 2006 11:50:14 +0300 Subject: [openib-general][PATCH 1 of 3] repost: Client Reregister support for kernel space In-Reply-To: References: <20060509060958.GA482@voltaire.com> Message-ID: <447EAA46.9080905@voltaire.com> Thank you! Looks fine. I don't see anything that can harm. 
The only thing I can't be completely sure about is the change in ipath_mad.c - I'm less familiar with that code and didn't check it myself. Anyway, the change seems fine too. Roland Dreier wrote: > OK, I cleaned up your patches and applied the following to my > for-2.6.18 tree. I think all of my changes were fixes and/or > cleanups, but you may want to check that I didn't break anything -- > I'm sending the 5 patches I ended up with to the list. > > - R. > 
From rkuchimanchi at silverstorm.com Thu Jun 1 07:12:13 2006 From: rkuchimanchi at silverstorm.com (Ramchandra K) Date: Thu, 01 Jun 2006 19:42:13 +0530 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: References: Message-ID: <1149171133.7588.45.camel@Prawra.gs-lab.com> On Mon, 2006-05-29 at 10:07 -0700, Roland Dreier wrote: > Overall seems OK. Some comments: I am resending the patch with the modifications you suggested. > > +#define SRP_REV10_IO_CLASS 0xFF00 > > +#define SRP_REV16A_IO_CLASS 0x0100 > > I think these should be in an enum in <scsi/srp.h>, since they're > generic constants from the SRP spec. > I have defined the IO class values as an enum in <scsi/srp.h>. I am sending this as a separate patch. I am not sure if those changes are to be submitted here, since srp.h is not in the Open Fabrics code base. But both the patches have to be applied together for the SRP code to compile. Signed-off-by: Ramachandra K Index: infiniband/ulp/srp/ib_srp.c =================================================================== --- infiniband/ulp/srp/ib_srp.c (revision 7615) +++ infiniband/ulp/srp/ib_srp.c (working copy) @@ -321,8 +321,33 @@ req->priv.req_it_iu_len = cpu_to_be32(srp_max_iu_len); req->priv.req_buf_fmt = cpu_to_be16(SRP_BUF_FORMAT_DIRECT | SRP_BUF_FORMAT_INDIRECT); - memcpy(req->priv.initiator_port_id, target->srp_host->initiator_port_id, 16); /* + * Older targets conforming to Rev 10 of the SRP specification + * use the port identifier format which is + * + * lower 8 bytes : GUID + * upper 8 bytes : extension + * + * Whereas according to the new SRP specification (Rev 16a), the + * port identifier format is + * + * lower 8 bytes : extension + * upper 8 bytes : GUID + * + * So check the IO class of the target to decide which format to use. 
+ */ + + /* If its Rev 10, flip the initiator port id fields */ + if (target->io_class == SRP_REV10_IO_CLASS) { + memcpy(req->priv.initiator_port_id, + target->srp_host->initiator_port_id + 8 , 8); + memcpy(req->priv.initiator_port_id + 8, + target->srp_host->initiator_port_id, 8); + } else { + memcpy(req->priv.initiator_port_id, + target->srp_host->initiator_port_id, 16); + } + /* * Topspin/Cisco SRP targets will reject our login unless we * zero out the first 8 bytes of our initiator port ID. The * second 8 bytes must be our local node GUID, but we always @@ -334,8 +359,13 @@ (unsigned long long) be64_to_cpu(target->ioc_guid)); memset(req->priv.initiator_port_id, 0, 8); } - memcpy(req->priv.target_port_id, &target->id_ext, 8); - memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); + if (target->io_class == SRP_REV10_IO_CLASS) { + memcpy(req->priv.target_port_id, &target->ioc_guid, 8); + memcpy(req->priv.target_port_id + 8, &target->id_ext, 8); + } else { + memcpy(req->priv.target_port_id, &target->id_ext, 8); + memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); + } status = ib_send_cm_req(target->cm_id, &req->param); @@ -1513,6 +1543,7 @@ SRP_OPT_SERVICE_ID = 1 << 4, SRP_OPT_MAX_SECT = 1 << 5, SRP_OPT_MAX_CMD_PER_LUN = 1 << 6, + SRP_OPT_IO_CLASS = 1 << 7, SRP_OPT_ALL = (SRP_OPT_ID_EXT | SRP_OPT_IOC_GUID | SRP_OPT_DGID | @@ -1528,6 +1559,7 @@ { SRP_OPT_SERVICE_ID, "service_id=%s" }, { SRP_OPT_MAX_SECT, "max_sect=%d" }, { SRP_OPT_MAX_CMD_PER_LUN, "max_cmd_per_lun=%d" }, + { SRP_OPT_IO_CLASS, "io_class=%x" }, { SRP_OPT_ERR, NULL } }; @@ -1611,7 +1643,19 @@ } target->scsi_host->cmd_per_lun = min(token, SRP_SQ_SIZE); break; - + case SRP_OPT_IO_CLASS: + if (match_hex(args, &token)) { + printk(KERN_WARNING PFX "bad IO class parameter '%s' \n", p); + goto out; + } + if (token == SRP_REV10_IO_CLASS || token == SRP_REV16A_IO_CLASS) + target->io_class = token; + else + printk(KERN_WARNING PFX "unknown IO class parameter value" + " %x specified. Use %x or %x. 
Defaulting to IO class %x\n", + token, SRP_REV10_IO_CLASS, SRP_REV16A_IO_CLASS, + SRP_REV16A_IO_CLASS); + break; default: printk(KERN_WARNING PFX "unknown parameter or missing value " "'%s' in target creation request\n", p); @@ -1654,6 +1698,7 @@ target = host_to_target(target_host); memset(target, 0, sizeof *target); + target->io_class = SRP_REV16A_IO_CLASS; target->scsi_host = target_host; target->srp_host = host; Index: infiniband/ulp/srp/ib_srp.h =================================================================== --- infiniband/ulp/srp/ib_srp.h (revision 7615) +++ infiniband/ulp/srp/ib_srp.h (working copy) @@ -122,6 +122,7 @@ __be64 id_ext; __be64 ioc_guid; __be64 service_id; + __be16 io_class; struct srp_host *srp_host; struct Scsi_Host *scsi_host; char target_name[32]; From rkuchimanchi at silverstorm.com Thu Jun 1 07:12:25 2006 From: rkuchimanchi at silverstorm.com (Ramchandra K) Date: Thu, 01 Jun 2006 19:42:25 +0530 Subject: [openib-general] [PATCH] Define IO class values in Message-ID: <1149171145.7588.46.camel@Prawra.gs-lab.com> Hi Roland, This patch adds IO class values of SRP Rev 10 and Rev 16a to aid in deciding the port identifier format to be used. Regards, Ram Signed-off-by: Ramachandra K --- orig/include/scsi/srp.h 2006-06-01 00:45:13.000000000 -0400 +++ wc/include/scsi/srp.h 2006-06-01 00:58:10.000000000 -0400 @@ -44,6 +44,11 @@ #include enum { + SRP_REV10_IO_CLASS = 0xFF00, + SRP_REV16A_IO_CLASS = 0x0100 +}; + +enum { SRP_LOGIN_REQ = 0x00, SRP_TSK_MGMT = 0x01, SRP_CMD = 0x02, From rkuchimanchi at silverstorm.com Thu Jun 1 07:12:32 2006 From: rkuchimanchi at silverstorm.com (Ramchandra K) Date: Thu, 01 Jun 2006 19:42:32 +0530 Subject: [openib-general] [PATCH] (Resend) SRPTOOLS: print out the target io_class in ibsrpdm Message-ID: <1149171152.7588.47.camel@Prawra.gs-lab.com> Hi Roland, Resending the patch that prints out the target io class value in ibsrpdm to aid in specifying the target creation parameter - io_class. 
Regards, Ram Signed-off-by: Ramachandra K Index: userspace/srptools/src/srp-dm.c =================================================================== --- userspace/srptools/src/srp-dm.c (revision 7617) +++ userspace/srptools/src/srp-dm.c (working copy) @@ -398,6 +398,7 @@ (unsigned long long) ntohll(ioc_prof.guid)); pr_human(" vendor ID: %06x\n", ntohl(ioc_prof.vendor_id) >> 8); pr_human(" device ID: %06x\n", ntohl(ioc_prof.device_id)); + pr_human(" IO class : %hx\n", ntohs(ioc_prof.io_class)); pr_human(" ID: %s\n", ioc_prof.id); pr_human(" service entries: %d\n", ioc_prof.service_entries); @@ -429,11 +430,13 @@ "ioc_guid=%016llx," "dgid=%016llx%016llx," "pkey=ffff," + "io_class=%hx," "service_id=%016llx\n", id_ext, (unsigned long long) ntohll(ioc_prof.guid), (unsigned long long) subnet_prefix, (unsigned long long) guid, + (unsigned short) ntohs(ioc_prof.io_class), (unsigned long long) ntohll(svc_entries.service[k].id)); } } From rdreier at cisco.com Thu Jun 1 07:29:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 07:29:32 -0700 Subject: [openib-general] [PATCH 4/5] IB/mthca: Add client reregister event generation In-Reply-To: <447E7C9E.1060907@mellanox.co.il> (Eitan Zahavi's message of "Thu, 01 Jun 2006 08:35:26 +0300") References: <20060531223205.10506.51241.stgit@localhost.localdomain> <20060531223215.10506.28838.stgit@localhost.localdomain> <447E7C9E.1060907@mellanox.co.il> Message-ID: Eitan> Hi Roland, Is there a reason why the LID_CHANGE event is Eitan> happening even if the LID did not change? It was used as a proxy for client reregister-like events before client reregister existed. - R. 
From swise at opengridcomputing.com Thu Jun 1 10:00:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 01 Jun 2006 12:00:33 -0500 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. In-Reply-To: <447E1720.7000307@ichips.intel.com> References: <20060531182650.3308.81538.stgit@stevo-desktop> <20060531182652.3308.1244.stgit@stevo-desktop> <447E1720.7000307@ichips.intel.com> Message-ID: <1149181233.31610.34.camel@stevo-desktop> On Wed, 2006-05-31 at 15:22 -0700, Sean Hefty wrote: > Steve Wise wrote: > > +/* > > + * Release a reference on cm_id. If the last reference is being removed > > + * and iw_destroy_cm_id is waiting, wake up the waiting thread. > > + */ > > +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) > > +{ > > + int ret = 0; > > + > > + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); > > + if (atomic_dec_and_test(&cm_id_priv->refcount)) { > > + BUG_ON(!list_empty(&cm_id_priv->work_list)); > > + if (waitqueue_active(&cm_id_priv->destroy_wait)) { > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); > > + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, > > + &cm_id_priv->flags)); > > + ret = 1; > > + wake_up(&cm_id_priv->destroy_wait); > > We recently changed the RDMA CM, IB CM, and a couple of other modules from using > wait objects to completions. This avoids a race condition between decrementing > the reference count, which allows destruction to proceed, and calling wake_up on > a freed cm_id. My guess is that you may need to do the same. > Good catch. Yes, the IW CM suffers from the same race condition. 
I'll change this to use completions... > Can you also explain the use of the return value here? It's ignored below in > rem_ref() and destroy_cm_id(). > The return value is supposed to indicate whether this call to deref _may_ have resulted in waking up another thread and the cm_id being freed. It's used in cm_work_handler(), in conjunction with setting the IWCM_F_CALLBACK_DESTROY flag, to know whether the cm_id needs to be freed on the callback path. > > +static void add_ref(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + atomic_inc(&cm_id_priv->refcount); > > +} > > + > > +static void rem_ref(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + iwcm_deref_id(cm_id_priv); > > +} > > + > > > +/* > > + * CM_ID <-- CLOSING > > + * > > + * Block if a passive or active connection is currently being processed. Then > > + * process the event as follows: > > + * - If we are ESTABLISHED, move to CLOSING and modify the QP state > > + * based on the abrupt flag > > + * - If the connection is already in the CLOSING or IDLE state, the peer is > > + * disconnecting concurrently with us and we've already seen the > > + * DISCONNECT event -- ignore the request and return 0 > > + * - Disconnect on a listening endpoint returns -EINVAL > > + */ > > +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret = 0; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + /* Wait if we're currently in a connect or accept downcall */ > > + wait_event(cm_id_priv->connect_wait, > > + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > > Am I understanding this check correctly? 
You're checking to see if the user has > called iw_cm_disconnect() at the same time that they called iw_cm_connect() or > iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an > event? The CM must wait for the low level provider to finish a connect() or accept() operation before telling the low level provider to disconnect via modifying the iwarp QP. Regardless of whether they block, this disconnect can happen concurrently with the connect/accept so we need to hold the disconnect until the connect/accept completes. > > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_ESTABLISHED: > > + cm_id_priv->state = IW_CM_STATE_CLOSING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { /* QP could be for user-mode client */ > > + if (abrupt) > > + ret = iwcm_modify_qp_err(cm_id_priv->qp); > > + else > > + ret = iwcm_modify_qp_sqd(cm_id_priv->qp); > > + /* > > + * If both sides are disconnecting the QP could > > + * already be in ERR or SQD states > > + */ > > + ret = 0; > > + } > > + else > > + ret = -EINVAL; > > + break; > > + case IW_CM_STATE_LISTEN: > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = -EINVAL; > > + break; > > + case IW_CM_STATE_CLOSING: > > + /* remote peer closed first */ > > + case IW_CM_STATE_IDLE: > > + /* accept or connect returned !0 */ > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_RECV: > > + /* > > + * App called disconnect before/without calling accept after > > + * connect_request event delivered. 
> > + */ > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_SENT: > > + /* Can only get here if wait above fails */ > > + default: > > + BUG_ON(1); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_disconnect); > > +static void destroy_cm_id(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + /* Wait if we're currently in a connect or accept downcall. A > > + * listening endpoint should never block here. */ > > + wait_event(cm_id_priv->connect_wait, > > + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > > Same question/comment as above. > Same answer. > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_LISTEN: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + /* destroy the listening endpoint */ > > + ret = cm_id->device->iwcm->destroy_listen(cm_id); > > + break; > > + case IW_CM_STATE_ESTABLISHED: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + /* Abrupt close of the connection */ > > + (void)iwcm_modify_qp_err(cm_id_priv->qp); > > + break; > > + case IW_CM_STATE_IDLE: > > + case IW_CM_STATE_CLOSING: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_RECV: > > + /* > > + * App called destroy before/without calling accept after > > + * receiving connection request event notification. 
> > + */ > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_SENT: > > + case IW_CM_STATE_DESTROYING: > > + default: > > + BUG_ON(1); > > + break; > > + } > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > As an alternative, you could hold the lock from above, and let the LISTEN / > ESTABLISHED state checks release and reacquire. > Yes, perhaps that's cleaner. > > + if (cm_id_priv->qp) { > > + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + (void)iwcm_deref_id(cm_id_priv); > > +} > > + > > +/* > > + * This function is only called by the application thread and cannot > > + * be called by the event thread. The function will wait for all > > + * references to be released on the cm_id and then kfree the cm_id > > + * object. > > + */ > > +void iw_destroy_cm_id(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); > > + > > + destroy_cm_id(cm_id); > > + > > + wait_event(cm_id_priv->destroy_wait, > > + !atomic_read(&cm_id_priv->refcount)); > > + > > + kfree(cm_id_priv); > > +} > > +EXPORT_SYMBOL(iw_destroy_cm_id); > > + > > +/* > > + * CM_ID <-- LISTEN > > + * > > + * Start listening for connect requests. Generates one CONNECT_REQUEST > > + * event for each inbound connect request. 
> > + */ > > +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret = 0; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_IDLE: > > + cm_id_priv->state = IW_CM_STATE_LISTEN; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); > > + if (ret) > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + break; > > + default: > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = -EINVAL; > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_listen); > > + > > +/* > > + * CM_ID <-- IDLE > > + * > > + * Rejects an inbound connection request. No events are generated. > > + */ > > +int iw_cm_reject(struct iw_cm_id *cm_id, > > + const void *private_data, > > + u8 private_data_len) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->reject(cm_id, private_data, > > + private_data_len); > > + > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_reject); > > + > > +/* > > + * CM_ID <-- ESTABLISHED > > + * > > + * Accepts an inbound connection request and generates an 
ESTABLISHED > > + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block > > + * until the ESTABLISHED event is received from the provider. > > + */ > > This makes it sound like we're just waiting for an event. > disconnect/destroy paths wait for the provider to complete the accept or connect operation. > > +int iw_cm_accept(struct iw_cm_id *cm_id, > > + struct iw_cm_conn_param *iw_param) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + struct ib_qp *qp; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + /* Get the ib_qp given the QPN */ > > + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); > > + if (!qp) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + return -EINVAL; > > + } > > + cm_id->device->iwcm->add_ref(qp); > > + cm_id_priv->qp = qp; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->accept(cm_id, iw_param); > > + if (ret) { > > + /* An error on accept precludes provider events */ > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { > > + cm_id->device->iwcm->rem_ref(qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + printk("Accept failed, ret=%d\n", ret); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_accept); > > + > > +/* > > + * Active Side:
CM_ID <-- CONN_SENT > > + * > > + * If successful, results in the generation of a CONNECT_REPLY > > + * event. iw_cm_disconnect and iw_cm_destroy will block until the > > + * CONNECT_REPLY event is received from the provider. > > + */ > > +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + int ret = 0; > > + unsigned long flags; > > + struct ib_qp *qp; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_IDLE) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + > > + /* Get the ib_qp given the QPN */ > > + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); > > + if (!qp) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + return -EINVAL; > > + } > > + cm_id->device->iwcm->add_ref(qp); > > + cm_id_priv->qp = qp; > > + cm_id_priv->state = IW_CM_STATE_CONN_SENT; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->connect(cm_id, iw_param); > > + if (ret) { > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { > > + cm_id->device->iwcm->rem_ref(qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + printk("Connect failed, ret=%d\n", ret); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_connect); > > + > > +/* > > + * Passive Side: new CM_ID <-- CONN_RECV > > + * > > + * Handles an inbound connect request. 
The function creates a new > > + * iw_cm_id to represent the new connection and inherits the client > > + * callback function and other attributes from the listening parent. > > + * > > + * The work item contains a pointer to the listen_cm_id and the event. The > > + * listen_cm_id contains the client cm_handler, context and > > + * device. These are copied when the device is cloned. The event > > + * contains the new four tuple. > > + * > > + * An error on the child should not affect the parent, so this > > + * function does not return a value. > > + */ > > +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + struct iw_cm_id *cm_id; > > + struct iwcm_id_private *cm_id_priv; > > + int ret; > > + > > + /* The provider should never generate a connection request > > + * event with a bad status. > > + */ > > + BUG_ON(iw_event->status); > > + > > + /* We could be destroying the listening id. If so, ignore this > > + * upcall. 
*/ > > + spin_lock_irqsave(&listen_id_priv->lock, flags); > > + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { > > + spin_unlock_irqrestore(&listen_id_priv->lock, flags); > > + return; > > + } > > + spin_unlock_irqrestore(&listen_id_priv->lock, flags); > > + > > + cm_id = iw_create_cm_id(listen_id_priv->id.device, > > + listen_id_priv->id.cm_handler, > > + listen_id_priv->id.context); > > + /* If the cm_id could not be created, ignore the request */ > > + if (IS_ERR(cm_id)) > > + return; > > + > > + cm_id->provider_data = iw_event->provider_data; > > + cm_id->local_addr = iw_event->local_addr; > > + cm_id->remote_addr = iw_event->remote_addr; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + cm_id_priv->state = IW_CM_STATE_CONN_RECV; > > + > > + /* Call the client CM handler */ > > + ret = cm_id->cm_handler(cm_id, iw_event); > > + if (ret) { > > + printk("destroying child id %p, ret=%d\n", > > + cm_id, ret); > > We probably don't always want to print a message here. > Yes. I'll change this to a pr_debug(). > > + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); > > + destroy_cm_id(cm_id); > > + if (atomic_read(&cm_id_priv->refcount)==0) > > + kfree(cm_id); > > + } > > +} > > + > > +/* > > + * Passive Side: CM_ID <-- ESTABLISHED > > + * > > + * The provider generated an ESTABLISHED event which means that > > + * the MPA negotion has completed successfully and we are now in MPA > > + * FPDU mode. > > + * > > + * This event can only be received in the CONN_RECV state. If the > > + * remote peer closed, the ESTABLISHED event would be received followed > > + * by the CLOSE event. If the app closes, it will block until we wake > > + * it up after processing this event. 
> > + */ > > +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + > > + /* We clear the CONNECT_WAIT bit here to allow the callback > > + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id > > + * from a callback handler is not allowed */ > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_CONN_RECV: > > + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); > > + break; > > + default: > > + BUG_ON(1); > > Can just BUG_ON the state and avoid the switch. Same comment applies below. > ok. > > + } > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > + > > +/* > > + * Active Side: CM_ID <-- ESTABLISHED > > + * > > + * The app has called connect and is waiting for the established event to > > + * post it's requests to the server. This event will wake up anyone > > + * blocked in iw_cm_disconnect or iw_destroy_id. 
> > + */ > > +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + /* Clear the connect wait bit so a callback function calling > > + * iw_cm_disconnect will not wait and deadlock this thread */ > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_CONN_SENT: > > + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { > > + cm_id_priv->id.local_addr = iw_event->local_addr; > > + cm_id_priv->id.remote_addr = iw_event->remote_addr; > > + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; > > + } else { > > + /* REJECTED or RESET */ > > + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > > + cm_id_priv->qp = NULL; > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); > > + break; > > + default: > > + BUG_ON(1); > > + } > > + /* Wake up waiters on connect complete */ > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > + > > +/* > > + * CM_ID <-- CLOSING > > + * > > + * If in the ESTABLISHED state, move to CLOSING. > > + */ > > +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) > > + cm_id_priv->state = IW_CM_STATE_CLOSING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > +} > > + > > +/* > > + * CM_ID <-- IDLE > > + * > > + * If in the ESTBLISHED or CLOSING states, the QP will have have been > > + * moved by the provider to the ERR state. Disassociate the CM_ID from > > + * the QP, move to IDLE, and remove the 'connected' reference. 
> > + * > > + * If in some other state, the cm_id was destroyed asynchronously. > > + * This is the last reference that will result in waking up > > + * the app thread blocked in iw_destroy_cm_id. > > + */ > > +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + /* TT */printk("%s:%d cm_id_priv=%p, state=%d\n", > > + __FUNCTION__, __LINE__, > > + cm_id_priv,cm_id_priv->state); > > Will want to remove this. > oops. yes... > - Sean From tom at opengridcomputing.com Thu Jun 1 10:11:58 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 01 Jun 2006 12:11:58 -0500 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. In-Reply-To: <447E1720.7000307@ichips.intel.com> References: <20060531182650.3308.81538.stgit@stevo-desktop> <20060531182652.3308.1244.stgit@stevo-desktop> <447E1720.7000307@ichips.intel.com> Message-ID: <1149181918.18855.23.camel@trinity.ogc.int> On Wed, 2006-05-31 at 15:22 -0700, Sean Hefty wrote: > Steve Wise wrote: > > +/* > > + * Release a reference on cm_id. If the last reference is being removed > > + * and iw_destroy_cm_id is waiting, wake up the waiting thread. > > + */ > > +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) > > +{ > > + int ret = 0; > > + > > + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); > > + if (atomic_dec_and_test(&cm_id_priv->refcount)) { > > + BUG_ON(!list_empty(&cm_id_priv->work_list)); > > + if (waitqueue_active(&cm_id_priv->destroy_wait)) { > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); > > + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, > > + &cm_id_priv->flags)); > > + ret = 1; > > + wake_up(&cm_id_priv->destroy_wait); > > We recently changed the RDMA CM, IB CM, and a couple of other modules from using > wait objects to completions. 
This avoids a race condition between decrementing > the reference count, which allows destruction to proceed, and calling wake_up on > a freed cm_id. My guess is that you may need to do the same. > > Can you also explain the use of the return value here? It's ignored below in > rem_ref() and destroy_cm_id(). > > > +static void add_ref(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + atomic_inc(&cm_id_priv->refcount); > > +} > > + > > +static void rem_ref(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + iwcm_deref_id(cm_id_priv); > > +} > > + > > > +/* > > + * CM_ID <-- CLOSING > > + * > > + * Block if a passive or active connection is currenlty being processed. Then > > + * process the event as follows: > > + * - If we are ESTABLISHED, move to CLOSING and modify the QP state > > + * based on the abrupt flag > > + * - If the connection is already in the CLOSING or IDLE state, the peer is > > + * disconnecting concurrently with us and we've already seen the > > + * DISCONNECT event -- ignore the request and return 0 > > + * - Disconnect on a listening endpoint returns -EINVAL > > + */ > > +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret = 0; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + /* Wait if we're currently in a connect or accept downcall */ > > + wait_event(cm_id_priv->connect_wait, > > + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > > Am I understanding this check correctly? You're checking to see if the user has > called iw_cm_disconnect() at the same time that they called iw_cm_connect() or > iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an > event? Yes. 
The application (or the case I saw was user-mode exit logic after ctrl-C) cleaning up at random times relative to connection establishment. > > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_ESTABLISHED: > > + cm_id_priv->state = IW_CM_STATE_CLOSING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { /* QP could be for user-mode client */ > > + if (abrupt) > > + ret = iwcm_modify_qp_err(cm_id_priv->qp); > > + else > > + ret = iwcm_modify_qp_sqd(cm_id_priv->qp); > > + /* > > + * If both sides are disconnecting the QP could > > + * already be in ERR or SQD states > > + */ > > + ret = 0; > > + } > > + else > > + ret = -EINVAL; > > + break; > > + case IW_CM_STATE_LISTEN: > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = -EINVAL; > > + break; > > + case IW_CM_STATE_CLOSING: > > + /* remote peer closed first */ > > + case IW_CM_STATE_IDLE: > > + /* accept or connect returned !0 */ > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_RECV: > > + /* > > + * App called disconnect before/without calling accept after > > + * connect_request event delivered. > > + */ > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_SENT: > > + /* Can only get here if wait above fails */ > > + default: > > + BUG_ON(1); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_disconnect); > > +static void destroy_cm_id(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + /* Wait if we're currently in a connect or accept downcall. A > > + * listening endpoint should never block here. */ > > + wait_event(cm_id_priv->connect_wait, > > + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > > Same question/comment as above. 
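The IWCM_F_CONNECT_WAIT gate that both of these paths rely on can be sketched in plain userspace C, with pthreads standing in for the kernel's wait queue. All names below are illustrative, not taken from the patch: connect/accept set a flag around the provider downcall, and disconnect/destroy block until it clears.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the IWCM_F_CONNECT_WAIT pattern: while a
 * connect/accept downcall is in flight, disconnect/destroy must wait. */
struct cm_gate {
    pthread_mutex_t lock;
    pthread_cond_t  connect_wait;   /* kernel: wait_event/wake_up_all */
    bool            in_downcall;    /* kernel: IWCM_F_CONNECT_WAIT bit */
};

static void gate_init(struct cm_gate *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->connect_wait, NULL);
    g->in_downcall = false;
}

/* connect/accept path: set the flag before the provider downcall */
static void gate_enter(struct cm_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->in_downcall = true;
    pthread_mutex_unlock(&g->lock);
}

/* event or error path: clear the flag and wake all waiters */
static void gate_exit(struct cm_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->in_downcall = false;
    pthread_cond_broadcast(&g->connect_wait);
    pthread_mutex_unlock(&g->lock);
}

/* disconnect/destroy path: block while a downcall is in flight */
static void gate_wait(struct cm_gate *g)
{
    pthread_mutex_lock(&g->lock);
    while (g->in_downcall)
        pthread_cond_wait(&g->connect_wait, &g->lock);
    pthread_mutex_unlock(&g->lock);
}
```

This also makes the deadlock concern in the handlers concrete: a callback that calls the disconnect path (gate_wait) from inside the downcall would block forever unless the flag is cleared first, which is exactly why the event handlers clear CONNECT_WAIT before invoking the client callback.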
> > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_LISTEN: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + /* destroy the listening endpoint */ > > + ret = cm_id->device->iwcm->destroy_listen(cm_id); > > + break; > > + case IW_CM_STATE_ESTABLISHED: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + /* Abrupt close of the connection */ > > + (void)iwcm_modify_qp_err(cm_id_priv->qp); > > + break; > > + case IW_CM_STATE_IDLE: > > + case IW_CM_STATE_CLOSING: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_RECV: > > + /* > > + * App called destroy before/without calling accept after > > + * receiving connection request event notification. > > + */ > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_SENT: > > + case IW_CM_STATE_DESTROYING: > > + default: > > + BUG_ON(1); > > + break; > > + } > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > As an alternative, you could hold the lock from above, an let the LISTEN / > ESTABLISHED state checks release and reacquire. > > > + if (cm_id_priv->qp) { > > + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + (void)iwcm_deref_id(cm_id_priv); > > +} > > + > > +/* > > + * This function is only called by the application thread and cannot > > + * be called by the event thread. The function will wait for all > > + * references to be released on the cm_id and then kfree the cm_id > > + * object. 
> > + */ > > +void iw_destroy_cm_id(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); > > + > > + destroy_cm_id(cm_id); > > + > > + wait_event(cm_id_priv->destroy_wait, > > + !atomic_read(&cm_id_priv->refcount)); > > + > > + kfree(cm_id_priv); > > +} > > +EXPORT_SYMBOL(iw_destroy_cm_id); > > + > > +/* > > + * CM_ID <-- LISTEN > > + * > > + * Start listening for connect requests. Generates one CONNECT_REQUEST > > + * event for each inbound connect request. > > + */ > > +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret = 0; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_IDLE: > > + cm_id_priv->state = IW_CM_STATE_LISTEN; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); > > + if (ret) > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + break; > > + default: > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = -EINVAL; > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_listen); > > + > > +/* > > + * CM_ID <-- IDLE > > + * > > + * Rejects an inbound connection request. No events are generated. 
> > + */ > > +int iw_cm_reject(struct iw_cm_id *cm_id, > > + const void *private_data, > > + u8 private_data_len) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->reject(cm_id, private_data, > > + private_data_len); > > + > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_reject); > > + > > +/* > > + * CM_ID <-- ESTABLISHED > > + * > > + * Accepts an inbound connection request and generates an ESTABLISHED > > + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block > > + * until the ESTABLISHED event is received from the provider. > > + */ > > This makes it sound like we're just waiting for an event. 
> > > +int iw_cm_accept(struct iw_cm_id *cm_id, > > + struct iw_cm_conn_param *iw_param) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + struct ib_qp *qp; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + /* Get the ib_qp given the QPN */ > > + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); > > + if (!qp) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + return -EINVAL; > > + } > > + cm_id->device->iwcm->add_ref(qp); > > + cm_id_priv->qp = qp; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->accept(cm_id, iw_param); > > + if (ret) { > > + /* An error on accept precludes provider events */ > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { > > + cm_id->device->iwcm->rem_ref(qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + printk("Accept failed, ret=%d\n", ret); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_accept); > > + > > +/* > > + * Active Side: CM_ID <-- CONN_SENT > > + * > > + * If successful, results in the generation of a CONNECT_REPLY > > + * event. iw_cm_disconnect and iw_cm_destroy will block until the > > + * CONNECT_REPLY event is received from the provider. 
> > + */ > > +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + int ret = 0; > > + unsigned long flags; > > + struct ib_qp *qp; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_IDLE) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + > > + /* Get the ib_qp given the QPN */ > > + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); > > + if (!qp) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + return -EINVAL; > > + } > > + cm_id->device->iwcm->add_ref(qp); > > + cm_id_priv->qp = qp; > > + cm_id_priv->state = IW_CM_STATE_CONN_SENT; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->connect(cm_id, iw_param); > > + if (ret) { > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { > > + cm_id->device->iwcm->rem_ref(qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + printk("Connect failed, ret=%d\n", ret); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_connect); > > + > > +/* > > + * Passive Side: new CM_ID <-- CONN_RECV > > + * > > + * Handles an inbound connect request. The function creates a new > > + * iw_cm_id to represent the new connection and inherits the client > > + * callback function and other attributes from the listening parent. 
> > + * > > + * The work item contains a pointer to the listen_cm_id and the event. The > > + * listen_cm_id contains the client cm_handler, context and > > + * device. These are copied when the device is cloned. The event > > + * contains the new four tuple. > > + * > > + * An error on the child should not affect the parent, so this > > + * function does not return a value. > > + */ > > +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + struct iw_cm_id *cm_id; > > + struct iwcm_id_private *cm_id_priv; > > + int ret; > > + > > + /* The provider should never generate a connection request > > + * event with a bad status. > > + */ > > + BUG_ON(iw_event->status); > > + > > + /* We could be destroying the listening id. If so, ignore this > > + * upcall. */ > > + spin_lock_irqsave(&listen_id_priv->lock, flags); > > + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { > > + spin_unlock_irqrestore(&listen_id_priv->lock, flags); > > + return; > > + } > > + spin_unlock_irqrestore(&listen_id_priv->lock, flags); > > + > > + cm_id = iw_create_cm_id(listen_id_priv->id.device, > > + listen_id_priv->id.cm_handler, > > + listen_id_priv->id.context); > > + /* If the cm_id could not be created, ignore the request */ > > + if (IS_ERR(cm_id)) > > + return; > > + > > + cm_id->provider_data = iw_event->provider_data; > > + cm_id->local_addr = iw_event->local_addr; > > + cm_id->remote_addr = iw_event->remote_addr; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + cm_id_priv->state = IW_CM_STATE_CONN_RECV; > > + > > + /* Call the client CM handler */ > > + ret = cm_id->cm_handler(cm_id, iw_event); > > + if (ret) { > > + printk("destroying child id %p, ret=%d\n", > > + cm_id, ret); > > We probably don't always want to print a message here. 
> > > + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); > > + destroy_cm_id(cm_id); > > + if (atomic_read(&cm_id_priv->refcount)==0) > > + kfree(cm_id); > > + } > > +} > > + > > +/* > > + * Passive Side: CM_ID <-- ESTABLISHED > > + * > > + * The provider generated an ESTABLISHED event which means that > > + * the MPA negotion has completed successfully and we are now in MPA > > + * FPDU mode. > > + * > > + * This event can only be received in the CONN_RECV state. If the > > + * remote peer closed, the ESTABLISHED event would be received followed > > + * by the CLOSE event. If the app closes, it will block until we wake > > + * it up after processing this event. > > + */ > > +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + > > + /* We clear the CONNECT_WAIT bit here to allow the callback > > + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id > > + * from a callback handler is not allowed */ > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_CONN_RECV: > > + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); > > + break; > > + default: > > + BUG_ON(1); > > Can just BUG_ON the state and avoid the switch. Same comment applies below. > > > + } > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > + > > +/* > > + * Active Side: CM_ID <-- ESTABLISHED > > + * > > + * The app has called connect and is waiting for the established event to > > + * post it's requests to the server. This event will wake up anyone > > + * blocked in iw_cm_disconnect or iw_destroy_id. 
> > + */ > > +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + /* Clear the connect wait bit so a callback function calling > > + * iw_cm_disconnect will not wait and deadlock this thread */ > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_CONN_SENT: > > + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { > > + cm_id_priv->id.local_addr = iw_event->local_addr; > > + cm_id_priv->id.remote_addr = iw_event->remote_addr; > > + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; > > + } else { > > + /* REJECTED or RESET */ > > + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > > + cm_id_priv->qp = NULL; > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); > > + break; > > + default: > > + BUG_ON(1); > > + } > > + /* Wake up waiters on connect complete */ > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > + > > +/* > > + * CM_ID <-- CLOSING > > + * > > + * If in the ESTABLISHED state, move to CLOSING. > > + */ > > +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) > > + cm_id_priv->state = IW_CM_STATE_CLOSING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > +} > > + > > +/* > > + * CM_ID <-- IDLE > > + * > > + * If in the ESTBLISHED or CLOSING states, the QP will have have been > > + * moved by the provider to the ERR state. Disassociate the CM_ID from > > + * the QP, move to IDLE, and remove the 'connected' reference. 
> > + * > > + * If in some other state, the cm_id was destroyed asynchronously. > > + * This is the last reference that will result in waking up > > + * the app thread blocked in iw_destroy_cm_id. > > + */ > > +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + /* TT */printk("%s:%d cm_id_priv=%p, state=%d\n", > > + __FUNCTION__, __LINE__, > > + cm_id_priv,cm_id_priv->state); > > Will want to remove this. > > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Jun 1 10:48:38 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Jun 2006 20:48:38 +0300 Subject: [openib-general] [PATCH TRIVIAL] opensm: fix comment in osm_matrix.h Message-ID: <20060601174838.GA12872@sashak.voltaire.com> This fixes the function description comment in osm_matrix.h Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_matrix.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/osm/include/opensm/osm_matrix.h b/osm/include/opensm/osm_matrix.h index c6c5107..0903708 100644 --- a/osm/include/opensm/osm_matrix.h +++ b/osm/include/opensm/osm_matrix.h @@ -321,7 +321,7 @@ osm_lid_matrix_get_num_ports( * osm_lid_matrix_get_least_hops * * DESCRIPTION -* Returns the number of ports in this lid matrix. +* Returns the least number of hops for specified lid * * SYNOPSIS */ @@ -345,7 +345,7 @@ osm_lid_matrix_get_least_hops( * [in] LID (host order) for which to retrieve the shortest hop count. * * RETURN VALUES -* Returns the number of ports in this lid matrix. 
+* Returns the least number of hops for specified lid * * NOTES * From sashak at voltaire.com Thu Jun 1 11:09:49 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Jun 2006 21:09:49 +0300 Subject: [openib-general] QoS RFC - Resend using a friendly mailer In-Reply-To: <20060530224917.GM29770@esmail.cup.hp.com> References: <20060530190936.GD21212@sashak.voltaire.com> <20060530224917.GM29770@esmail.cup.hp.com> Message-ID: <20060601180949.GB14883@sashak.voltaire.com> On 15:49 Tue 30 May , Grant Grundler wrote: > On Tue, May 30, 2006 at 10:09:36PM +0300, Sasha Khapyorsky wrote: > > > XML style syntax is provided for the policy file. > > > > Why XML? It is not too much readable and writable (by human) format. > > It is human readable and very portable. > An example is here: > http://svn.gnumonks.org/trunk/mmio_test/mmio_test.xml Yes it is readable, but for many people it is _less_ readable and even _less_ writable than "plain" text. > And GPL libraries can parse XML. It is true, but currently we have "portability" complaints even against using libpthread. Sasha > So the new code is fairly short: > http://svn.gnumonks.org/trunk/mmio_test/xmlin.c > > hth, > grant From sashak at voltaire.com Thu Jun 1 11:51:03 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Jun 2006 21:51:03 +0300 Subject: [openib-general] QoS RFC - Resend using a friendly mailer In-Reply-To: References: Message-ID: <20060601185103.GC14883@sashak.voltaire.com> Hi Eitan, Some more comments related to OpenSM.
On 17:53 Tue 30 May , Eitan Zahavi wrote: > > 9. OpenSM features > ------------------- > The QoS related functionality to be provided by OpenSM can be split into two > main parts: > > 3.1. Fabric Setup > During fabric initialization the SM should parse the policy and apply its > settings to the discovered fabric elements. The following actions should be > performed: > * Parsing of policy > * Node Group identification. Warning should be provided for each node not > specified but found. > * SL2VL settings validation should be checked: > + A warning will be provided if there are no matching targets for the SL2VL > setting statement. > + An error message will be printed to the log file if an invalid setting is > found. A setting is invalid if it refers to: > - Non existing port numbers of the target devices > - Unsupported VLs for the target device. In the later case the map to non > existing VLs should be replaced to VL15 i.e. packets will be dropped. I'm not sure that mapping unsupported VLs to VL15 is the best option. If SL2VL is specified per port group, this may mean that at least in the "generic" case all group members should have similar physical capabilities, or else the "reliable" part of the SLs will be limited by the lowest VLCap in the group (other SLs will simply be dropped somewhere). In the current SL2VL mapping implementation we use the following rule to replace unsupported VLs: (new VL) = (requested VL) % (operational data VLs) This may have some disadvantages too, but I think it is generally "safer". Also, I guess that by "unsupported VLs" you mean unsupported or non-configured VLs.
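The modulo replacement rule quoted above — (new VL) = (requested VL) % (operational data VLs) — can be sketched in a few lines. This helper is hypothetical, written for illustration rather than taken from OpenSM, and the VL15 pass-through is an assumption based on VL15 being the always-present management VL:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, NOT OpenSM code: fold a requested VL that the
 * port may not support onto one of its operational data VLs using
 * (new VL) = (requested VL) % (operational data VLs). */
static uint8_t remap_vl(uint8_t requested_vl, uint8_t op_data_vls)
{
	if (requested_vl == 15)		/* management VL, always supported */
		return 15;
	return requested_vl % op_data_vls;	/* wrap into supported range */
}
```

With 4 operational data VLs, for example, a request for VL5 lands on VL1 instead of being mapped to VL15 and dropped.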
> * SL2VL setting is to be performed > * VL Arbitration table settings should be validated according to the following > rules: > + A warning will be provided if there are no matching targets for the setting > statement > + An error will be provided if the port number exceeds the target ports > + An error will be generated if the table length exceeds device capabilities > + An warning will be generated if the table quote a VL that is not supported > by the target device Should there be a replacement rule for unsupported VLs? The IBTA spec (v.1, p.190, l.14) states that an entry with an unsupported VL may be skipped _OR_ "trusted" to another (supported) VL. I think that if we don't handle replacement of unsupported VLs, there may be a hole for "device/vendor dependent" behavior. Sasha
From iod00d at hp.com Thu Jun 1 12:07:45 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 1 Jun 2006 12:07:45 -0700 Subject: [openib-general] QoS RFC - Resend using a friendly mailer In-Reply-To: <20060601180949.GB14883@sashak.voltaire.com> References: <20060530190936.GD21212@sashak.voltaire.com> <20060530224917.GM29770@esmail.cup.hp.com> <20060601180949.GB14883@sashak.voltaire.com> Message-ID: <20060601190745.GA7670@esmail.cup.hp.com> On Thu, Jun 01, 2006 at 09:09:49PM +0300, Sasha Khapyorsky wrote: > On 15:49 Tue 30 May , Grant Grundler wrote: > > On Tue, May 30, 2006 at 10:09:36PM +0300, Sasha Khapyorsky wrote: > > > > XML style syntax is provided for the policy file. > > > > > > Why XML? It is not too much readable and writable (by human) format. > > > > It is human readable and very portable. > > An example is here: > > http://svn.gnumonks.org/trunk/mmio_test/mmio_test.xml > > Yes it is readable, but for many people it is _less_ readable and even > _less_ writable than "plain" text. This might be a good starting point for "many people": http://ahds.ac.uk/creating/information-papers/xml-editors/ I tried conglomerate (debian) and it doesn't like mmio_test.xml for some reason.
But I suppose that could be fixed. Anyway, my point is there is no shortage of GUIs to edit XML files and verify syntactical correctness. hth, grant From rdreier at cisco.com Thu Jun 1 13:05:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 13:05:56 -0700 Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute In-Reply-To: (Or Gerlitz's message of "Tue, 30 May 2006 09:22:59 +0300 (IDT)") References: Message-ID: I had a chance to look at this, and I don't believe it is precisely correct for mem-free HCAs with the current FMR implementation. > + /* on memfull HCA an FMR can be remapped 2^B - 1 times where B < 32 is > + * the number of bits which are not used for MPT addressing, on memfree > + * HCA B=8 so an FMR can be remapped 255 times. > + */ > + if(!mthca_is_memfree(mdev)) > + props->max_map_per_fmr = (1 << (32 - > + long_log2(mdev->limits.num_mpts))) - 1; > + else > + props->max_map_per_fmr = (1 << 8) - 1; Look at mthca_arbel_map_phys_fmr(). The question is how often key will repeat after being indexed, and when MTHCA_FLAG_SINAI_OPT is not set, then the same increment is used in the mem-free case as in the Tavor case. So I think the code I quoted should really be: if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) props->max_map_per_fmr = (1 << (32 - long_log2(mdev->limits.num_mpts))) - 1; else props->max_map_per_fmr = (1 << 8) - 1; Do you agree? If so I can fix this patch up myself and apply it. - R. From mshefty at ichips.intel.com Thu Jun 1 14:09:12 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 01 Jun 2006 14:09:12 -0700 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. 
In-Reply-To: <1149181233.31610.34.camel@stevo-desktop> References: <20060531182650.3308.81538.stgit@stevo-desktop> <20060531182652.3308.1244.stgit@stevo-desktop> <447E1720.7000307@ichips.intel.com> <1149181233.31610.34.camel@stevo-desktop> Message-ID: <447F5778.6010202@ichips.intel.com> Steve Wise wrote: >>>+int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) >>>+{ >>>+ struct iwcm_id_private *cm_id_priv; >>>+ unsigned long flags; >>>+ int ret = 0; >>>+ >>>+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); >>>+ /* Wait if we're currently in a connect or accept downcall */ >>>+ wait_event(cm_id_priv->connect_wait, >>>+ !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); >> >>Am I understanding this check correctly? You're checking to see if the user has >>called iw_cm_disconnect() at the same time that they called iw_cm_connect() or >>iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an >>event? > > > The CM must wait for the low level provider to finish a connect() or > accept() operation before telling the low level provider to disconnect > via modifying the iwarp QP. Regardless of whether they block, this > disconnect can happen concurrently with the connect/accept so we need to > hold the disconnect until the connect/accept completes. > > >>>+EXPORT_SYMBOL(iw_cm_disconnect); >>>+static void destroy_cm_id(struct iw_cm_id *cm_id) >>>+{ >>>+ struct iwcm_id_private *cm_id_priv; >>>+ unsigned long flags; >>>+ int ret; >>>+ >>>+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); >>>+ /* Wait if we're currently in a connect or accept downcall. A >>>+ * listening endpoint should never block here. */ >>>+ wait_event(cm_id_priv->connect_wait, >>>+ !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); >> >>Same question/comment as above. >> > > > Same answer. 
There's a difference between trying to handle the user calling disconnect/destroy at the same time a call to accept/connect is active, versus the user calling disconnect/destroy after accept/connect have returned. In the latter case, I think you're fine. In the first case, this is allowing a user to call destroy at the same time that they're calling accept/connect. Additionally, there's no guarantee that the F_CONNECT_WAIT flag has been set by accept/connect by the time disconnect/destroy tests it. - Sean From gaoq at cse.ohio-state.edu Thu Jun 1 14:44:01 2006 From: gaoq at cse.ohio-state.edu (Qi Gao) Date: Thu, 1 Jun 2006 17:44:01 -0400 Subject: [openib-general] EINTR in ibv_get_cq_event Message-ID: <007501c685c4$86931c30$0763a8c0@Brunhild> Hi, I'm trying to use the ibv_get_cq_event, and I see the following behavior: This is my code: ---------- ret = ibv_get_cq_event(cm_ud_comp_ch, &ev_cq, &ev_ctx); if (ret) { fprintf(stderr, "Failed to get cq_event: %d\n", ret); perror("ibv_get_cq_event"); } ---------- Most times it's OK, but sometimes I see: ---------- Failed to get cq_event: -1 ibv_get_cq_event: Interrupted system call ---------- Could someone tell me what may be happening? Thanks, Qi From faulkner at opengridcomputing.com Thu Jun 1 15:06:05 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Thu, 1 Jun 2006 17:06:05 -0500 Subject: [openib-general] [PATCH] librdmacm: ucma_init reads past end of device_list Message-ID: <200606011706.05383.faulkner@opengridcomputing.com> The code currently in place seems to expect there to be a null element at the end of the dev_list to trigger the end of the loop. ibv_get_device_list does not provide such an entry, but the number of entries is available. This patch retrieves that number and loops based on it. If ibv_get_device_list should return a list with a null element at the end, then it is not working correctly. This patch will work with either of the possible intended behaviors of ibv_get_device_list. 
Fix spelling of "liste". Index: cma.c =================================================================== --- cma.c (revision 7568) +++ cma.c (working copy) @@ -183,6 +183,7 @@ static int ucma_init(void) { int i; + int num_devices; struct cma_device *cma_dev; struct ibv_device_attr attr; int ret; @@ -201,14 +202,14 @@ goto err; } - dev_list = ibv_get_device_list(NULL); + dev_list = ibv_get_device_list(&num_devices); if (!dev_list) { - printf("CMA: unable to get RDMA device liste\n"); + printf("CMA: unable to get RDMA device list\n"); ret = -ENODEV; goto err; } - for (i = 0; dev_list[i]; ++i) { + for (i = 0; i < num_devices; ++i) { cma_dev = malloc(sizeof *cma_dev); if (!cma_dev) { ret = -ENOMEM; From tom at opengridcomputing.com Thu Jun 1 15:21:16 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 01 Jun 2006 17:21:16 -0500 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. In-Reply-To: <447F5778.6010202@ichips.intel.com> References: <20060531182650.3308.81538.stgit@stevo-desktop> <20060531182652.3308.1244.stgit@stevo-desktop> <447E1720.7000307@ichips.intel.com> <1149181233.31610.34.camel@stevo-desktop> <447F5778.6010202@ichips.intel.com> Message-ID: <1149200476.18855.83.camel@trinity.ogc.int> On Thu, 2006-06-01 at 14:09 -0700, Sean Hefty wrote: > Steve Wise wrote: > >>>+int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) > >>>+{ > >>>+ struct iwcm_id_private *cm_id_priv; > >>>+ unsigned long flags; > >>>+ int ret = 0; > >>>+ > >>>+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > >>>+ /* Wait if we're currently in a connect or accept downcall */ > >>>+ wait_event(cm_id_priv->connect_wait, > >>>+ !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > >> > >>Am I understanding this check correctly? You're checking to see if the user has > >>called iw_cm_disconnect() at the same time that they called iw_cm_connect() or > >>iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an > >>event? 
> > > > > > The CM must wait for the low level provider to finish a connect() or > > accept() operation before telling the low level provider to disconnect > > via modifying the iwarp QP. Regardless of whether they block, this > > disconnect can happen concurrently with the connect/accept so we need to > > hold the disconnect until the connect/accept completes. > > > > > >>>+EXPORT_SYMBOL(iw_cm_disconnect); > >>>+static void destroy_cm_id(struct iw_cm_id *cm_id) > >>>+{ > >>>+ struct iwcm_id_private *cm_id_priv; > >>>+ unsigned long flags; > >>>+ int ret; > >>>+ > >>>+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > >>>+ /* Wait if we're currently in a connect or accept downcall. A > >>>+ * listening endpoint should never block here. */ > >>>+ wait_event(cm_id_priv->connect_wait, > >>>+ !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > >> > >>Same question/comment as above. > >> > > > > > > Same answer. > > There's a difference between trying to handle the user calling > disconnect/destroy at the same time a call to accept/connect is active, versus > the user calling disconnect/destroy after accept/connect have returned. In the > latter case, I think you're fine. In the first case, this is allowing a user to > call destroy at the same time that they're calling accept/connect. > Additionally, there's no guarantee that the F_CONNECT_WAIT flag has been set by > accept/connect by the time disconnect/destroy tests it. The problem is that we can't synchronously cancel an outstanding connect request. Once we've asked the adapter to connect, we can't tell him to stop, we have to wait for it to fail. During the time period between when we ask to connect and the adapter says yeah-or-nay, the user hits ctrl-C. This is the case where disconnect and/or destroy gets called and we have to block it waiting for the outstanding connect request to complete. One alternative to this approach is to do the kfree of the cm_id in the deref logic. 
This was the original design and leaves the object around to handle the completion of the connect and still allows the app to clean up and go away without all this waitin' around. When the adapter finally finishes and releases its reference, the object is kfree'd. Hope this helps. > > - Sean
From caitlinb at broadcom.com Thu Jun 1 15:28:24 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 1 Jun 2006 15:28:24 -0700 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. Message-ID: <54AD0F12E08D1541B826BE97C98F99F150D3E6@NT-SJCA-0751.brcm.ad.broadcom.com> >> >> There's a difference between trying to handle the user calling >> disconnect/destroy at the same time a call to accept/connect is >> active, versus the user calling disconnect/destroy after >> accept/connect have returned. In the latter case, I think you're >> fine. In the first case, this is allowing a user to call > destroy at the same time that they're calling accept/connect. >> Additionally, there's no guarantee that the F_CONNECT_WAIT flag has >> been set by accept/connect by the time disconnect/destroy tests it. > > The problem is that we can't synchronously cancel an > outstanding connect request. Once we've asked the adapter to > connect, we can't tell him to stop, we have to wait for it to > fail. During the time period between when we ask to connect > and the adapter says yeah-or-nay, the user hits ctrl-C. This > is the case where disconnect and/or destroy gets called and > we have to block it waiting for the outstanding connect > request to complete. > > One alternative to this approach is to do the kfree of the > cm_id in the deref logic.
This was the original design and > leaves the object around to handle the completion of the > connect and still allows the app to clean up and go away > without all this waitin' around. When the adapter finally > finishes and releases it's reference, the object is kfree'd. > > Hope this helps. > Why couldn't you synchronously put the cm_id in a state of "pending delete" and do the actual delete when the RNIC provides a response to the request? There could even be an optional method to see if the device is capable of cancelling the request. I know it can't yank a SYN back from the wire, but it could refrain from retransmitting. From mshefty at ichips.intel.com Thu Jun 1 15:34:43 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 01 Jun 2006 15:34:43 -0700 Subject: [openib-general] Re: [PATCH] librdmacm: ucma_init reads past end of device_list In-Reply-To: <200606011706.05383.faulkner@opengridcomputing.com> References: <200606011706.05383.faulkner@opengridcomputing.com> Message-ID: <447F6B83.6000902@ichips.intel.com> Boyd R. Faulkner wrote: > The code currently in place seems to expect there to be a null element at the > end of the dev_list to trigger the end of the loop. ibv_get_device_list > does not provide such an entry, but the number of entries is > available. This patch retrieves that number and loops based on it. > If ibv_get_device_list should return a list with a null element at the end, > then it is not working correctly. This patch will work with either of the > possible intended behaviors of ibv_get_device_list. > > Fix spelling of "liste". Thanks - can you please send a signed-off-by line? 
- Sean
From robert.j.woodruff at intel.com Thu Jun 1 15:40:02 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 1 Jun 2006 15:40:02 -0700 Subject: [openib-general] [PATCH] ipathverbs.c fails to compile on svn 7568 or on the ofed 1.0 branch Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007DC4264@orsmsx408> I ran into a compile problem with userspace/libipathverbs/src/ipathverbs.c This patch fixes the compile problem. --- ipathverbs.c 2006-06-01 14:56:46.000000000 -0700 +++ ipathverbs.new.c 2006-06-01 14:54:48.000000000 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include #include "ipathverbs.h" -------------- next part -------------- An HTML attachment was scrubbed... URL:
From rdreier at cisco.com Thu Jun 1 15:56:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 15:56:58 -0700 Subject: [openib-general] EINTR in ibv_get_cq_event In-Reply-To: <007501c685c4$86931c30$0763a8c0@Brunhild> (Qi Gao's message of "Thu, 1 Jun 2006 17:44:01 -0400") References: <007501c685c4$86931c30$0763a8c0@Brunhild> Message-ID: Qi> Could someone tell me what may be happening? Your process is getting a signal that interrupts the underlying read() system call. - R.
From rdreier at cisco.com Thu Jun 1 15:59:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 15:59:11 -0700 Subject: [openib-general] [PATCH] librdmacm: ucma_init reads past end of device_list In-Reply-To: <200606011706.05383.faulkner@opengridcomputing.com> (Boyd R. Faulkner's message of "Thu, 1 Jun 2006 17:06:05 -0500") References: <200606011706.05383.faulkner@opengridcomputing.com> Message-ID: Boyd> The code currently in place seems to expect there to be a Boyd> null element at the end of the dev_list to trigger the end Boyd> of the loop. ibv_get_device_list does not provide such an Boyd> entry, but the number of entries is available.
This patch Boyd> retrieves that number and loops based on it. If Boyd> ibv_get_device_list should return a list with a null element Boyd> at the end, then it is not working correctly. This patch Boyd> will work with either of the possible intended behaviors of Boyd> ibv_get_device_list. This is definitely a bug in libibverbs -- I clearly wrote * ibv_get_device_list - Get list of IB devices currently available * @num_devices: optional. if non-NULL, set to the number of devices * returned in the array. * * Return a NULL-terminated array of IB devices. The array can be * released with ibv_free_device_list(). so I intended to return a NULL-terminated array. I'll fix libibverbs up. - R. From rdreier at cisco.com Thu Jun 1 16:00:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 16:00:50 -0700 Subject: [openib-general] Re: [openfabrics-ewg] [PATCH] ipathverbs.c fails to compile on svn 7568 or on the ofed 1.0 branch In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007DC4264@orsmsx408> (Robert J. Woodruff's message of "Thu, 1 Jun 2006 15:40:02 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007DC4264@orsmsx408> Message-ID: > I ran into a compile problem with > userspace/libipathverbs/src/ipathverbs.c > > This patch fixes the compile problem. > > --- ipathverbs.c 2006-06-01 14:56:46.000000000 -0700 > +++ ipathverbs.new.c 2006-06-01 14:54:48.000000000 -0700 > @@ -41,6 +41,7 @@ > #include > #include > #include > +#include > > #include "ipathverbs.h" I don't think there's much point in this, since the resulting library won't actually work with the libibverbs 1.1 development tree anyway. Just build against libibverbs-1.0 until libipathverbs is fixed to work with development libibverbs versions. - R. 
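Roland's stated contract — ibv_get_device_list() returns a NULL-terminated array and also reports the count through its optional num_devices out-parameter — means Boyd's counted loop and the original sentinel loop visit exactly the same entries. A minimal sketch of the sentinel-style walk, using a stand-in string array (the entries in the usage example are made up) so it runs without libibverbs or RDMA hardware:

```c
#include <assert.h>
#include <stddef.h>

/* Count entries in a NULL-terminated array, the way a caller may walk
 * the array returned by ibv_get_device_list().  A counted loop bounded
 * by num_devices and this sentinel loop agree when the contract holds. */
static int count_null_terminated(const char **list)
{
	int n = 0;
	while (list[n])		/* stop at the NULL sentinel */
		++n;
	return n;
}
```

Against the real API this corresponds to `for (i = 0; dev_list[i]; ++i)` after `dev_list = ibv_get_device_list(&num_devices);`, with `ibv_free_device_list(dev_list)` to release the array when done.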
From bos at pathscale.com Thu Jun 1 16:27:48 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 01 Jun 2006 16:27:48 -0700 Subject: [openib-general] [PATCH 5/5] IB/ipath: Add client reregister event generation In-Reply-To: <20060531223218.10506.76076.stgit@localhost.localdomain> References: <20060531223205.10506.51241.stgit@localhost.localdomain> <20060531223218.10506.76076.stgit@localhost.localdomain> Message-ID: <1149204468.16993.8.camel@localhost.localdomain> On Wed, 2006-05-31 at 15:32 -0700, Roland Dreier wrote: > Generate client reregister event instead of LID change event when > client reregister bit is set. Please CC me on ipath driver patches, as I'm not guaranteed to see them otherwise. The code currently in place seems to expect there to be a null element at the end of the dev_list to trigger the end of the loop. ibv_get_device_list does not provide such an entry, but the number of entries is available. This patch retrieves that number and loops based on it. If ibv_get_device_list should return a list with a null element at the end then it is not working correctly. This patch will work with either of the possible intended behaviors of ibv_get_device_list. Roland has said that ibv_get_device_list should return a list with a null element at the end. Fix spelling of "liste".
Signed-off-by: Boyd Faulkner Index: cma.c =================================================================== --- cma.c (revision 7568) +++ cma.c (working copy) @@ -183,6 +183,7 @@ static int ucma_init(void) { int i; + int num_devices; struct cma_device *cma_dev; struct ibv_device_attr attr; int ret; @@ -201,14 +202,14 @@ goto err; } - dev_list = ibv_get_device_list(NULL); + dev_list = ibv_get_device_list(&num_devices); if (!dev_list) { - printf("CMA: unable to get RDMA device liste\n"); + printf("CMA: unable to get RDMA device list\n"); ret = -ENODEV; goto err; } - for (i = 0; dev_list[i]; ++i) { + for (i = 0; i < num_devices; ++i) { cma_dev = malloc(sizeof *cma_dev); if (!cma_dev) { ret = -ENOMEM;
From mashirle at us.ibm.com Thu Jun 1 09:58:36 2006 From: mashirle at us.ibm.com (Shirley Ma) Date: Thu, 01 Jun 2006 09:58:36 -0700 Subject: [openib-general] [PATCH] IPoIB skb panic Message-ID: <1149181116.8085.8.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Roland, I found there are two problems in path_free(), it would cause kernel skb panic. 1. path_free() should dev_kfree_skb_any() (any context) instead of dev_kfree_skb_irq() (irq context) 2. path->queue should be protected by priv->lock since there is a possible race between unicast_send_arp() and ipoib_flush_paths() when bring interface down. It's safe to use priv->lock, because skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3. Here is the patch. Please review it and let me know if there is a problem to apply this patch.
Signed-off-by: Shirley Ma diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-skb/ulp/ipoib/ipoib_main.c --- infiniband/ulp/ipoib/ipoib_main.c 2006-05-03 13:16:18.000000000 -0700 +++ infiniband-skb/ulp/ipoib/ipoib_main.c 2006-06-01 09:14:05.000000000 -0700 @@ -252,11 +252,11 @@ static void path_free(struct net_device struct sk_buff *skb; unsigned long flags; - while ((skb = __skb_dequeue(&path->queue))) - dev_kfree_skb_irq(skb); - spin_lock_irqsave(&priv->lock, flags); + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_any(skb); + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { /* * It's safe to call ipoib_put_ah() inside priv->lock Thanks Shirley Ma IBM LTC
From manpreet at gmail.com Thu Jun 1 18:22:53 2006 From: manpreet at gmail.com (Manpreet Singh) Date: Thu, 1 Jun 2006 18:22:53 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs Message-ID: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> Hi, It seems that the number of outstanding RDMAs that a Mellanox HCA can handle has been configured at 4 (mthca_main.c: default_profile: rdb_per_qp), yet the HCAs can support a much higher value (128, I think). Could we move this value higher or at least make it configurable? Thanks, Manpreet. -------------- next part -------------- An HTML attachment was scrubbed... URL:
From troy at scl.ameslab.gov Thu Jun 1 18:57:08 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Thu, 1 Jun 2006 20:57:08 -0500 Subject: [openib-general] EHCA broken for 2.6.16? Message-ID: <20060602015708.GE18223@scl.ameslab.gov> Okay guys, what's up this time? Kernel 2.6.16.. CC [M] drivers/infiniband/hw/ehca/ehca_main.o In file included from drivers/infiniband/hw/ehca/ehca_qes.h:47, from drivers/infiniband/hw/ehca/ipz_pt_fn.h:46, from drivers/infiniband/hw/ehca/ehca_classes.h:46, from drivers/infiniband/hw/ehca/ehca_main.c:43: drivers/infiniband/hw/ehca/ehca_tools.h: In function 'ehca2ib_return_code': drivers/infiniband/hw/ehca/ehca_tools.h:404: error: 'H_SUCCESS' undeclared (first use in this function) drivers/infiniband/hw/ehca/ehca_tools.h:404: error: (Each undeclared identifier is reported only once drivers/infiniband/hw/ehca/ehca_tools.h:404: error: for each function it appears in.)
drivers/infiniband/hw/ehca/ehca_tools.h:406: error: 'H_BUSY' undeclared (first use in this function) drivers/infiniband/hw/ehca/ehca_tools.h:408: error: 'H_NO_MEM' undeclared (first use in this function) drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_sense_attributes':
From anton at samba.org Thu Jun 1 23:43:46 2006
From: anton at samba.org (Anton Blanchard)
Date: Fri, 2 Jun 2006 16:43:46 +1000
Subject: [openib-general] [PATCH] Fix some compile issues with libehca
Message-ID: <20060602064346.GE1736@krispykreme>

Hi,

Here's a patch to fix some warnings about missing prototypes (memset etc), and one compile error due to libsysfs not being included.

This was also giving a warning when built as 32bit:

	my_cq->ipz_queue.queue = (u8*)resp.ipz_queue.queue;

src/ehca_umain.c:239: warning: cast to pointer from integer of different size

So cast it to a long first. Is that code correct for 32bit?
Anton
---

Index: src/ehca_uinit.c
===================================================================
--- src/ehca_uinit.c	(revision 7621)
+++ src/ehca_uinit.c	(working copy)
@@ -44,6 +44,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -51,6 +52,7 @@
 #include
 #include
 #include
+#include

 #include "ehca_uclasses.h"

Index: src/ehca_umain.c
===================================================================
--- src/ehca_umain.c	(revision 7621)
+++ src/ehca_umain.c	(working copy)
@@ -53,6 +53,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -234,8 +235,8 @@
 	/* copy data returned from kernel */
 	my_cq->cq_number = resp.cq_number;
 	my_cq->token = resp.token;
-	my_cq->ipz_queue.queue = (u8*)resp.ipz_queue.queue;
-	my_cq->ipz_queue.current_q_addr = (u8*)resp.ipz_queue.queue;
+	my_cq->ipz_queue.queue = (u8*)(long)resp.ipz_queue.queue;
+	my_cq->ipz_queue.current_q_addr = (u8*)(long)resp.ipz_queue.queue;
 	my_cq->ipz_queue.qe_size = resp.ipz_queue.qe_size;
 	my_cq->ipz_queue.act_nr_of_sg = resp.ipz_queue.act_nr_of_sg;
 	my_cq->ipz_queue.queue_length = resp.ipz_queue.queue_length;
@@ -321,16 +322,16 @@
 	my_qp->qkey = resp.qkey;
 	my_qp->real_qp_num = resp.real_qp_num;
 	/* rqueue properties */
-	my_qp->ipz_rqueue.queue = (u8*)resp.ipz_rqueue.queue;
-	my_qp->ipz_rqueue.current_q_addr = (u8*)resp.ipz_rqueue.queue;
+	my_qp->ipz_rqueue.queue = (u8*)(long)resp.ipz_rqueue.queue;
+	my_qp->ipz_rqueue.current_q_addr = (u8*)(long)resp.ipz_rqueue.queue;
 	my_qp->ipz_rqueue.qe_size = resp.ipz_rqueue.qe_size;
 	my_qp->ipz_rqueue.act_nr_of_sg = resp.ipz_rqueue.act_nr_of_sg;
 	my_qp->ipz_rqueue.queue_length = resp.ipz_rqueue.queue_length;
 	my_qp->ipz_rqueue.pagesize = resp.ipz_rqueue.pagesize;
 	my_qp->ipz_rqueue.toggle_state = resp.ipz_rqueue.toggle_state;
 	/* squeue properties */
-	my_qp->ipz_squeue.queue = (u8*)resp.ipz_squeue.queue;
-	my_qp->ipz_squeue.current_q_addr = (u8*)resp.ipz_squeue.queue;
+	my_qp->ipz_squeue.queue = (u8*)(long)resp.ipz_squeue.queue;
+	my_qp->ipz_squeue.current_q_addr = (u8*)(long)resp.ipz_squeue.queue;
 	my_qp->ipz_squeue.qe_size = resp.ipz_squeue.qe_size;
 	my_qp->ipz_squeue.act_nr_of_sg = resp.ipz_squeue.act_nr_of_sg;
 	my_qp->ipz_squeue.queue_length = resp.ipz_squeue.queue_length;

From anton at samba.org Thu Jun 1 23:49:24 2006
From: anton at samba.org (Anton Blanchard)
Date: Fri, 2 Jun 2006 16:49:24 +1000
Subject: [openib-general] [PATCH] Fix ipathverbs compile
Message-ID: <20060602064924.GF1736@krispykreme>

Similar to libehca, I had to add a sysfs include to be able to compile it. Am I missing something or is this correct?

Anton
---

Index: src/ipathverbs.c
===================================================================
--- src/ipathverbs.c	(revision 7621)
+++ src/ipathverbs.c	(working copy)
@@ -41,6 +41,7 @@
 #include
 #include
 #include
+#include

 #include "ipathverbs.h"
From HNGUYEN at de.ibm.com Fri Jun 2 01:13:39 2006
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 2 Jun 2006 10:13:39 +0200
Subject: [openib-general] EHCA broken for 2.6.16?
In-Reply-To: <20060602015708.GE18223@scl.ameslab.gov>
Message-ID:

Hi Troy!

Please use kernel 2.6.17-rc1 or later instead: all the former hvcall defines like H_Success were uppercased (to H_SUCCESS) in 2.6.17-rc1, and our code has been built against that version. For details refer to this thread: http://patchwork.ozlabs.org/linuxppc/patch?id=4868

Thanks
Hoang-Nam Nguyen

openib-general-bounces at openib.org wrote on 02.06.2006 03:57:08:

> Okay guys, what's up this time?
>
> Kernel 2.6.16..
>
>   CC [M]  drivers/infiniband/hw/ehca/ehca_main.o
> In file included from drivers/infiniband/hw/ehca/ehca_qes.h:47,
>                  from drivers/infiniband/hw/ehca/ipz_pt_fn.h:46,
>                  from drivers/infiniband/hw/ehca/ehca_classes.h:46,
>                  from drivers/infiniband/hw/ehca/ehca_main.c:43:
> drivers/infiniband/hw/ehca/ehca_tools.h: In function 'ehca2ib_return_code':
> drivers/infiniband/hw/ehca/ehca_tools.h:404: error: 'H_SUCCESS'
> undeclared (first use in this function)
> drivers/infiniband/hw/ehca/ehca_tools.h:404: error: (Each undeclared
> identifier is reported only once
> drivers/infiniband/hw/ehca/ehca_tools.h:404: error: for each function it
> appears in.)
> drivers/infiniband/hw/ehca/ehca_tools.h:406: error: 'H_BUSY' undeclared
> (first use in this function)
> drivers/infiniband/hw/ehca/ehca_tools.h:408: error: 'H_NO_MEM'
> undeclared (first use in this function)
> drivers/infiniband/hw/ehca/ehca_main.c: In function
> 'ehca_sense_attributes':
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From halr at voltaire.com Fri Jun 2 03:29:36 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Jun 2006 06:29:36 -0400
Subject: [openib-general] Re: [PATCH TRIVIAL] opensm: fix comment in osm_matrix.h
In-Reply-To: <20060601174838.GA12872@sashak.voltaire.com>
References: <20060601174838.GA12872@sashak.voltaire.com>
Message-ID: <1149244174.4510.90851.camel@hal.voltaire.com>

On Thu, 2006-06-01 at 13:48, Sasha Khapyorsky wrote:
> This fixes the function description comment in osm_matrix.h
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied to trunk and 1.0 branch.

-- Hal

From HNGUYEN at de.ibm.com Fri Jun 2 03:41:24 2006
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 2 Jun 2006 12:41:24 +0200
Subject: [openib-general] [PATCH] Fix some compile issues with libehca
In-Reply-To: <20060602064346.GE1736@krispykreme>
Message-ID:

Hi,

Will incorporate those patches into our code. They should be correct for both the 64- and 32-bit versions of libehca. Thanks!
Hoang-Nam Nguyen

openib-general-bounces at openib.org wrote on 02.06.2006 08:43:46:

> Hi,
>
> Here's a patch to fix some warnings about missing prototypes (memset
> etc), and one compile error due to libsysfs not being included.
>
> This was also giving a warning when built as 32bit:
>
> 	my_cq->ipz_queue.queue = (u8*)resp.ipz_queue.queue;
>
> src/ehca_umain.c:239: warning: cast to pointer from integer of different size
>
> So cast it to a long first. Is that code correct for 32bit?
>
> Anton

From halr at voltaire.com Fri Jun 2 04:38:32 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Jun 2006 07:38:32 -0400
Subject: [openib-general] [PATCH] Some small fixes in osm_ucast_mgr.c
Message-ID: <1149248311.4510.92726.camel@hal.voltaire.com>

OpenSM/osm_ucast_mgr.c: Small cleanup in terms of dump file

Some small cleanup around removing the old dump file: replace CL_ASSERT() with a debug-unconditional check, and drop the redundant check before freeing.
Signed-off-by: Sasha Khapyorsky
Signed-off-by: Hal Rosenstock
---
 osm/opensm/osm_ucast_mgr.c |   15 ++++-----------
 1 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index cb59a7b..6e0d6c6 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -1148,21 +1148,14 @@ osm_ucast_mgr_process(
     build and download the switch forwarding tables.
   */
 
-  /* initialize the fdb dump file: */
-  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
+  /* remove the old fdb dump file: */
+  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) && (file_name =
+      (char*)cl_malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 10)) )
   {
-    file_name =
-      (char*)cl_malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 10);
-
-    CL_ASSERT(file_name);
-
     strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir);
     strcat(file_name, "/osm.fdbs");
-
     unlink(file_name);
-
-    if (file_name)
-      cl_free(file_name);
+    cl_free(file_name);
   }
 
   cl_qmap_apply_func( p_sw_guid_tbl,

From halr at voltaire.com Fri Jun 2 04:53:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Jun 2006 07:53:40 -0400
Subject: [openib-general] [PATCH] Some small fixes in osm_mcast_mgr.c
Message-ID: <1149248313.4510.92728.camel@hal.voltaire.com>

OpenSM/osm_mcast_mgr.c: Small cleanup in terms of dump file

Some small cleanup around removing the old dump file: replace CL_ASSERT() with a debug-unconditional check, and drop the redundant check before freeing.

Signed-off-by: Hal Rosenstock

Index: opensm/osm_mcast_mgr.c
===================================================================
--- opensm/osm_mcast_mgr.c	(revision 7614)
+++ opensm/osm_mcast_mgr.c	(working copy)
@@ -1466,18 +1466,17 @@ __unlink_mcast_fdb(IN osm_mcast_mgr_t* c
 {
   char *file_name = NULL;
 
+  /* remove the old fdb dump file: */
   file_name = (char*)cl_malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 12);
 
-  CL_ASSERT(file_name);
-
-  strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir);
-  strcat(file_name, "/osm.mcfdbs");
-
-  unlink(file_name);
-
-  if (file_name)
+  if( file_name )
+  {
+    strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir);
+    strcat(file_name, "/osm.mcfdbs");
+    unlink(file_name);
     cl_free(file_name);
+  }
 }
 
 /**********************************************************************

From swise at opengridcomputing.com Fri Jun 2 06:57:44 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 02 Jun 2006 08:57:44 -0500
Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager.
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F150D3E6@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F150D3E6@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <1149256664.791.3.camel@stevo-desktop>

> > The problem is that we can't synchronously cancel an
> > outstanding connect request. Once we've asked the adapter to
> > connect, we can't tell him to stop, we have to wait for it to
> > fail. During the time period between when we ask to connect
> > and the adapter says yeah-or-nay, the user hits ctrl-C. This
> > is the case where disconnect and/or destroy gets called and
> > we have to block it waiting for the outstanding connect
> > request to complete.
> >
> > One alternative to this approach is to do the kfree of the
> > cm_id in the deref logic.
> > This was the original design and
> > leaves the object around to handle the completion of the
> > connect and still allows the app to clean up and go away
> > without all this waitin' around. When the adapter finally
> > finishes and releases its reference, the object is kfree'd.
> >
> > Hope this helps.
>
> Why couldn't you synchronously put the cm_id in a state of
> "pending delete" and do the actual delete when the RNIC
> provides a response to the request?

This is Tom's "alternative" mentioned above. The provider already keeps an explicit reference on the cm_id while it might possibly deliver an event on that cm_id. So if you change deref to kfree the cm_id on its last deref (when the refcnt reaches 0), then you can avoid blocking during destroy...

> There could even be
> an optional method to see if the device is capable of
> cancelling the request. I know it can't yank a SYN back
> from the wire, but it could refrain from retransmitting.

I would suggest we don't add this optional method until we see an RNIC that supports cancelling a connect request or accept synchronously...

Steve.

From mashirle at us.ibm.com Fri Jun 2 04:08:14 2006
From: mashirle at us.ibm.com (Shirley Ma)
Date: Fri, 02 Jun 2006 04:08:14 -0700
Subject: [openib-general] [PATCH] Repost: IPoIB skb panic
Message-ID: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com>

Roland,

I posted the patch yesterday, but it seems it only went to the web site. I am reposting it here for you to review. Please let me know if there is any problem applying this patch.

There are two problems in path_free(), which caused a kernel skb panic during interface up/down stress testing:

1. path_free() should call dev_kfree_skb_any() (any context) instead of dev_kfree_skb_irq() (irq context), since it is called in process context.

2. path->queue should be protected by priv->lock, since there is a race between unicast_send_arp() and ipoib_flush_paths() to release the skb when bringing the interface down. It's safe to use priv->lock, because skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.
Signed-off-by: Shirley Ma

diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-skb/ulp/ipoib/ipoib_main.c
--- infiniband/ulp/ipoib/ipoib_main.c	2006-05-03 13:16:18.000000000 -0700
+++ infiniband-skb/ulp/ipoib/ipoib_main.c	2006-06-01 09:14:05.000000000 -0700
@@ -252,11 +252,11 @@ static void path_free(struct net_device
 	struct sk_buff *skb;
 	unsigned long flags;
 
-	while ((skb = __skb_dequeue(&path->queue)))
-		dev_kfree_skb_irq(skb);
-
 	spin_lock_irqsave(&priv->lock, flags);
 
+	while ((skb = __skb_dequeue(&path->queue)))
+		dev_kfree_skb_any(skb);
+
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
 		/*
 		 * It's safe to call ipoib_put_ah() inside priv->lock

Thanks
Shirley Ma
IBM LTC

From sean.hefty at intel.com Fri Jun 2 11:43:22 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 2 Jun 2006 11:43:22 -0700
Subject: [openib-general] libmthca build issue
Message-ID:

I'm running into an issue trying to build libmthca. During the ./configure step, I get:

	checking size of long... configure: error: cannot compute sizeof (long), 77

Has anyone else run into this?

- Sean

From mshefty at ichips.intel.com Fri Jun 2 12:07:25 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 02 Jun 2006 12:07:25 -0700
Subject: [openib-general] libmthca build issue
In-Reply-To:
References:
Message-ID: <44808C6D.6030708@ichips.intel.com>

Sean Hefty wrote:
> I'm running into an issue trying to build libmthca.
>
> During the ./configure step, I get:
>
> checking size of long... configure: error: cannot compute sizeof (long), 77
>
> Has anyone else run into this?

Rebooting my system and rebuilding made this error go away.

- Sean

From swise at opengridcomputing.com Fri Jun 2 12:09:15 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 02 Jun 2006 14:09:15 -0500
Subject: [openib-general] libmthca build issue
In-Reply-To:
References:
Message-ID: <1149275355.11187.21.camel@stevo-desktop>

On Fri, 2006-06-02 at 11:43 -0700, Sean Hefty wrote:
> I'm running into an issue trying to build libmthca.
>
> During the ./configure step, I get:
>
> checking size of long... configure: error: cannot compute sizeof (long), 77
>
> Has anyone else run into this?
>
> - Sean

I just hit this too today. Inspecting the config log file revealed that it could find libibverbs.so. I ran ldconfig, then reran autogen and configure and it worked. Try that...

Stevo.
From swise at opengridcomputing.com Fri Jun 2 12:14:36 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 02 Jun 2006 14:14:36 -0500
Subject: [openib-general] libmthca build issue
In-Reply-To: <1149275355.11187.21.camel@stevo-desktop>
References: <1149275355.11187.21.camel@stevo-desktop>
Message-ID: <1149275676.11187.23.camel@stevo-desktop>

On Fri, 2006-06-02 at 14:09 -0500, Steve Wise wrote:
> On Fri, 2006-06-02 at 11:43 -0700, Sean Hefty wrote:
> > I'm running into an issue trying to build libmthca.
> >
> > During the ./configure step, I get:
> >
> > checking size of long... configure: error: cannot compute sizeof (long), 77
> >
> > Has anyone else run into this?
> >
> > - Sean
>
> I just hit this too today. Inspecting the config log file revealed that
> it could find libibverbs.so. I ran ldconfig, then reran autogen and
   ^^^^^^
Er, make that "could not"...

From sean.hefty at intel.com Fri Jun 2 12:13:37 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 2 Jun 2006 12:13:37 -0700
Subject: [openib-general] libmthca build issue
In-Reply-To: <1149275355.11187.21.camel@stevo-desktop>
Message-ID:

> I just hit this too today. Inspecting the config log file revealed that
> it could find libibverbs.so. I ran ldconfig, then reran autogen and
> configure and it worked. Try that...

Thanks - I'll try that next time.
From rdreier at cisco.com Fri Jun 2 13:02:33 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 02 Jun 2006 13:02:33 -0700
Subject: [openib-general] [PATCH 5/5] IB/ipath: Add client reregister event generation
In-Reply-To: <1149204468.16993.8.camel@localhost.localdomain> (Bryan O'Sullivan's message of "Thu, 01 Jun 2006 16:27:48 -0700")
References: <20060531223205.10506.51241.stgit@localhost.localdomain> <20060531223218.10506.76076.stgit@localhost.localdomain> <1149204468.16993.8.camel@localhost.localdomain>
Message-ID:

 Bryan> Please CC me on ipath driver patches, as I'm not guaranteed
 Bryan> to see them otherwise.
Sorry, I realized I forgot to do that and sent a heads-up as a reply to the patch email. BTW, you probably will want to update the entry in MAINTAINERS now that you are QLogic and not PathScale...

 - R.

From rdreier at cisco.com Fri Jun 2 13:09:07 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 02 Jun 2006 13:09:07 -0700
Subject: [openib-general] Re: [PATCH] Repost: IPoIB skb panic
In-Reply-To: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> (Shirley Ma's message of "Fri, 02 Jun 2006 04:08:14 -0700")
References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com>
Message-ID:

> 1. path_free() should call dev_kfree_skb_any() (any context) instead of
> dev_kfree_skb_irq() (irq context) since it is called in process
> context.

Agree -- although actually in the current code, plain dev_kfree_skb() would be fine. In fact, since your patch moves the free inside a spinlock, dev_kfree_skb_irq() would be correct.

> 2. path->queue should be protected by priv->lock since there is a race
> between unicast_send_arp() and ipoib_flush_paths() to release skb when
> bringing interface down. It's safe to use priv->lock, because
> skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.

I'm having a hard time understanding this race. path_free() should never be called on paths that are reachable via the list of paths or the rb-tree of paths, and unicast_send_arp() should never touch a path that is going to path_free(). Also, it seems that if there is a race here, then this fix is insufficient, because path_free() does a kfree() on the whole path structure, which would lead to a use-after-free if unicast_send_arp() might still touch it.

So could you diagram the race you are seeing? (i.e. what are the two different threads doing that causes a problem?)

Thanks,
  Roland

From rdreier at cisco.com Fri Jun 2 13:10:49 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 02 Jun 2006 13:10:49 -0700
Subject: [openib-general] [PATCH] Fix ipathverbs compile
In-Reply-To: <20060602064924.GF1736@krispykreme> (Anton Blanchard's message of "Fri, 2 Jun 2006 16:49:24 +1000")
References: <20060602064924.GF1736@krispykreme>
Message-ID:

 Anton> Similar to libehca, I had to add a sysfs include to be able
 Anton> to compile it. Am I missing something or is this correct?

The issue is that I changed the development libibverbs tree in svn to no longer use libsysfs, and libehca and libipathverbs are not yet updated to the new interface. So it is true that they won't compile against the development libibverbs tree without including the libsysfs header, but it's also true that just adding the header so they compile will lead to a driver library that doesn't work anyway.

So I think it's better to leave them not compiling until they are really fixed up.

 - R.

From rdreier at cisco.com Fri Jun 2 13:12:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 02 Jun 2006 13:12:30 -0700
Subject: [openib-general] Mellanox HCAs: outstanding RDMAs
In-Reply-To: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> (Manpreet Singh's message of "Thu, 1 Jun 2006 18:22:53 -0700")
References: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com>
Message-ID:

 Manpreet> Mellanox HCA can handle has been configured at 4
 Manpreet> (mthca_main.c: default_profile: rdb_per_qp). And the
 Manpreet> HCAs can support a much higher value (128 I think).
 Manpreet> Could we move this value higher or at least make it
 Manpreet> configurable?

Leonid Arsh has a patch that I will integrate soon that makes this configurable. However, I'm curious: do you have a workload where this actually makes a measurable difference? It seems that having 4 RDMA requests outstanding on the wire should be enough to get things to pipeline pretty well.

If you haven't tested this, right now you can of course edit mthca_main.c to change the default value and recompile.

 - R.

From rjwalsh at pathscale.com Fri Jun 2 13:35:19 2006
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Fri, 02 Jun 2006 13:35:19 -0700
Subject: [openib-general] [PATCH] Fix ipathverbs compile
In-Reply-To:
References: <20060602064924.GF1736@krispykreme>
Message-ID: <1149280519.13958.10.camel@hematite.pathscale.com>

On Fri, 2006-06-02 at 13:10 -0700, Roland Dreier wrote:
> So I think it's better to leave them not compiling until they are
> really fixed up.

We're in the middle of getting a new software release done here, and just haven't had the bandwidth to look at this yet. I'll get to it, hopefully by the middle of next week, and do the appropriate updates on the libipathverbs end.

Regards,
Robert.

--
Robert Walsh                        Email: rjwalsh at pathscale.com
PathScale, Inc.                     Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200      Fax:   +1 650 428 1969
Mountain View, CA 94043.
Name: signature.asc Type: application/pgp-signature Size: 483 bytes Desc: This is a digitally signed message part URL:
From weiny2 at llnl.gov Fri Jun 2 15:03:28 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 02 Jun 2006 15:03:28 -0700 Subject: [Fwd: [openib-general] [PATCH] ibv_*_pingpong examples : user option for pkey] In-Reply-To: <1149284656.4510.108332.camel@hal.voltaire.com> References: <1149284656.4510.108332.camel@hal.voltaire.com> Message-ID: <20060602150328.2bcd5e48.weiny2@llnl.gov> Hal, I changed the pkey_idx to pkey-idx per your comment. But other than that this is the same patch. Roland do I need to do something else? Thanks, Ira On Fri, 02 Jun 2006 17:44:38 -0400 Hal Rosenstock wrote: > Hey Ira, > > Roland didn't respond to this. You may want to resend this patch to > him and cc: openib-general. Does it need any updating due to other > changes in this ? > > -- Hal > > -----Forwarded Message----- > > From: Ira Weiny > To: openib-general at openib.org > Subject: [openib-general] [PATCH] ibv_*_pingpong examples : user > option for pkey Date: 26 May 2006 16:54:56 -0700 > > While testing the pkey features of opensm I added this patch to be > able to check out the use of different pkeys. > > Ira > > ---- > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- A non-text attachment was scrubbed... Name: pingpong-pkey-option.patch Type: application/octet-stream Size: 8496 bytes Desc: not available URL: From swise at opengridcomputing.com Fri Jun 2 15:03:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 02 Jun 2006 17:03:52 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch Message-ID: <1149285832.11187.33.camel@stevo-desktop> Hello, The gen2 iwarp branch has been merged up to the main trunk revision 7626.
The iwarp branch can be found at gen2/branches/iwarp and contains the Ammasso 1100 and Chelsio T3 drivers and user libs. If you are working on iwarp, please test out this new branch and lemme know if there are any problems. Thanks, Steve. From mashirle at us.ibm.com Fri Jun 2 08:25:13 2006 From: mashirle at us.ibm.com (Shirley Ma) Date: Fri, 02 Jun 2006 08:25:13 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Roland, More clarification: we saw two races here: 1. path_free() was called by both unicast_arp_send() and ipoib_flush_paths() in the same time. 0xc0000004bff0a0d0 3 1 1 0 R 0xc0000004bff0a580 *ksoftirqd/0 SP(esp) PC(eip) Function(args) 0xc00000000f707c80 0xc0000000003199d0 .skb_release_data +0x7c 0xc00000000f707c80 0xc000000000319688 (lr) .kfree_skbmem +0x20 0xc00000000f707d10 0xc000000000319688 .kfree_skbmem +0x20 0xc00000000f707da0 0xc0000000003197fc .__kfree_skb +0x148 0xc00000000f707e50 0xc00000000031e2a8 .net_tx_action +0xa4 0xc00000000f707f00 0xc00000000006ab38 .__do_softirq +0xa8 0xc00000000f707f90 0xc0000000000177b0 .call_do_softirq +0x14 0xc0000000cff83d90 0xc000000000012064 .do_softirq +0x90 0xc0000000cff83e20 0xc00000000006b0fc .ksoftirqd +0xfc 0xc0000000cff83ed0 0xc000000000081d74 .kthread +0x17c 0xc0000000cff83f90 0xc000000000017d24 .kernel_thread +0x4c KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c 2. during unicast arp skb retransmission, unicast_arp_send() appended the skb on the list, while ipoib_flush_paths() calling path_free() to free the same skb from the list. <3>KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c (1742) <4>Warning: kfree_skb passed an skb still on a list (from c00000000031e2a8). <2>kernel BUG in __kfree_skb at net/core/skbuff.c:225! 
(sles9 sp3 kernel) void __kfree_skb(struct sk_buff *skb) { if (skb->list) { printk(KERN_WARNING "Warning: kfree_skb passed an skb still " "on a list (from %p).\n", NET_CALLER(skb)); BUG(); } The patch will fix both problems by using priv->lock to protect path->queue list. Am I right? Thanks Shirley Ma IBM LTC From rdreier at cisco.com Fri Jun 2 16:15:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 Jun 2006 16:15:28 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> (Shirley Ma's message of "Fri, 02 Jun 2006 08:25:13 -0700") References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: > 2. during unicast arp skb retransmission, unicast_arp_send() appended > the skb on the list, while ipoib_flush_paths() calling path_free() to > free the same skb from the list. I think I see what's going on. the skb ends up being on two lists at once I guess... - R. From rdreier at cisco.com Fri Jun 2 16:16:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 Jun 2006 16:16:28 -0700 Subject: [Fwd: [openib-general] [PATCH] ibv_*_pingpong examples : user option for pkey] In-Reply-To: <20060602150328.2bcd5e48.weiny2@llnl.gov> (Ira Weiny's message of "Fri, 02 Jun 2006 15:03:28 -0700") References: <1149284656.4510.108332.camel@hal.voltaire.com> <20060602150328.2bcd5e48.weiny2@llnl.gov> Message-ID: Ira> Hal, I changed the pkey_idx to pkey-idx per your comment. Ira> But other than that this is the same patch. Ira> Roland do I need to do something else? Sorry, I didn't see it the first time around. I'll take a look at it. - R. 
From mashirle at us.ibm.com Fri Jun 2 10:02:49 2006 From: mashirle at us.ibm.com (Shirley Ma) Date: Fri, 02 Jun 2006 10:02:49 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: <1149267773.8085.68.camel@ibm-khxoic5vfkn.beaverton.ibm.com> On Fri, 2006-06-02 at 16:15 -0700, Roland Dreier wrote: > > 2. during unicast arp skb retransmission, unicast_arp_send() appended > > the skb on the list, while ipoib_flush_paths() calling path_free() to > > free the same skb from the list. > > I think I see what's going on. the skb ends up being on two lists at > once I guess... > > - R. The skb has only one prev and one next pointer, so it can only be on one list at a time. How could the skb go on two lists at once? Thanks Shirley From somenath at veritas.com Fri Jun 2 18:07:07 2006 From: somenath at veritas.com (somenath) Date: Fri, 02 Jun 2006 18:07:07 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: References: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> Message-ID: <4480E0BB.5070707@veritas.com> What happens if one tries to do RDMA (say write for example) higher than 4 (or 128 in changed case)? does it just wait till the previous operation is completed? I don't remember seeing any error ....it was only limited by the send Q-depth, which can have a much larger value. thanks, som. Roland Dreier wrote: > Manpreet> Mellanox HCA can handle has been configured at 4 > Manpreet> (mthca_main.c: default_profile: rdb_per_qp). And the > Manpreet> HCAs can support a much higher value (128 I think). > > Manpreet> Could we move this value higher or atleast make it > Manpreet> configurable? > >Leonid Arsh has a patch that I will integrate soon that makes this >configurable. > >However, I'm curious. Do you have a workload where this actually >makes a measurable difference?
It seems that having 4 RDMA requests >outstanding on the wire should be enough to get things to pipeline >pretty well. > >If you haven't tested this, right now you can of course edit >mthca_main.c to change the default value and recompile. > > - R. >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From rdreier at cisco.com Fri Jun 2 18:11:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 Jun 2006 18:11:57 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <1149267773.8085.68.camel@ibm-khxoic5vfkn.beaverton.ibm.com> (Shirley Ma's message of "Fri, 02 Jun 2006 10:02:49 -0700") References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149267773.8085.68.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: > The skb has only one prev, one next pointers, it can only be on one list > at a time. How could skb go on two lists at once? Good question. Actually I was wrong about understanding things before. I don't see any way that path_free() and unicast_arp_send() can be operating on the same struct ipoib_path at the same time. And I don't see how unicast_arp_send() could be handling an skb that's already queued in a path's queue. path_free() only gets called from ipoib_flush_paths() after the path has been removed from the list of paths and the rb_tree of paths (both protected by priv->lock), so unicast_arp_send() wouldn't find the path to queue an skb. And ipoib_flush_paths() can't find a new path created by unicast_arp_send(). Obviously I'm missing something but I still don't see the real cause of your crash. - R.
From rdreier at cisco.com Fri Jun 2 18:23:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 Jun 2006 18:23:24 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <4480E0BB.5070707@veritas.com> (somenath@veritas.com's message of "Fri, 02 Jun 2006 18:07:07 -0700") References: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> <4480E0BB.5070707@veritas.com> Message-ID: > What happens if one tries to do RDMA (say write for example) higher than > 4 (or 128 in changed case)? does it just wait till previos operation > is completed? > I don't remember seeing any error ....it was only > limited by the send Q-depth which can go much larger value. Yes, the limit of outstanding RDMAs is not related to the send queue depth. Of course you can post many more than 4 RDMAs to a send queue -- the HCA just won't have more than 4 requests outstanding at a time. From trimmer at silverstorm.com Sat Jun 3 07:03:07 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Sat, 3 Jun 2006 10:03:07 -0400 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs Message-ID: > > What happens if one tries to do RDMA (say write for example) higher > than > > 4 (or 128 in changed case)? does it just wait till previos operation > > is completed? > > I don't remember seeing any error ....it was only > > limited by the send Q-depth which can go much larger value. > > Yes, the limit of outstanding RDMAs is not related to the send queue > depth. Of course you can post many more than 4 RDMAs to a send queue > -- the HCA just won't have more than 4 requests outstanding at a time. To further clarify, this parameter only affects the number of concurrent outstanding RDMA Reads which the HCA will process. Once it hits this limit, the send Q will stall waiting for issued reads to complete prior to initiating new reads. It does not affect RDMA Writes.
It is very analogous to the outstanding-reads parameters in PCI-X and PCIe (although this parameter is independent of those). The IB spec defines ordering rules for RDMA Reads and Writes. The number of outstanding RDMA Reads is negotiated by the CM during connection establishment and the QP which is sending the RDMA Read must have a value configured for this parameter which is <= the remote end's capability. In previous testing by Mellanox on SDR HCAs they indicated values beyond 2-4 did not improve performance (and in fact required more RDMA resources be allocated for the corresponding QP or HCA). Hence I suspect a very large value like 128 would offer no improvement over values in the 2-8 range. Todd Rimmer
From anton at samba.org Sat Jun 3 17:05:35 2006 From: anton at samba.org (Anton Blanchard) Date: Sun, 4 Jun 2006 10:05:35 +1000 Subject: [openib-general] [PATCH] Fix ipathverbs compile In-Reply-To: <1149280519.13958.10.camel@hematite.pathscale.com> References: <20060602064924.GF1736@krispykreme> <1149280519.13958.10.camel@hematite.pathscale.com> Message-ID: <20060604000535.GA986@krispykreme> Hi, > > The issue is that I changed the development libibverbs tree in svn to > > no longer use libsysfs, and libehca and libipathverbs are not updated > > to the new interface yet. So it is true that they won't compile > > against the development libibverbs tree without including the libsysfs > > header, but it's also true that just adding the header so they compile > > will lead to a driver library that doesn't work anyway. > > > > So I think it's better to leave them not compiling until they are > > really fixed up. > > We're in the middle of getting a new software release done here, and > just haven't had the bandwidth to look at this yet. I'll get around to > it hopefully by the middle of next week and do the appropriate updates > from the libipathverbs end. Thanks for the explanation, makes sense :) Anton From anton at samba.org Sat Jun 3 17:22:00 2006 From: anton at samba.org (Anton Blanchard) Date: Sun, 4 Jun 2006 10:22:00 +1000 Subject: [openib-general] Fix some suspicious ppc64 code in dapl Message-ID: <20060604002200.GB986@krispykreme> Hi, I was reading through the ppc64 specific code in dapl/ and noticed some suspicious inline assembly.
- EIEIO_ON_SMP and ISYNC_ON_SMP are in-kernel UP build optimisations; we shouldn't export them to userspace. Replace them with lwsync and isync. - The comment says it's implementing cmpxchg64 but in fact it's implementing cmpxchg32. Fix the comment. Index: dapl/udapl/linux/dapl_osd.h =================================================================== --- dapl/udapl/linux/dapl_osd.h (revision 7621) +++ dapl/udapl/linux/dapl_osd.h (working copy) @@ -238,14 +238,13 @@ #endif /* __ia64__ */ #elif defined(__PPC64__) __asm__ __volatile__ ( - EIEIO_ON_SMP -"1: lwarx %0,0,%2 # __cmpxchg_u64\n\ - cmpd 0,%0,%3\n\ +" lwsync\n\ +1: lwarx %0,0,%2 # __cmpxchg_u32\n\ + cmpw 0,%0,%3\n\ bne- 2f\n\ stwcx. %4,0,%2\n\ - bne- 1b" - ISYNC_ON_SMP - "\n\ + bne- 1b\n\ + isync\n\ 2:" : "=&r" (current_value), "=m" (*v) : "r" (v), "r" (match_value), "r" (new_value), "m" (*v) From tziporet at mellanox.co.il Sun Jun 4 00:26:46 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 04 Jun 2006 10:26:46 +0300 Subject: [openib-general] Re: OFED RC6 Tag In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007D8CE2B@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007D8CE2B@orsmsx408> Message-ID: <44828B36.5010302@mellanox.co.il> Woodruff, Robert J wrote: > Hi, > > I noticed that you now have a rc6 tag for the OFED kernel code. > Is there a tag for the userspace code ? or what SVN rev will be used > for RC6. > > woody > > There is no tag for the user level code since it is taken directly from the HEAD of the branch. In the release BUILD ID you can see the svn revision of the user level. Tziporet From mst at mellanox.co.il Sun Jun 4 00:42:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 4 Jun 2006 10:42:35 +0300 Subject: [openib-general] Re: [PATCH] RFC: use stdint.h types In-Reply-To: References: <20060531085029.GP21266@mellanox.co.il> Message-ID: <20060604074235.GV21266@mellanox.co.il> Quoting r.
Roland Dreier : > Subject: Re: [PATCH] RFC: use stdint.h types > > My initial reaction is that I don't like this, since it makes it > harder to keep the kernel ABI files in sync between libraries and the > kernel. How about a perl script? #!/usr/bin/perl -pi s/\b__u(64|32|16|8)\b/uint$1_t/; s/\b__s(64|32|16|8)\b/int$1_t/; and back #!/usr/bin/perl -pi s/\buint(64|32|16|8)_t\b/__u$1/; s/\bint(64|32|16|8)_t\b/__s$1/; > Does overriding offsetof() really cause any problems? Dunno, but I'm worried there's some subtle reason gcc 4.0 defines it by means of __builtin_offsetof rather than the traditional way. > Does including <stdint.h> break anything? Well, we are using an undocumented interface, and so it does make things fragile; take the compilation problem on sles10 as an example. -- MST From ishai at mellanox.co.il Sun Jun 4 02:43:22 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Sun, 4 Jun 2006 12:43:22 +0300 Subject: [openib-general] SRP: [PATCH] Misc cleanups in ib_srp Message-ID: <20060604094322.GA9091@mellanox.co.il> Hi, Misc cleanups in ib_srp. Please consider for 2.6.18. 1) I think that it is more efficient to move the req entries from req_list to free_list in srp_reconnect_target (rather than rebuild the free_list). (In any case this code is shorter). 2) This allows us to reuse code in srp_reset_device and srp_reconnect_target and call a new function srp_reset_req. 3) We can use list_move_tail in srp_remove_req.
Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-05-19 11:14:35.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-05-21 17:41:25.000000000 +0300 @@ -451,14 +451,26 @@ static void srp_unmap_data(struct scsi_c scmnd->sc_data_direction); } +static void srp_remove_req(struct srp_target_port *target, struct srp_request *req) +{ + srp_unmap_data(req->scmnd, target, req); + list_move_tail(&req->list, &target->free_reqs); +} + +static void srp_reset_req(struct srp_target_port *target, struct srp_request *req) +{ + req->scmnd->result = DID_RESET << 16; + req->scmnd->scsi_done(req->scmnd); + srp_remove_req(target, req); +} + static int srp_reconnect_target(struct srp_target_port *target) { struct ib_cm_id *new_cm_id; struct ib_qp_attr qp_attr; - struct srp_request *req; + struct srp_request *req, *tmp; struct ib_wc wc; int ret; - int i; spin_lock_irq(target->scsi_host->host_lock); if (target->state != SRP_TARGET_LIVE) { @@ -494,19 +506,12 @@ static int srp_reconnect_target(struct s while (ib_poll_cq(target->cq, 1, &wc) > 0) ; /* nothing */ - list_for_each_entry(req, &target->req_queue, list) { - req->scmnd->result = DID_RESET << 16; - req->scmnd->scsi_done(req->scmnd); - srp_unmap_data(req->scmnd, target, req); - } + list_for_each_entry_safe(req, tmp, &target->req_queue, list) + srp_reset_req(target, req); target->rx_head = 0; target->tx_head = 0; target->tx_tail = 0; - INIT_LIST_HEAD(&target->free_reqs); - INIT_LIST_HEAD(&target->req_queue); - for (i = 0; i < SRP_SQ_SIZE; ++i) - list_add_tail(&target->req_ring[i].list, &target->free_reqs); ret = srp_connect_target(target); if (ret) @@ -706,13 +711,6 @@ static int srp_map_data(struct scsi_cmnd return len; } -static void srp_remove_req(struct srp_target_port *target, struct srp_request *req) -{ - srp_unmap_data(req->scmnd, target, req); - 
list_del(&req->list); - list_add_tail(&req->list, &target->free_reqs); -} - static void srp_process_rsp(struct srp_target_port *target, struct srp_rsp *rsp) { struct srp_request *req; @@ -1349,11 +1347,8 @@ static int srp_reset_device(struct scsi_ spin_lock_irq(target->scsi_host->host_lock); list_for_each_entry_safe(req, tmp, &target->req_queue, list) - if (req->scmnd->device == scmnd->device) { - req->scmnd->result = DID_RESET << 16; - req->scmnd->scsi_done(req->scmnd); - srp_remove_req(target, req); - } + if (req->scmnd->device == scmnd->device) + srp_reset_req(target, req); spin_unlock_irq(target->scsi_host->host_lock); -- Ishai Rabinovitz From mst at mellanox.co.il Sun Jun 4 04:16:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 4 Jun 2006 14:16:15 +0300 Subject: [openib-general] Re: [PATCH] ipoib: fix ah leak at interface down In-Reply-To: References: <200605281547.29313.eli@mellanox.co.il> <20060529151547.GO21266@mellanox.co.il> Message-ID: <20060604111615.GC21266@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib: fix ah leak at interface down > > Michael> If this makes sense, please push into 2.6.17. > > Yes, looks OK for 2.6.17. Out of curiousity: > > Michael> This might result in leaks (we see ah leaks which we > Michael> think can be attributed to this bug) as new packets get > Michael> posted while the interface is going down. > > with this patch applied, do the leaks go away? We've just got a confirmation from customer that this patch really fixes the AH leak. Please ask Linus to pull it into 2.6.17. -- MST From jackm at mellanox.co.il Sun Jun 4 07:10:04 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 4 Jun 2006 17:10:04 +0300 Subject: [openib-general] Removing mpi subtree from ofed branch Message-ID: <200606041710.04801.jackm@mellanox.co.il> I would like to remove the userspace mpi subtree from the ofed branch (https://openib.org/svn/gen2/branches/1.0/src/userspace). 
MPI is supplied in ofed as a separate package, which is not taken from the ofed branch. The presence of the mpi directory in the ofed branch is therefore misleading. If no one objects, I'll delete the mpi subtree from the ofed branch in a week (June 11). - Jack From eli at mellanox.co.il Sun Jun 4 07:17:08 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 04 Jun 2006 17:17:08 +0300 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: <1149430628.6779.14.camel@mtls03.yok.mtl.com> > More clarification: we saw two races here: > 1. path_free() was called by both unicast_arp_send() and > ipoib_flush_paths() in the same time. It is not possible to call path_free() on the same object from both unicast_arp_send() and ipoib_flush_paths(). This is because unicast_arp_send() calls it only for newly created objects for which path_rec_create() failed, in which case the object was never inserted into the list or the rb_tree. > 2. during unicast arp skb retransmission, unicast_arp_send() appended > the skb on the list, while ipoib_flush_paths() calling path_free() to > free the same skb from the list. I don't see any issue here either. Can you reproduce the crash? If you do, can you send how? From sweitzen at cisco.com Sun Jun 4 09:59:07 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Sun, 4 Jun 2006 09:59:07 -0700 Subject: [openib-general] Removing mpi subtree from ofed branch Message-ID: > I would like to remove the userspace mpi subtree from the ofed branch > (https://openib.org/svn/gen2/branches/1.0/src/userspace). > > MPI is supplied in ofed as a separate package, which is not > taken from the > ofed branch. The presence of the mpi directory in the ofed branch is > therefore misleading.
So why don't we put the OFED MVAPICH MPI source in the branch then? It is also kinda confusing that the OFED MVAPICH is a tarball and not in subversion, given that it is based off the code that is in subversion. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems From xma at us.ibm.com Sun Jun 4 10:49:36 2006 From: xma at us.ibm.com (Shirley Ma) Date: Sun, 4 Jun 2006 10:49:36 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <1149430628.6779.14.camel@mtls03.yok.mtl.com> Message-ID: Ohmm. That's a myth. So this problem is hardware-independent, right? It's not easy to reproduce it. ifconfig up and down stress test could hit this problem occasionally. thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From narravul at cse.ohio-state.edu Sun Jun 4 21:43:12 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 5 Jun 2006 00:43:12 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149285832.11187.33.camel@stevo-desktop> References: <1149285832.11187.33.camel@stevo-desktop> Message-ID: Hi Steve, We are trying the new iwarp branch on ammasso adapters. The installation has gone fine. However, on running rping there is an error during the disconnect phase. $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 libibverbs: Warning: no userspace device-specific driver found for uverbs1 driver search path: /usr/local/lib/infiniband libibverbs: Warning: no userspace device-specific driver found for uverbs0 driver search path: /usr/local/lib/infiniband ping data: rdm ping data: rdm ping data: rdm ping data: rdm cq completion failed status 5 DISCONNECT EVENT... *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** Aborted There are no apparent errors showing up in dmesg. Is this error currently expected?
Thanks, --Sundeep. On Fri, 2 Jun 2006, Steve Wise wrote: > Hello, > > The gen2 iwarp branch has been merged up to the main trunk revision > 7626. The iwarp branch can be found at gen2/branches/iwarp and > contains the Ammasso 1100 and Chelsio T3 drivers and user libs. > > If you are working on iwarp, please test out this new branch and lemme > know if there are any problems. > > > Thanks, > > Steve. > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From k_mahesh85 at yahoo.co.in Sun Jun 4 22:39:56 2006 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Mon, 5 Jun 2006 06:39:56 +0100 (BST) Subject: [openib-general] problem with memory registration-RDMA kernel utility Message-ID: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> i am trying to develop a kernel utility to perform RDMA read/write operations. i am facing a problem with memory registration in it. my code looks like......... u64 *addr_array; addr_array = kmalloc(sizeof(u64),GFP_KERNEL); //i am using only one page buffer test->mem = kmalloc(4096,GFP_KERNEL); // buffer on which RDMA_READ is to be performed test->fmr = ib_alloc_fmr(test->pd,IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE, fmr_attr); //fmr_attr is initialised properly addr_array[0] = virt_to_phys(test->mem) ; ret = ib_map_phys_fmr(test->fmr,addr_array[0],1,(u64)test->mem); All these operations are not generating any errors But when i pass this address (addr_array[0]) as the remote address, the RDMA_READ operation on this address is generating IB_WC_REM_ACCESS_ERROR completion. am i missing anything in the process of registering the memory????? Thanks n regards K.Mahesh
Link: http://in.mobile.yahoo.com/new/messenger/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Sun Jun 4 22:59:43 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 5 Jun 2006 08:59:43 +0300 Subject: [openib-general] problem with memory registration-RDMA kernel utliity In-Reply-To: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> References: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> Message-ID: <200606050859.44108.dotanb@mellanox.co.il> Hi. > All these operations are not generating any errors > But when i pass this address (addr_array[0]) as the remote address, the RDMA_READ operation on this address is generating IB_WC_REM_ACCESS_ERROR completion. > > am i missing anything in the process of registering the memory????? 1) Did you enable RDMA_READ + RDMA_WRITE in the modify QP (qp_access_flags) on the responder side? 2) Do you have more than one PD (the QP and MR PDs should be the same)? 3) You should check that the address + rkey that the requestor side uses are the right values. Dotan From mst at mellanox.co.il Sun Jun 4 23:39:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 5 Jun 2006 09:39:23 +0300 Subject: [openib-general] Re: problem with memory registration-RDMA kernel utliity In-Reply-To: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> References: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> Message-ID: <20060605063923.GI21266@mellanox.co.il> Quoting r. keshetti mahesh : > addr_array[0] = virt_to_phys(test->mem) ; Not related to your problem, but you really should be using the DMA API to get the DMA address and pass that to memory registration verbs. -- MST From mst at mellanox.co.il Mon Jun 5 01:11:37 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 5 Jun 2006 11:11:37 +0300 Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: References: <1149430628.6779.14.camel@mtls03.yok.mtl.com> Message-ID: <20060605081136.GJ21266@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: Re: [PATCH]Repost: IPoIB skb panic > > > Ohmm. That's a myth. So this problem is hardware independent, right? > It's not easy to reproduce it. ifconfig up and down stress test could hit this problem occasionally. Could be the same problem Eli's recent patch fixed. http://www.mail-archive.com/openib-general at openib.org/msg20894.html Please try with that applied. -- MST From eitan at mellanox.co.il Mon Jun 5 02:36:46 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 05 Jun 2006 12:36:46 +0300 Subject: [openib-general] [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr Message-ID: <86lksceyfl.fsf@mtl066.yok.mtl.com> Hi Hal I got a report regarding crashes in osm_get_gid_by_mad_addr. It was missing a check on p_port looked up by LID. The affected flows are reports and multicast joins. The fix modified the function to return status (instead of GID). I did run some simulation flows after the fix but please double check before commit. Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 7542) +++ include/opensm/osm_subnet.h (working copy) @@ -770,11 +770,12 @@ struct _osm_port; * * SYNOPSIS */ -ib_gid_t +ib_api_status_t osm_get_gid_by_mad_addr( IN struct _osm_log *p_log, IN const osm_subn_t *p_subn, - IN const struct _osm_mad_addr *p_mad_addr ); + IN const struct _osm_mad_addr *p_mad_addr, + OUT ib_gid_t *p_gid); /* * PARAMETERS * p_log @@ -786,8 +787,11 @@ osm_get_gid_by_mad_addr( * p_mad_addr * [in] Pointer to mad address object. * +* p_gid +* [out] Pointer to teh GID structure to fill in. +* * RETURN VALUES -* Requestor gid object if found. Null otherwise. 
+* IB_SUCCESS if was able to find the GID by address given * * NOTES * Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 7670) +++ opensm/osm_subnet.c (working copy) @@ -236,16 +236,24 @@ osm_subn_init( /********************************************************************** **********************************************************************/ -ib_gid_t +ib_api_status_t osm_get_gid_by_mad_addr( IN osm_log_t* p_log, IN const osm_subn_t *p_subn, - IN const osm_mad_addr_t *p_mad_addr ) + IN const osm_mad_addr_t *p_mad_addr, + OUT ib_gid_t *p_gid) { const cl_ptr_vector_t* p_tbl; const osm_port_t* p_port = NULL; const osm_physp_t* p_physp = NULL; - ib_gid_t request_gid; + + if ( p_gid == NULL ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_get_gid_by_mad_addr: ERR 7505 " + "Provided output GID is NULL\n"); + return(IB_INVALID_PARAMETER); + } /* Find the port gid of the request in the subnet */ p_tbl = &p_subn->port_lid_tbl; @@ -256,9 +264,18 @@ osm_get_gid_by_mad_addr( cl_ntoh16(p_mad_addr->dest_lid)) { p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_mad_addr->dest_lid) ); + if ( p_port == NULL ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "osm_get_gid_by_mad_addr: " + "Did not find any port with LID: 0x%X\n", + cl_ntoh16(p_mad_addr->dest_lid) + ); + return(IB_INVALID_PARAMETER); + } p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); - request_gid.unicast.interface_id = p_physp->port_guid; - request_gid.unicast.prefix = p_subn->opt.subnet_prefix; + p_gid->unicast.interface_id = p_physp->port_guid; + p_gid->unicast.prefix = p_subn->opt.subnet_prefix; } else { @@ -270,7 +287,7 @@ osm_get_gid_by_mad_addr( ); } - return request_gid; + return( IB_SUCCESS ); } /********************************************************************** Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 7670) +++ 
opensm/osm_sa_informinfo.c (working copy) @@ -348,6 +348,7 @@ osm_infr_rcv_process_set_method( uint8_t subscribe; ib_net32_t qpn; uint8_t resp_time_val; + ib_api_status_t res; OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_process_set_method ); @@ -382,8 +383,24 @@ osm_infr_rcv_process_set_method( inform_info_rec.inform_record.subscriber_enum = 0; /* update the subscriber GID according to mad address */ - inform_info_rec.inform_record.subscriber_gid = - osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, &p_madw->mad_addr ); + res = osm_get_gid_by_mad_addr( + p_rcv->p_log, + p_rcv->p_subn, + &p_madw->mad_addr, + &inform_info_rec.inform_record.subscriber_gid); + if ( res != NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_set_method: ERR 4308 " + "Got Subscribe Request from unknown LID: 0x%04X\n", + cl_ntoh16(p_madw->mad_addr.dest_lid) + ); + osm_sa_send_error( + p_rcv->p_resp, + p_madw, + IB_SA_MAD_STATUS_REQ_INVALID); + goto Exit; + } /* * MODIFICATIONS DONE ON INCOMING REQUEST: Index: opensm/osm_sa_mcmember_record.c =================================================================== --- opensm/osm_sa_mcmember_record.c (revision 7670) +++ opensm/osm_sa_mcmember_record.c (working copy) @@ -437,12 +437,21 @@ __add_new_mgrp_port( { boolean_t proxy_join; ib_gid_t requester_gid; + ib_api_status_t res; /* set the proxy_join if the requester gid is not identical to the joined gid */ - requester_gid = osm_get_gid_by_mad_addr( p_rcv->p_log, + res = osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, - p_mad_addr ); + p_mad_addr, &requester_gid ); + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__add_new_mgrp_port: ERR 1B22: " + "Could not find GUID for requestor.\n" ); + + return IB_INVALID_PARAMETER; + } if (!memcmp(&p_recvd_mcmember_rec->port_gid, &requester_gid, sizeof(ib_gid_t))) @@ -755,6 +764,7 @@ __validate_modify(IN osm_mcmr_recv_t* co ib_net64_t portguid; ib_gid_t request_gid; osm_physp_t* p_request_physp; + 
ib_api_status_t res; portguid = p_recvd_mcmember_rec->port_gid.unicast.interface_id; @@ -775,9 +785,19 @@ __validate_modify(IN osm_mcmr_recv_t* co { /* The proxy_join is not set. Modifying can by done only if the requester GID == PortGID */ - request_gid = osm_get_gid_by_mad_addr(p_rcv->p_log, + res = osm_get_gid_by_mad_addr(p_rcv->p_log, p_rcv->p_subn, - p_mad_addr ); + p_mad_addr, + &request_gid); + + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__validate_modify: " + "Could not find any port by given request address.\n" + ); + return FALSE; + } if (memcmp(&((*pp_mcm_port)->port_gid), &request_gid, sizeof(ib_gid_t))) { From halr at voltaire.com Mon Jun 5 03:07:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 06:07:57 -0400 Subject: [openib-general] Re: [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr In-Reply-To: <86lksceyfl.fsf@mtl066.yok.mtl.com> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> Message-ID: <1149502076.4510.202028.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-06-05 at 05:36, Eitan Zahavi wrote: > Hi Hal > > I got a report regarding crashes in osm_get_gid_by_mad_addr. > It was missing a check on p_port looked up by LID. The affected > flows are reports and multicast joins. > > The fix modified the function to return status (instead of GID). > I did run some simulation flows after the fix but please double > check before commit. See comments below. 
> Eitan > > Signed-off-by: Eitan Zahavi > > Index: include/opensm/osm_subnet.h > =================================================================== > --- include/opensm/osm_subnet.h (revision 7542) > +++ include/opensm/osm_subnet.h (working copy) > @@ -770,11 +770,12 @@ struct _osm_port; > * > * SYNOPSIS > */ > -ib_gid_t > +ib_api_status_t > osm_get_gid_by_mad_addr( > IN struct _osm_log *p_log, > IN const osm_subn_t *p_subn, > - IN const struct _osm_mad_addr *p_mad_addr ); > + IN const struct _osm_mad_addr *p_mad_addr, > + OUT ib_gid_t *p_gid); > /* > * PARAMETERS > * p_log > @@ -786,8 +787,11 @@ osm_get_gid_by_mad_addr( > * p_mad_addr > * [in] Pointer to mad address object. > * > +* p_gid > +* [out] Pointer to teh GID structure to fill in. > +* > * RETURN VALUES > -* Requestor gid object if found. Null otherwise. > +* IB_SUCCESS if was able to find the GID by address given > * > * NOTES > * > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 7670) > +++ opensm/osm_subnet.c (working copy) > @@ -236,16 +236,24 @@ osm_subn_init( > > /********************************************************************** > **********************************************************************/ > -ib_gid_t > +ib_api_status_t > osm_get_gid_by_mad_addr( > IN osm_log_t* p_log, > IN const osm_subn_t *p_subn, > - IN const osm_mad_addr_t *p_mad_addr ) > + IN const osm_mad_addr_t *p_mad_addr, > + OUT ib_gid_t *p_gid) > { > const cl_ptr_vector_t* p_tbl; > const osm_port_t* p_port = NULL; > const osm_physp_t* p_physp = NULL; > - ib_gid_t request_gid; > + > + if ( p_gid == NULL ) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "osm_get_gid_by_mad_addr: ERR 7505 " > + "Provided output GID is NULL\n"); > + return(IB_INVALID_PARAMETER); > + } > > /* Find the port gid of the request in the subnet */ > p_tbl = &p_subn->port_lid_tbl; > @@ -256,9 +264,18 @@ osm_get_gid_by_mad_addr( > cl_ntoh16(p_mad_addr->dest_lid)) > { > 
p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_mad_addr->dest_lid) ); > + if ( p_port == NULL ) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "osm_get_gid_by_mad_addr: " > + "Did not find any port with LID: 0x%X\n", > + cl_ntoh16(p_mad_addr->dest_lid) > + ); > + return(IB_INVALID_PARAMETER); > + } > p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); > - request_gid.unicast.interface_id = p_physp->port_guid; > - request_gid.unicast.prefix = p_subn->opt.subnet_prefix; > + p_gid->unicast.interface_id = p_physp->port_guid; > + p_gid->unicast.prefix = p_subn->opt.subnet_prefix; > } > else > { Isn't an error status needed to be returned for this else ? > @@ -270,7 +287,7 @@ osm_get_gid_by_mad_addr( > ); > } > > - return request_gid; > + return( IB_SUCCESS ); > } > > /********************************************************************** > Index: opensm/osm_sa_informinfo.c > =================================================================== > --- opensm/osm_sa_informinfo.c (revision 7670) > +++ opensm/osm_sa_informinfo.c (working copy) > @@ -348,6 +348,7 @@ osm_infr_rcv_process_set_method( > uint8_t subscribe; > ib_net32_t qpn; > uint8_t resp_time_val; > + ib_api_status_t res; > > OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_process_set_method ); > > @@ -382,8 +383,24 @@ osm_infr_rcv_process_set_method( > inform_info_rec.inform_record.subscriber_enum = 0; > > /* update the subscriber GID according to mad address */ > - inform_info_rec.inform_record.subscriber_gid = > - osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, &p_madw->mad_addr ); > + res = osm_get_gid_by_mad_addr( > + p_rcv->p_log, > + p_rcv->p_subn, > + &p_madw->mad_addr, > + &inform_info_rec.inform_record.subscriber_gid); > + if ( res != NULL ) Should this be IB_SUCCESS rather than NULL ? 
> + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_infr_rcv_process_set_method: ERR 4308 " > + "Got Subscribe Request from unknown LID: 0x%04X\n", > + cl_ntoh16(p_madw->mad_addr.dest_lid) > + ); > + osm_sa_send_error( > + p_rcv->p_resp, > + p_madw, > + IB_SA_MAD_STATUS_REQ_INVALID); > + goto Exit; > + } > > /* > * MODIFICATIONS DONE ON INCOMING REQUEST: > Index: opensm/osm_sa_mcmember_record.c > =================================================================== > --- opensm/osm_sa_mcmember_record.c (revision 7670) > +++ opensm/osm_sa_mcmember_record.c (working copy) > @@ -437,12 +437,21 @@ __add_new_mgrp_port( > { > boolean_t proxy_join; > ib_gid_t requester_gid; > + ib_api_status_t res; > > /* set the proxy_join if the requester gid is not identical to the > joined gid */ > - requester_gid = osm_get_gid_by_mad_addr( p_rcv->p_log, > + res = osm_get_gid_by_mad_addr( p_rcv->p_log, > p_rcv->p_subn, > - p_mad_addr ); > + p_mad_addr, &requester_gid ); > + if ( res != IB_SUCCESS ) > + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__add_new_mgrp_port: ERR 1B22: " > + "Could not find GUID for requestor.\n" ); ERR 1B22 is already in use. > + > + return IB_INVALID_PARAMETER; > + } Also, based on this change, the caller of __add_new_mgrp_port should not just send SA error with IB_SA_MAD_STATUS_NO_RESOURCES but rather base it off the error status now. -- Hal > if (!memcmp(&p_recvd_mcmember_rec->port_gid, &requester_gid, > sizeof(ib_gid_t))) > @@ -755,6 +764,7 @@ __validate_modify(IN osm_mcmr_recv_t* co > ib_net64_t portguid; > ib_gid_t request_gid; > osm_physp_t* p_request_physp; > + ib_api_status_t res; > > portguid = p_recvd_mcmember_rec->port_gid.unicast.interface_id; > > @@ -775,9 +785,19 @@ __validate_modify(IN osm_mcmr_recv_t* co > { > /* The proxy_join is not set. 
Modifying can by done only > if the requester GID == PortGID */ > - request_gid = osm_get_gid_by_mad_addr(p_rcv->p_log, > + res = osm_get_gid_by_mad_addr(p_rcv->p_log, > p_rcv->p_subn, > - p_mad_addr ); > + p_mad_addr, > + &request_gid); > + > + if ( res != IB_SUCCESS ) > + { > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > + "__validate_modify: " > + "Could not find any port by given request address.\n" > + ); > + return FALSE; > + } > > if (memcmp(&((*pp_mcm_port)->port_gid), &request_gid, sizeof(ib_gid_t))) > { > From eitan at mellanox.co.il Mon Jun 5 04:33:47 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 5 Jun 2006 14:33:47 +0300 Subject: [openib-general] QoS RFC - Resend using a friendly mailer Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023687C0@mtlexch01.mtl.com> Hi Sasha, Please see my comments below > > > > 9. OpenSM features > > ------------------- > > The QoS related functionality to be provided by OpenSM can be split into two > > main parts: > > > > 3.1. Fabric Setup > > During fabric initialization the SM should parse the policy and apply its > > settings to the discovered fabric elements. The following actions should be > > performed: > > * Parsing of policy > > * Node Group identification. Warning should be provided for each node not > > specified but found. > > * SL2VL settings validation should be checked: > > + A warning will be provided if there are no matching targets for the SL2VL > > setting statement. > > + An error message will be printed to the log file if an invalid setting is > > found. A setting is invalid if it refers to: > > - Non existing port numbers of the target devices > > - Unsupported VLs for the target device. In the later case the map to non > > existing VLs should be replaced to VL15 i.e. packets will be dropped. > > Not sure that unsupported VLs mapping to VL15 is best option. 
Actually > if SL2VL will be specified per port group this may mean that at least in > "generic" case all group members should have similar physical > capabilities or "reliable" part of SLs will be limited by lowest VLCap > in this group (other SLs will be just dropped somewhere). [EZ] I prefer not hiding the mismatch. In my mind the explicit setting should be provided for each of the groups of switches that do not share same VLs support. But this is not a strong requirement in my mind. In general I would prefer to get a clear error message when the fabric can not support the given policy. Once such error is provided I think we could use whatever "recovery" option you have in mind. > > In current SL2VL mapping implementation we are using such rule to replace > unsupported VLs: (new VL) = (requested VL) % (operational data VLs) > This may have some disadvantage too, but I think it is generally "safer". [EZ] It is safer since it will not cause data loss. But then the QoS will probably be broken. > > Also I guess that by "unsupported VLs" you are referring unsupported or > non-configured VLs. [EZ] Yes true. > > > * SL2VL setting is to be performed > > * VL Arbitration table settings should be validated according to the following > > rules: > > + A warning will be provided if there are no matching targets for the setting > > statement > > + An error will be provided if the port number exceeds the target ports > > + An error will be generated if the table length exceeds device capabilities > > + An warning will be generated if the table quote a VL that is not supported > > by the target device > > Should there be replacement rule for not supported VLs? > > In IBTA spec (v.1, p.190, l.14) is stated that entry with unsupported VL > may be skipped _OR_ "trusted" to other (supported) VL. I think if we will > not care about unsupported replacement there may be hole for > "device/vendor dependent" behavior. [EZ] OK good point. Lets have a replacement rule. 
> > Sasha From eitan at mellanox.co.il Mon Jun 5 05:33:07 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 5 Jun 2006 15:33:07 +0300 Subject: [openib-general] RE: [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023687C4@mtlexch01.mtl.com> Hi Hal, I will re-send the patch with fixes. I also replied to the comments below > See comments below. > > > p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); > > - request_gid.unicast.interface_id = p_physp->port_guid; > > - request_gid.unicast.prefix = p_subn->opt.subnet_prefix; > > + p_gid->unicast.interface_id = p_physp->port_guid; > > + p_gid->unicast.prefix = p_subn->opt.subnet_prefix; > > } > > else > > { > > Isn't an error status needed to be returned for this else ? [EZ] Correct > > > @@ -382,8 +383,24 @@ osm_infr_rcv_process_set_method( > > inform_info_rec.inform_record.subscriber_enum = 0; > > > > /* update the subscriber GID according to mad address */ > > - inform_info_rec.inform_record.subscriber_gid = > > - osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, &p_madw- > >mad_addr ); > > + res = osm_get_gid_by_mad_addr( > > + p_rcv->p_log, > > + p_rcv->p_subn, > > + &p_madw->mad_addr, > > + &inform_info_rec.inform_record.subscriber_gid); > > + if ( res != NULL ) > > Should this be IB_SUCCESS rather than NULL ? [EZ] True. 
> > > + { > > * MODIFICATIONS DONE ON INCOMING REQUEST: > > Index: opensm/osm_sa_mcmember_record.c > > =================================================================== > > --- opensm/osm_sa_mcmember_record.c (revision 7670) > > +++ opensm/osm_sa_mcmember_record.c (working copy) > > @@ -437,12 +437,21 @@ __add_new_mgrp_port( > > { > > boolean_t proxy_join; > > ib_gid_t requester_gid; > > + ib_api_status_t res; > > > > /* set the proxy_join if the requester gid is not identical to the > > joined gid */ > > - requester_gid = osm_get_gid_by_mad_addr( p_rcv->p_log, > > + res = osm_get_gid_by_mad_addr( p_rcv->p_log, > > p_rcv->p_subn, > > - p_mad_addr ); > > + p_mad_addr, &requester_gid ); > > + if ( res != IB_SUCCESS ) > > + { > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + "__add_new_mgrp_port: ERR 1B22: " > > + "Could not find GUID for requestor.\n" ); > > ERR 1B22 is already in use. [EZ] OK last was 1B28 using 1B29 > > > + > > + return IB_INVALID_PARAMETER; > > + } > > Also, based on this change, the caller of __add_new_mgrp_port should not > just send SA error with IB_SA_MAD_STATUS_NO_RESOURCES but rather base it > off the error status now. [EZ] Correct. But I think there is no error message that fits exactly the case where the requester is not known to the SM. I will use invalid parameter. > > -- Hal > > > if (!memcmp(&p_recvd_mcmember_rec->port_gid, &requester_gid, > > sizeof(ib_gid_t))) > > @@ -755,6 +764,7 @@ __validate_modify(IN osm_mcmr_recv_t* co > > ib_net64_t portguid; > > ib_gid_t request_gid; > > osm_physp_t* p_request_physp; > > + ib_api_status_t res; > > > > portguid = p_recvd_mcmember_rec->port_gid.unicast.interface_id; > > > > @@ -775,9 +785,19 @@ __validate_modify(IN osm_mcmr_recv_t* co > > { > > /* The proxy_join is not set. 
Modifying can by done only > > if the requester GID == PortGID */ > > - request_gid = osm_get_gid_by_mad_addr(p_rcv->p_log, > > + res = osm_get_gid_by_mad_addr(p_rcv->p_log, > > p_rcv->p_subn, > > - p_mad_addr ); > > + p_mad_addr, > > + &request_gid); > > + > > + if ( res != IB_SUCCESS ) > > + { > > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > > + "__validate_modify: " > > + "Could not find any port by given request address.\n" > > + ); > > + return FALSE; > > + } > > > > if (memcmp(&((*pp_mcm_port)->port_gid), &request_gid, sizeof(ib_gid_t))) > > { > > From Thomas.Talpey at netapp.com Mon Jun 5 05:31:11 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 08:31:11 -0400 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: References: Message-ID: <7.0.1.0.2.20060605081948.044849d0@netapp.com> At 10:03 AM 6/3/2006, Rimmer, Todd wrote: >> Yes, the limit of outstanding RDMAs is not related to the send queue >> depth. Of course you can post many more than 4 RDMAs to a send queue >> -- the HCA just won't have more than 4 requests outstanding at a time. > >To further clarity, this parameter only affects the number of concurrent >outstanding RDMA Reads which the HCA will process. Once it hits this >limit, the send Q will stall waiting for issued reads to complete prior >to initiating new reads. It's worse than that - the send queue must stall for *all* operations. Otherwise the hardware has to track in-progress operations which are queued after stalled ones. It really breaks the initiation model. Semantically, the provider is not required to provide any such flow control behavior by the way. The Mellanox one apparently does, but it is not a requirement of the verbs, it's a requirement on the upper layer. If more RDMA Reads are posted than the remote peer supports, the connection may break. 
>The number of outstanding RDMA Reads is negotiated by the CM during >connection establishment and the QP which is sending the RDMA Read must >have a value configured for this parameter which is <= the remote ends >capability. In other words, we're probably stuck at 4. :-) I don't think there is any Mellanox-based implementation that has ever supported > 4. >In previous testing by Mellanox on SDR HCAs they indicated values beyond >2-4 did not improve performance (and in fact required more RDMA >resources be allocated for the corresponding QP or HCA). Hence I >suspect a very large value like 128 would offer no improvement over >values in the 2-8 range. I am not so sure of that. For one thing, it's dependent on VERY small latencies. The presence of a switch, or link extenders will make a huge difference. Second, heavy multi-QP firmware loads will increase the latencies. Third, constants are pretty much never a good idea in networking. The NFS/RDMA client tries to set the maximum IRD value it can obtain. RDMA Read is used quite heavily by the server to fetch client data segments for NFS writes. Tom. From eitan at mellanox.co.il Mon Jun 5 05:34:03 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 05 Jun 2006 15:34:03 +0300 Subject: [openib-general] [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr (take 2) Message-ID: <864pyzok78.fsf@mtl066.yok.mtl.com> Hi Hal I got a report regarding crashes in osm_get_gid_by_mad_addr. It was missing a check on p_port looked up by LID. The affected flows are reports and multicast joins. The fix modified the function to return status (instead of GID). I did run some simulation flows after the fix but please double check before commit. 
This time I hope I did not missed anything Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 7542) +++ include/opensm/osm_subnet.h (working copy) @@ -770,11 +770,12 @@ struct _osm_port; * * SYNOPSIS */ -ib_gid_t +ib_api_status_t osm_get_gid_by_mad_addr( IN struct _osm_log *p_log, IN const osm_subn_t *p_subn, - IN const struct _osm_mad_addr *p_mad_addr ); + IN const struct _osm_mad_addr *p_mad_addr, + OUT ib_gid_t *p_gid); /* * PARAMETERS * p_log @@ -786,8 +787,11 @@ osm_get_gid_by_mad_addr( * p_mad_addr * [in] Pointer to mad address object. * +* p_gid +* [out] Pointer to teh GID structure to fill in. +* * RETURN VALUES -* Requestor gid object if found. Null otherwise. +* IB_SUCCESS if was able to find the GID by address given * * NOTES * Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 7670) +++ opensm/osm_subnet.c (working copy) @@ -236,16 +236,24 @@ osm_subn_init( /********************************************************************** **********************************************************************/ -ib_gid_t +ib_api_status_t osm_get_gid_by_mad_addr( IN osm_log_t* p_log, IN const osm_subn_t *p_subn, - IN const osm_mad_addr_t *p_mad_addr ) + IN const osm_mad_addr_t *p_mad_addr, + OUT ib_gid_t *p_gid) { const cl_ptr_vector_t* p_tbl; const osm_port_t* p_port = NULL; const osm_physp_t* p_physp = NULL; - ib_gid_t request_gid; + + if ( p_gid == NULL ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_get_gid_by_mad_addr: ERR 7505 " + "Provided output GID is NULL\n"); + return(IB_INVALID_PARAMETER); + } /* Find the port gid of the request in the subnet */ p_tbl = &p_subn->port_lid_tbl; @@ -256,9 +264,18 @@ osm_get_gid_by_mad_addr( cl_ntoh16(p_mad_addr->dest_lid)) { p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_mad_addr->dest_lid) ); + if ( p_port == NULL ) + { + 
osm_log( p_log, OSM_LOG_DEBUG, + "osm_get_gid_by_mad_addr: " + "Did not find any port with LID: 0x%X\n", + cl_ntoh16(p_mad_addr->dest_lid) + ); + return(IB_INVALID_PARAMETER); + } p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); - request_gid.unicast.interface_id = p_physp->port_guid; - request_gid.unicast.prefix = p_subn->opt.subnet_prefix; + p_gid->unicast.interface_id = p_physp->port_guid; + p_gid->unicast.prefix = p_subn->opt.subnet_prefix; } else { @@ -268,9 +285,10 @@ osm_get_gid_by_mad_addr( "Lid is out of range: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); + return(IB_INVALID_PARAMETER); } - return request_gid; + return( IB_SUCCESS ); } /********************************************************************** Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 7670) +++ opensm/osm_sa_informinfo.c (working copy) @@ -348,6 +348,7 @@ osm_infr_rcv_process_set_method( uint8_t subscribe; ib_net32_t qpn; uint8_t resp_time_val; + ib_api_status_t res; OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_process_set_method ); @@ -382,8 +383,24 @@ osm_infr_rcv_process_set_method( inform_info_rec.inform_record.subscriber_enum = 0; /* update the subscriber GID according to mad address */ - inform_info_rec.inform_record.subscriber_gid = - osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, &p_madw->mad_addr ); + res = osm_get_gid_by_mad_addr( + p_rcv->p_log, + p_rcv->p_subn, + &p_madw->mad_addr, + &inform_info_rec.inform_record.subscriber_gid); + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_set_method: ERR 4308 " + "Got Subscribe Request from unknown LID: 0x%04X\n", + cl_ntoh16(p_madw->mad_addr.dest_lid) + ); + osm_sa_send_error( + p_rcv->p_resp, + p_madw, + IB_SA_MAD_STATUS_REQ_INVALID); + goto Exit; + } /* * MODIFICATIONS DONE ON INCOMING REQUEST: Index: opensm/osm_sa_mcmember_record.c 
=================================================================== --- opensm/osm_sa_mcmember_record.c (revision 7670) +++ opensm/osm_sa_mcmember_record.c (working copy) @@ -437,12 +437,21 @@ __add_new_mgrp_port( { boolean_t proxy_join; ib_gid_t requester_gid; + ib_api_status_t res; /* set the proxy_join if the requester gid is not identical to the joined gid */ - requester_gid = osm_get_gid_by_mad_addr( p_rcv->p_log, + res = osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, - p_mad_addr ); + p_mad_addr, &requester_gid ); + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__add_new_mgrp_port: ERR 1B29: " + "Could not find GUID for requestor.\n" ); + + return IB_INVALID_PARAMETER; + } if (!memcmp(&p_recvd_mcmember_rec->port_gid, &requester_gid, sizeof(ib_gid_t))) @@ -755,6 +764,7 @@ __validate_modify(IN osm_mcmr_recv_t* co ib_net64_t portguid; ib_gid_t request_gid; osm_physp_t* p_request_physp; + ib_api_status_t res; portguid = p_recvd_mcmember_rec->port_gid.unicast.interface_id; @@ -775,9 +785,19 @@ __validate_modify(IN osm_mcmr_recv_t* co { /* The proxy_join is not set. 
Modifying can by done only if the requester GID == PortGID */ - request_gid = osm_get_gid_by_mad_addr(p_rcv->p_log, + res = osm_get_gid_by_mad_addr(p_rcv->p_log, p_rcv->p_subn, - p_mad_addr ); + p_mad_addr, + &request_gid); + + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__validate_modify: " + "Could not find any port by given request address.\n" + ); + return FALSE; + } if (memcmp(&((*pp_mcm_port)->port_gid), &request_gid, sizeof(ib_gid_t))) { @@ -1759,7 +1779,11 @@ osm_mcmr_rcv_join_mgrp( __cleanup_mgrp(p_rcv, mlid); CL_PLOCK_RELEASE( p_rcv->p_lock ); + if (status == IB_INVALID_PARAMETER) + sa_status = IB_SA_MAD_STATUS_REQ_INVALID; + else sa_status = IB_SA_MAD_STATUS_NO_RESOURCES; + osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status); goto Exit; } From k_mahesh85 at yahoo.co.in Mon Jun 5 05:37:33 2006 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Mon, 5 Jun 2006 13:37:33 +0100 (BST) Subject: [openib-general] Re: problem with memory registration-RDMA kernel utliity Message-ID: <20060605123733.24901.qmail@web8315.mail.in.yahoo.com> i have added dma_map_single() and sent the address i got from that to perform RDMA_READ, now it is not at all generating any completion event and just halting there itself. below is the changed code...... ----------------------------------------------------------------------------------------------------------------- i am trying to develop a kernel utility to perform RDMA read/write operations i am facing a problem with memory regiatration in it. my code looks like......... 
u64 *addr_array;

addr_array = kmalloc(sizeof(u64), GFP_KERNEL);  // i am using only one page buffer
test->mem = kmalloc(4096, GFP_KERNEL);          // buffer on which RDMA_READ is to be performed
test->fmr = ib_alloc_fmr(test->pd, IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE, fmr_attr); // fmr_attr is initialised properly
addr_array[0] = dma_map_single(test->device->dma_device, test->mem, 4096, DMA_TO_DEVICE);
ret = ib_map_phys_fmr(test->fmr, addr_array[0], 1, addr_array);

None of these operations generates any errors, but when I pass this address (addr_array[0]) as the remote address, the RDMA_READ operation on it generates no completion event and just halts there. Am I missing anything in the process of registering the memory?

Thanks and regards,
K. Mahesh

Send instant messages to your online friends http://in.messenger.yahoo.com
Stay connected with your friends even when away from PC. Link: http://in.mobile.yahoo.com/new/messenger/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From eitan at mellanox.co.il Mon Jun 5 05:40:53 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: 05 Jun 2006 15:40:53 +0300
Subject: [openib-general] [PATCH] osm: management class constants are unit8 not uint16
Message-ID: <863bejojvu.fsf@mtl066.yok.mtl.com>

Hi Hal

Cleaning up compilation warnings I found that osm_vendor_mlx_svc.h was using NTOH16 on the class constants.
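[Editorial note] For readers wondering why the patch below matters: mgmt_class in the MAD header is a single byte, while CL_NTOH16 performs a 16-bit byte swap on little-endian hosts, so the pre-patch comparison could never match. A minimal, hedged sketch (swap16 stands in for CL_NTOH16 on a little-endian machine; the class values are the standard IB constants, and the helper names here are illustrative, not the OpenSM source):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for CL_NTOH16 on a little-endian host: a 16-bit byte swap. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Management class constants are one byte (values from the IB spec). */
enum { IB_MCLASS_SUBN_ADM = 0x03, IB_MCLASS_DEV_MGMT = 0x06 };

/* Pre-patch comparison: swapping widens 0x03 to 0x0300, which can
 * never equal a one-byte field, so the class test always failed. */
static int is_sa_or_devmgmt_buggy(uint8_t mgmt_class)
{
    return mgmt_class == swap16(IB_MCLASS_SUBN_ADM) ||
           mgmt_class == swap16(IB_MCLASS_DEV_MGMT);
}

/* Post-patch comparison: compare the byte directly. */
static int is_sa_or_devmgmt(uint8_t mgmt_class)
{
    return mgmt_class == IB_MCLASS_SUBN_ADM ||
           mgmt_class == IB_MCLASS_DEV_MGMT;
}
```

Since byte order is only meaningful for multi-byte quantities, dropping the swap (rather than switching to an 8-bit variant) is the right fix.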
Eitan Signed-off-by: Eitan Zahavi Index: include/vendor/osm_vendor_mlx_svc.h =================================================================== --- include/vendor/osm_vendor_mlx_svc.h (revision 7542) +++ include/vendor/osm_vendor_mlx_svc.h (working copy) @@ -119,8 +119,8 @@ osmv_mad_is_rmpp(IN const ib_mad_t *p_ma rmpp_flags = ((ib_rmpp_mad_t*)p_mad)->rmpp_flags; /* HACK - JUST SA and DevMgt for now - need to add BIS and DevAdm */ - if ( (p_mad->mgmt_class != CL_NTOH16(IB_MCLASS_SUBN_ADM)) && - (p_mad->mgmt_class != CL_NTOH16(IB_MCLASS_DEV_MGMT)) ) + if ( (p_mad->mgmt_class != IB_MCLASS_SUBN_ADM) && + (p_mad->mgmt_class != IB_MCLASS_DEV_MGMT) ) return(0); return (0 != (rmpp_flags & IB_RMPP_FLAG_ACTIVE)); } From ogerlitz at voltaire.com Mon Jun 5 05:41:42 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 05 Jun 2006 15:41:42 +0300 Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute In-Reply-To: References: Message-ID: <44842686.10002@voltaire.com> Roland Dreier wrote: > I had a chance to look at this, and I don't believe it is precisely > correct for mem-free HCAs with the current FMR implementation. > > > + /* on memfull HCA an FMR can be remapped 2^B - 1 times where B < 32 is > > + * the number of bits which are not used for MPT addressing, on memfree > > + * HCA B=8 so an FMR can be remapped 255 times. > > + */ > > + if(!mthca_is_memfree(mdev)) > > + props->max_map_per_fmr = (1 << (32 - > > + long_log2(mdev->limits.num_mpts))) - 1; > > + else > > + props->max_map_per_fmr = (1 << 8) - 1; > > Look at mthca_arbel_map_phys_fmr(). The question is how often key > will repeat after being indexed, and when MTHCA_FLAG_SINAI_OPT is not > set, then the same increment is used in the mem-free case as in the > Tavor case. 
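[Editorial note] The remap-count arithmetic under discussion can be sketched in isolation. This is a hedged reconstruction, not the mthca source, using the flag polarity the thread finally agrees on (the wide formula applies when MTHCA_FLAG_SINAI_OPT is *not* set): the FMR key repeats after 2^B increments, where B is the number of address bits left over after MPT indexing, or after 2^8 increments in the optimized case.

```c
#include <assert.h>
#include <stdint.h>

/* Integer log2 of a power of two, like the kernel's long_log2(). */
static unsigned int ilog2_u32(uint32_t v)
{
    unsigned int r = 0;
    while (v >>= 1)
        r++;
    return r;
}

/* Sketch of the max_map_per_fmr computation being discussed:
 * without the SINAI key-increment optimization, B = 32 - log2(num_mpts)
 * bits remain for remap counting; with it, the key increment repeats
 * every 2^8, so the FMR can be remapped 255 times. */
static uint32_t max_map_per_fmr(uint32_t num_mpts, int sinai_opt)
{
    if (!sinai_opt)
        return (1u << (32 - ilog2_u32(num_mpts))) - 1;
    return (1u << 8) - 1;
}
```

For example, with 2^17 MPTs and no SINAI optimization this yields 2^15 - 1 = 32767 remaps before the key wraps.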
> > So I think the code I quoted should really be: > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) > props->max_map_per_fmr = (1 << (32 - > long_log2(mdev->limits.num_mpts))) - 1; > else > props->max_map_per_fmr = (1 << 8) - 1; > > Do you agree? If so I can fix this patch up myself and apply it. Yes it makes sense, but you need the check should be if (!(dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)) instead of if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) also, what about the other patch which changes fmr_pool.c to query the device, have you got(reviewed/accepted) it? i have modified it to allocate the device attr struct on the heap as you have asked. Or. From eitan at mellanox.co.il Mon Jun 5 05:51:53 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 05 Jun 2006 15:51:53 +0300 Subject: [openib-general] [PATCH] osm: trivial missing header files fix Message-ID: <861wu3ojdi.fsf@mtl066.yok.mtl.com> Hi Hal Cleaning up compilation warnings I found there missing includes in various sources. Eitan Signed-off-by: Eitan Zahavi Index: include/vendor/osm_vendor_mlx_txn.h =================================================================== --- include/vendor/osm_vendor_mlx_txn.h (revision 7542) +++ include/vendor/osm_vendor_mlx_txn.h (working copy) @@ -37,6 +37,9 @@ #ifndef _OSMV_TXN_H_ #define _OSMV_TXN_H_ +#include +#include + #include #include #include Index: libvendor/osm_vendor_mlx_hca.c =================================================================== --- libvendor/osm_vendor_mlx_hca.c (revision 7542) +++ libvendor/osm_vendor_mlx_hca.c (working copy) @@ -39,6 +39,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #if defined(OSM_VENDOR_INTF_MTL) | defined(OSM_VENDOR_INTF_TS) #undef IN Index: libvendor/osm_vendor_mlx_hca_sim.c =================================================================== --- libvendor/osm_vendor_mlx_hca_sim.c (revision 7542) +++ libvendor/osm_vendor_mlx_hca_sim.c (working copy) @@ -43,6 +43,7 @@ #undef IN #undef OUT +#include #include 
#include #include Index: opensm/osm_node_info_rcv.c =================================================================== --- opensm/osm_node_info_rcv.c (revision 7670) +++ opensm/osm_node_info_rcv.c (working copy) @@ -55,6 +55,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include Index: opensm/osm_drop_mgr.c =================================================================== --- opensm/osm_drop_mgr.c (revision 7670) +++ opensm/osm_drop_mgr.c (working copy) @@ -51,6 +51,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include From dotanb at mellanox.co.il Mon Jun 5 05:57:22 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 5 Jun 2006 15:57:22 +0300 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060605081948.044849d0@netapp.com> References: <7.0.1.0.2.20060605081948.044849d0@netapp.com> Message-ID: <200606051557.22577.dotanb@mellanox.co.il> Hi. > In other words, we're probably stuck at 4. :-) I don't think there is any > Mellanox-based implementation that has ever supported > 4. The VAPI driver (gen1 driver for Mellanox HCAs) supported 8 outstanding RDMA Read/Atomic operations. I guess that the magic value "4" is a low level driver issue and not HCA issue. Dotan From eitan at mellanox.co.il Mon Jun 5 05:59:45 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 05 Jun 2006 15:59:45 +0300 Subject: [openib-general] [PATCH] osm: trivial missing cast in osmt_service call for memcmp Message-ID: <86zmgrn4fy.fsf@mtl066.yok.mtl.com> Hi Hal Last one of my cleaning up compilation warnings I found a missing cast in osmtest service name compare. 
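[Editorial note] For context on the cast fix below: the service-name fields in the IB headers are fixed-size uint8_t arrays rather than char arrays, so passing them straight to strcmp() draws a pointer-signedness warning. A minimal illustration (svc_name_equal is a hypothetical helper, not OpenSM code) — the casts silence the warning without changing the bytes compared:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Service names arrive as uint8_t buffers (ib_svc_name_t style);
 * strcmp() takes const char *, so the call sites cast. The comparison
 * is byte-identical with or without the cast. */
static int svc_name_equal(const uint8_t *a, const uint8_t *b)
{
    return strcmp((const char *)a, (const char *)b) == 0;
}
```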
Eitan Signed-off-by: Eitan Zahavi Index: osmtest/osmt_service.c =================================================================== --- osmtest/osmt_service.c (revision 7542) +++ osmtest/osmt_service.c (working copy) @@ -1138,8 +1138,8 @@ osmt_get_all_services_and_check_names( I "osmt_get_all_services_and_check_names: " "-I- Comparing source name : >%s<, with record name : >%s<, idx : %d\n", p_valid_service_names_arr[j],p_rec->service_name, p_checked_names[j]); - if ( strcmp(p_valid_service_names_arr[j], - p_rec->service_name) == 0 ) + if ( strcmp((char *)p_valid_service_names_arr[j], + (char *)p_rec->service_name) == 0 ) { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " From halr at voltaire.com Mon Jun 5 05:57:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 08:57:47 -0400 Subject: [openib-general] Re: [PATCH] osm: management class constants are unit8 not uint16 In-Reply-To: <863bejojvu.fsf@mtl066.yok.mtl.com> References: <863bejojvu.fsf@mtl066.yok.mtl.com> Message-ID: <1149512262.4510.206671.camel@hal.voltaire.com> On Mon, 2006-06-05 at 08:40, Eitan Zahavi wrote: > Hi Hal > > Cleaning up compilation warnings I found that the osm_vendor_mlx_svc.h > was using NTOH16 on the class constants. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to both trunk and 1.0 branch. 
-- Hal From jlentini at netapp.com Mon Jun 5 06:38:43 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 5 Jun 2006 09:38:43 -0400 (EDT) Subject: [openib-general] Fix some suspicious ppc64 code in dapl In-Reply-To: <20060604002200.GB986@krispykreme> References: <20060604002200.GB986@krispykreme> Message-ID: > Index: dapl/udapl/linux/dapl_osd.h > =================================================================== > --- dapl/udapl/linux/dapl_osd.h (revision 7621) > +++ dapl/udapl/linux/dapl_osd.h (working copy) > @@ -238,14 +238,13 @@ > #endif /* __ia64__ */ > #elif defined(__PPC64__) > __asm__ __volatile__ ( > - EIEIO_ON_SMP > -"1: lwarx %0,0,%2 # __cmpxchg_u64\n\ > - cmpd 0,%0,%3\n\ > +" lwsync\n\ > +1: lwarx %0,0,%2 # __cmpxchg_u32\n\ > + cmpw 0,%0,%3\n\ > bne- 2f\n\ > stwcx. %4,0,%2\n\ > - bne- 1b" > - ISYNC_ON_SMP > - "\n\ > + bne- 1b\n\ > + isync\n\ > 2:" > : "=&r" (current_value), "=m" (*v) > : "r" (v), "r" (match_value), "r" (new_value), "m" (*v) Thank you Anton. Could you replying with a signed off by line? I'll properly attribute this fix to you in the commit log. From halr at voltaire.com Mon Jun 5 07:21:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 10:21:05 -0400 Subject: [openib-general] Re: [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr (take 2) In-Reply-To: <864pyzok78.fsf@mtl066.yok.mtl.com> References: <864pyzok78.fsf@mtl066.yok.mtl.com> Message-ID: <1149517245.4510.208652.camel@hal.voltaire.com> On Mon, 2006-06-05 at 08:34, Eitan Zahavi wrote: > Hi Hal > > I got a report regarding crashes in osm_get_gid_by_mad_addr. > It was missing a check on p_port looked up by LID. The affected > flows are reports and multicast joins. > > The fix modified the function to return status (instead of GID). > I did run some simulation flows after the fix but please double > check before commit. > > This time I hope I did not missed anything > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. 
Applied (with some cosmetic changes) to both trunk and 1.0 branch. -- Hal From halr at voltaire.com Mon Jun 5 08:05:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 11:05:34 -0400 Subject: [openib-general] [PATCH] OpenSM: Don't exit when log fills disk Message-ID: <1149519927.4510.209794.camel@hal.voltaire.com> OpenSM: Don't exit when log fills disk Signed-off-by: Hal Rosenstock Index: opensm/osm_log.c =================================================================== --- opensm/osm_log.c (revision 7645) +++ opensm/osm_log.c (working copy) @@ -80,6 +80,9 @@ static char *month_str[] = { }; #endif /* ndef WIN32 */ +static int log_exit_count = 0; + + void osm_log( IN osm_log_t* const p_log, @@ -175,8 +178,10 @@ osm_log( if (ret < 0) { - fprintf(stderr, "OSM LOG FAILURE! Probably quota exceeded\n"); - exit(1); + if (log_exit_count++ < 10) + { + fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n"); + } } } } From hbchen at lanl.gov Mon Jun 5 08:12:03 2006 From: hbchen at lanl.gov (hbchen) Date: Mon, 05 Jun 2006 09:12:03 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <86lksceyfl.fsf@mtl066.yok.mtl.com> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> Message-ID: <448449C3.9000705@lanl.gov> Hi, I have a question about the IPoIB bandwidth performance. I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G ethernet card, and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). 
NIC (Jumbo enabled)     Line bandwidth (LB)    IPoverNIC bandwidth   Utilization (IPoNIC/LB)
---------------------   --------------------   -------------------   -----------------------------------
Single Gigabit NIC:     1Gb/sec = 125MB/sec    120MB/sec             96%  (PCI-X interface)
Myrinet D card:         250MB/sec              240~245MB/sec         96% ~ 98%  (PCI-X interface)
Myrinet 10G Ethernet:   10Gb/sec = 1280MB/sec  980MB/sec             76.6%  (my testing, Linux 2.6.14.6)
  (PCI-Express)                                1225MB/sec            95.7%  (data from the Myrinet website)
IB HCA4X (PCI-Express): 10Gb/sec = 1280MB/sec  420MB/sec             32.8%  (my testing, Linux 2.6.14.6)
                                               474MB/sec             37%  (best from the OpenIB list, 2.6.12-rc5 patch 1)

Why is the bandwidth utilization of IPoIB so low compared to the other NICs? There must be a lot of room to improve the IPoIB software to reach 75%+ bandwidth utilization.

HB Chen
Los Alamos National Lab
hbchen at lanl.gov

From halr at voltaire.com Mon Jun 5 08:21:23 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Jun 2006 11:21:23 -0400
Subject: [openib-general] Question about the IPoIB bandwidth performance ?
In-Reply-To: <448449C3.9000705@lanl.gov>
References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov>
Message-ID: <1149520880.4510.210194.camel@hal.voltaire.com>

On Mon, 2006-06-05 at 11:12, hbchen wrote:
> Hi,
> I have a question about the IPoIB bandwidth performance.
> I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G
> ethernet card,
> and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface).
> > > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization > (IPoNIC/LB) > --------------------- ---------------- -------------- > ---------------------------------- > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PIC-X interface) > Myrinet D card : 250MB/sec 240~-245MB/sec 96% ~ 98% (PCI-X interface) > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing > using Linux 2.6.14.6) > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing > using Linux 2.6.14.6) > 474MB/sec 37% (the best from OpenIB mailing list) > (2.6.12-rc5 patch 1) > > Why the bandwidth utilization of IPoIB is so low compared to the others > NICs? One thing to note is that the max utilization of 10G IB (4x) is 8G due to the signalling being included in this rate (unlike ethernet whose rate represents the data rate and does not include the signalling overhead). -- Hal > There must be a lot of room to improve the IPoIB software to reach 75%+ > bandwidth utilization. > > > HB Chen > Los Alamos National Lab > hbchen at labl.gov > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hbchen at lanl.gov Mon Jun 5 08:38:24 2006 From: hbchen at lanl.gov (hbchen) Date: Mon, 05 Jun 2006 09:38:24 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <1149520880.4510.210194.camel@hal.voltaire.com> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov> <1149520880.4510.210194.camel@hal.voltaire.com> Message-ID: <44844FF0.9020309@lanl.gov> Hal Rosenstock wrote: >On Mon, 2006-06-05 at 11:12, hbchen wrote: > > >>Hi, >>I have a question about the IPoIB bandwidth performance. 
>>I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G >>ethernet card, >>and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). >> >> >>NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization >>(IPoNIC/LB) >>--------------------- ---------------- -------------- >>---------------------------------- >>Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PIC-X interface) >>Myrinet D card : 250MB/sec 240~-245MB/sec 96% ~ 98% (PCI-X interface) >>Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing >>using Linux 2.6.14.6) >>(PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) >>IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing >>using Linux 2.6.14.6) >>474MB/sec 37% (the best from OpenIB mailing list) >>(2.6.12-rc5 patch 1) >> >>Why the bandwidth utilization of IPoIB is so low compared to the others >>NICs? >> >> > >One thing to note is that the max utilization of 10G IB (4x) is 8G due >to the signalling being included in this rate (unlike ethernet whose >rate represents the data rate and does not include the signalling >overhead). > > Hal, Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >> IPoIB=420MB/sec >> bandwidth utilization= 420/1024 = 41.01% HB >-- Hal > > > >>There must be a lot of room to improve the IPoIB software to reach 75%+ >>bandwidth utilization. >> >> >>HB Chen >>Los Alamos National Lab >>hbchen at labl.gov >> >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... 
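[Editorial note] The arithmetic behind the 41% figure above is worth making explicit. A small sketch using this thread's unit convention (1 Gb/s counted as 128 MB/s, so the 10 Gb/s 4x IB signalling rate yields 8 Gb/s = 1024 MB/s of data after 8b/10b coding; the helper names are illustrative only):

```c
#include <assert.h>

/* MB/s from Gb/s using the thread's convention: 1 Gb/s = 1/8 GB/s
 * = 128 MB/s, so 10 Gb/s -> 1280 MB/s and 8 Gb/s -> 1024 MB/s. */
static double mbs_from_gbs(double gbs)
{
    return gbs * 128.0;
}

/* 4x IB signals at 10 Gb/s, but 8b/10b coding leaves 80% (8 Gb/s)
 * for data -- the point Hal makes above. */
static double ib4x_data_mbs(void)
{
    return mbs_from_gbs(10.0 * 0.8);
}

/* Achieved throughput as a percentage of the link data rate. */
static double utilization_pct(double achieved_mbs, double link_mbs)
{
    return 100.0 * achieved_mbs / link_mbs;
}
```

Against the 1024 MB/s data rate, the measured 420 MB/s comes to roughly 41% utilization, matching the figure quoted in the reply.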
URL: From halr at voltaire.com Mon Jun 5 08:34:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 11:34:50 -0400 Subject: [openib-general] Re: [PATCH] osm: trivial missing header files fix In-Reply-To: <861wu3ojdi.fsf@mtl066.yok.mtl.com> References: <861wu3ojdi.fsf@mtl066.yok.mtl.com> Message-ID: <1149521684.4510.210522.camel@hal.voltaire.com> On Mon, 2006-06-05 at 08:51, Eitan Zahavi wrote: > Hi Hal > > Cleaning up compilation warnings I found there missing includes in > various sources. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. -- Hal From halr at voltaire.com Mon Jun 5 08:45:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 11:45:28 -0400 Subject: [openib-general] Re: [PATCH] osm: trivial missing cast in osmt_service call for memcmp In-Reply-To: <86zmgrn4fy.fsf@mtl066.yok.mtl.com> References: <86zmgrn4fy.fsf@mtl066.yok.mtl.com> Message-ID: <1149522314.4510.210789.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-06-05 at 08:59, Eitan Zahavi wrote: > Hi Hal > > Last one of my cleaning up compilation warnings I found a missing > cast in osmtest service name compare. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. -- Hal From wombat2 at us.ibm.com Mon Jun 5 08:54:42 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Mon, 5 Jun 2006 11:54:42 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <20060605152617.496942283DA@openib.ca.sandia.gov> Message-ID: Hal Rosenstock wrote: > On Mon, 2006-06-05 at 11:12, hbchen wrote: > > Hi, > > I have a question about the IPoIB bandwidth performance. > > I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G > > ethernet card, > > and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). 
> >
> >
> > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization
> > (IPoNIC/LB)
> > --------------------- ---------------- --------------
> > ----------------------------------
> > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PCI-X interface)
> > Myrinet D card : 250MB/sec 240~245MB/sec 96% ~ 98% (PCI-X interface)
> > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing
> > using Linux 2.6.14.6)
> > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website)
> > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing
> > using Linux 2.6.14.6)
> > 474MB/sec 37% (the best from OpenIB mailing list)
> > (2.6.12-rc5 patch 1)
> >
> > Why is the bandwidth utilization of IPoIB so low compared to the other
> > NICs?
>
> One thing to note is that the max utilization of 10G IB (4x) is 8G due
> to the signalling being included in this rate (unlike ethernet whose
> rate represents the data rate and does not include the signalling
> overhead).
>
> -- Hal

You also have larger IP packets when you use GigE ( especially in large
send/offload ) and Myrinet. I think Myrinet uses a 60K MTU and for GigE,
without large send you get a 9000 MTU. With large send you get a 64K buffer
to the adapter so fragmentation to 1500/9000 IP packets is offloaded in the
adapter.

Currently with IPoIB using UD mode, you have to generate lots of 2K packets.
With serialized IPoIB drivers you end up bottlenecking on a single CPU.
There is an IPoIB-CM IETF spec out which should significantly improve IPoIB
performance if implemented.

> > There must be a lot of room to improve the IPoIB software to reach 75%+
> > bandwidth utilization.
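[Editorial note] The message-rate point made above can be put in numbers: at 420 MB/s, a 2 KB UD datagram size means on the order of 205,000 packets per second for a serialized driver path to absorb, versus roughly 47,000 with a 9000-byte jumbo MTU. A sketch of that arithmetic (1 MB taken as 10^6 bytes as a round figure):

```c
#include <assert.h>

/* Packets per second needed to carry a given throughput (in MB/s,
 * with 1 MB = 10^6 bytes) at a given per-packet payload size. */
static double packets_per_sec(double mb_per_sec, double payload_bytes)
{
    return mb_per_sec * 1.0e6 / payload_bytes;
}
```

The roughly 4x difference in per-packet work (interrupts, header processing, lock acquisitions) against a 9000-byte MTU is one way to see why the 2 KB IPoIB UD path bottlenecks on a single CPU.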
> > > > > > HB Chen > > Los Alamos National Lab > > hbchen at labl.gov > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner From xma at us.ibm.com Mon Jun 5 09:02:36 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 5 Jun 2006 09:02:36 -0700 Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <20060605081136.GJ21266@mellanox.co.il> Message-ID: Michael, I will apply this patch. This patch would reduce the race, not address the problem. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Mon Jun 5 09:01:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 09:01:14 -0700 Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute In-Reply-To: <44842686.10002@voltaire.com> (Or Gerlitz's message of "Mon, 05 Jun 2006 15:41:42 +0300") References: <44842686.10002@voltaire.com> Message-ID: > Yes it makes sense, but you need the check should be > > if (!(dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)) > > instead of > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) Yep, you're right, I got it backwards. 
> also, what about the other patch which changes fmr_pool.c to query the > device, have you got(reviewed/accepted) it? i have modified it to > allocate the device attr struct on the heap as you have asked. It looks fine. I was just reviewing everything together. - R. From Thomas.Talpey at netapp.com Mon Jun 5 08:52:03 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 11:52:03 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <44844FF0.9020309@lanl.gov> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov> <1149520880.4510.210194.camel@hal.voltaire.com> <44844FF0.9020309@lanl.gov> Message-ID: <7.0.1.0.2.20060605114203.043ad738@netapp.com> At 11:38 AM 6/5/2006, hbchen wrote: >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >>> IPoIB=420MB/sec >>> bandwidth utilization= 420/1024 = 41.01% Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom. From hbchen at lanl.gov Mon Jun 5 09:11:30 2006 From: hbchen at lanl.gov (hbchen) Date: Mon, 05 Jun 2006 10:11:30 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? 
In-Reply-To: <7.0.1.0.2.20060605114203.043ad738@netapp.com> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov> <1149520880.4510.210194.camel@hal.voltaire.com> <44844FF0.9020309@lanl.gov> <7.0.1.0.2.20060605114203.043ad738@netapp.com> Message-ID: <448457B2.6050608@lanl.gov> Talpey, Thomas wrote: >At 11:38 AM 6/5/2006, hbchen wrote: > > >>Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >> >> >>>>IPoIB=420MB/sec >>>>bandwidth utilization= 420/1024 = 41.01% >>>> >>>> > > >Helen, have you measured the CPU utilizations during these runs? >Perhaps you are out of CPU. > > > Tom, I am HB Chen from LANL not the Helen Chen from SNL. I didn't run out of CPU. It is about 70-80 % of CPU utilization. >Outrageous opinion follows. > >Frankly, an IB HCA running Ethernet emulation is approximately the >world's worst 10GbE adapter (not to put too fine of a point on it :-) ) > > The IP over Myrinet ( Ethernet emulation) can reach upto 96%-98% bandwidth utilization why not the IPoIB ? HB Chen hbchen at lanl.gov >There is no hardware checksumming, nor large-send offloading, both >of which force overhead onto software. And, as you just discovered >it isn't even 10Gb! > >In general, network emulation layers are always going to perform more >poorly than native implementations. But this is only a generality learned >from years of experience with them. > >Tom. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Mon Jun 5 09:11:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 09:11:16 -0700 Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: (Shirley Ma's message of "Mon, 5 Jun 2006 09:02:36 -0700") References: Message-ID: Shirley> I will apply this patch. This patch would reduce the Shirley> race, not address the problem. Does anyone know what the problem really is? I sure don't. - R. 
From Thomas.Talpey at netapp.com Mon Jun 5 09:17:20 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 12:17:20 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <448457B2.6050608@lanl.gov> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov> <1149520880.4510.210194.camel@hal.voltaire.com> <44844FF0.9020309@lanl.gov> <7.0.1.0.2.20060605114203.043ad738@netapp.com> <448457B2.6050608@lanl.gov> Message-ID: <7.0.1.0.2.20060605121321.043ad738@netapp.com> At 12:11 PM 6/5/2006, hbchen wrote: >>Perhaps you are out of CPU. >> >> >Tom, >I am HB Chen from LANL not the Helen Chen from SNL. Oops, sorry! I have too many email messages going by. :-) HB, then. >I didn't run out of CPU. It is about 70-80 % of CPU utilization. But, is one CPU at 100%? Interrupt processing, for example. > >> >>Outrageous opinion follows. >> >>Frankly, an IB HCA running Ethernet emulation is approximately the >>world's worst 10GbE adapter (not to put too fine of a point on it :-) ) >> >The IP over Myrinet ( Ethernet emulation) can reach upto 96%-98% bandwidth utilization why not the IPoIB ? I am not familiar with the implementation Myrinet uses. In any case, I am not saying that an emulation can't reach certain goals, just that they will pretty much always be inferior to native approaches. Sometimes far inferior. Tom. From felix at chelsio.com Mon Jun 5 09:32:10 2006 From: felix at chelsio.com (Felix Marti) Date: Mon, 5 Jun 2006 09:32:10 -0700 Subject: [openib-general] Question about the IPoIB bandwidth performance ? Message-ID: <8A71B368A89016469F72CD08050AD33486F05A@maui.asicdesigners.com> ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of hbchen Sent: Monday, June 05, 2006 9:12 AM To: Talpey, Thomas Cc: openib-general at openib.org Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? 
Talpey, Thomas wrote:
At 11:38 AM 6/5/2006, hbchen wrote:
Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low.
IPoIB=420MB/sec
bandwidth utilization= 420/1024 = 41.01%

Helen, have you measured the CPU utilizations during these runs?
Perhaps you are out of CPU.

Tom,
I am HB Chen from LANL not the Helen Chen from SNL.
I didn't run out of CPU. It is about 70-80 % of CPU utilization.

Outrageous opinion follows.

Frankly, an IB HCA running Ethernet emulation is approximately the
world's worst 10GbE adapter (not to put too fine of a point on it :-) )

The IP over Myrinet ( Ethernet emulation) can reach upto 96%-98% bandwidth utilization why not the IPoIB ?

[Felix:] As pointed out earlier: it is the message rate. If you change the mtu to 1500B (instead of the non-standard 9000B Jumbo frames) performance will drop into the same range as what you see with IPoIB (limited by the receiver).

HB Chen
hbchen at lanl.gov

There is no hardware checksumming, nor large-send offloading, both
of which force overhead onto software. And, as you just discovered
it isn't even 10Gb!

In general, network emulation layers are always going to perform more
poorly than native implementations. But this is only a generality learned
from years of experience with them.

Tom.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ishai at mellanox.co.il Mon Jun 5 08:32:13 2006
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Mon, 5 Jun 2006 18:32:13 +0300
Subject: [openib-general] SRP [PATCH 0/4] Kernel support for removal and restoration of target
Message-ID: <20060605153213.GA7472@mellanox.co.il>

Hi Roland,

I'm sending 4 patches that implement kernel support for removal and restoration of a target (will be used by ibsrpdm). Some comments about them:

1) The first patch splits reconnect into two functions: _srp_remove_target and _srp_restore_target.
_srp_remove_target uses the functions I sent in a previous patch (Misc cleanups in ib_srp). If you want I can resend this patch without using the previous patch (but then there will be a problem with the previous patch :( ).

2) These patches implement the following behavior: When someone writes the string "remove" to /sys/class/scsi_host/host?/remove_target the corresponding target goes to a DISCONNECTED state (after closing the CM and resetting all pending requests). Now when the SCSI layer performs queuecommand to this host, SCSI_MLQUEUE_HOST_BUSY is returned. This causes the SCSI layer to wait until the target returns to the LIVE state.

This is very nice if the user that initiated the remove_target knows what he is doing and will perform a restore_target later. On the other hand it may be problematic if the target remains DISCONNECTED and user applications that try to access this target remain stuck in the kernel (in the SCSI layer).

I have several ideas on how to handle it (have a timeout after which queuecommand will return failure, try to perform a restore_target after a timeout, make sure the daemon will run a restore_target after a timeout) but I'm not sure they are the correct thing to do. I'm waiting for suggestions. In any case I believe we should apply these patches and add a solution to this problem later.

Please comment.
--
Ishai Rabinovitz

From ishai at mellanox.co.il Mon Jun 5 08:33:32 2006
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Mon, 5 Jun 2006 18:33:32 +0300
Subject: [openib-general] SRP [PATCH 1/4] split srp_reconnect_target
Message-ID: <20060605153332.GB7472@mellanox.co.il>

Split srp_reconnect_target into two functions, _srp_remove_target and _srp_restore_target. These functions will also be used later in the patch series to allow removal and restoration of a target from sysfs.
I made some changes in order to support this: 1) There are two new states: SRP_TARGET_DISCONNECTED - The state after _srp_remove_target was successfully executed and before _srp_restore_target is executed. SRP_TARGET_DISCONNECTING - The state while _srp_remove_target is executed. SRP_TARGET_CONNECTING is now the state while _srp_restore_target is executed. 2) The value of target->cm_id can be NULL. This happens after _srp_remove_target destroyed the old cm_id and before _srp_restore_target created the new cm_id. Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 10:03:25.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 10:54:26.000000000 +0300 @@ -40,6 +40,7 @@ #include #include #include +#include #include @@ -373,7 +374,8 @@ static void srp_remove_work(void *target spin_unlock(&target->srp_host->target_lock); scsi_remove_host(target->scsi_host); - ib_destroy_cm_id(target->cm_id); + if (target->cm_id) + ib_destroy_cm_id(target->cm_id); srp_free_target_ib(target); scsi_host_put(target->scsi_host); } @@ -464,20 +466,57 @@ static void srp_reset_req(struct srp_tar srp_remove_req(target, req); } -static int srp_reconnect_target(struct srp_target_port *target) +static void srp_remove_target_port(struct srp_target_port *target) +{ + /* + * Kill our target port off. + * However, we have to defer the real removal because we might + * be in the context of the SCSI error handler now, which + * would deadlock if we call scsi_remove_host(). 
+ */ + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_REMOVED) { + target->state = SRP_TARGET_DEAD; + INIT_WORK(&target->work, srp_remove_work, target); + schedule_work(&target->work); + } + spin_unlock_irq(target->scsi_host->host_lock); +} + +static int _srp_remove_target(struct srp_target_port *target) { - struct ib_cm_id *new_cm_id; struct ib_qp_attr qp_attr; struct srp_request *req, *tmp; struct ib_wc wc; - int ret; + int ret = 0; spin_lock_irq(target->scsi_host->host_lock); - if (target->state != SRP_TARGET_LIVE) { + switch (target->state) { + case SRP_TARGET_REMOVED: + case SRP_TARGET_DEAD: + ret = -ENOENT; + break; + + case SRP_TARGET_DISCONNECTING: + case SRP_TARGET_CONNECTING: + ret = -EAGAIN; /* So that the caller will try again later - + after the connection ends one way or another */ + break; + + case SRP_TARGET_DISCONNECTED: + ret = -ENOTCONN; + break; + + case SRP_TARGET_LIVE: + break; + } + + if (ret) { spin_unlock_irq(target->scsi_host->host_lock); - return -EAGAIN; + return ret; } - target->state = SRP_TARGET_CONNECTING; + + target->state = SRP_TARGET_DISCONNECTING; spin_unlock_irq(target->scsi_host->host_lock); srp_disconnect_target(target); @@ -485,24 +525,14 @@ static int srp_reconnect_target(struct s * Now get a new local CM ID so that we avoid confusing the * target in case things are really fouled up. 
*/ - new_cm_id = ib_create_cm_id(target->srp_host->dev->dev, - srp_cm_handler, target); - if (IS_ERR(new_cm_id)) { - ret = PTR_ERR(new_cm_id); - goto err; - } ib_destroy_cm_id(target->cm_id); - target->cm_id = new_cm_id; + target->cm_id = NULL; qp_attr.qp_state = IB_QPS_RESET; ret = ib_modify_qp(target->qp, &qp_attr, IB_QP_STATE); if (ret) goto err; - ret = srp_init_qp(target, target->qp); - if (ret) - goto err; - while (ib_poll_cq(target->cq, 1, &wc) > 0) ; /* nothing */ @@ -513,6 +543,49 @@ static int srp_reconnect_target(struct s target->tx_head = 0; target->tx_tail = 0; + spin_lock_irq(target->scsi_host->host_lock); + if (target->state == SRP_TARGET_DISCONNECTING) { + ret = 0; + target->state = SRP_TARGET_DISCONNECTED; + } else + ret = -EAGAIN; + spin_unlock_irq(target->scsi_host->host_lock); + + return ret; + +err: + printk(KERN_ERR PFX "remove failed (%d), removing target port.\n", ret); + + srp_remove_target_port(target); + + return ret; +} + +static int _srp_restore_target(struct srp_target_port *target) +{ + struct ib_cm_id *new_cm_id; + int ret; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_DISCONNECTED) { + spin_unlock_irq(target->scsi_host->host_lock); + return -EAGAIN; + } + target->state = SRP_TARGET_CONNECTING; + spin_unlock_irq(target->scsi_host->host_lock); + + new_cm_id = ib_create_cm_id(target->srp_host->dev->dev, + srp_cm_handler, target); + if (IS_ERR(new_cm_id)) { + ret = PTR_ERR(new_cm_id); + goto err; + } + target->cm_id = new_cm_id; + + ret = srp_init_qp(target, target->qp); + if (ret) + goto err; + ret = srp_connect_target(target); if (ret) goto err; @@ -528,25 +601,22 @@ static int srp_reconnect_target(struct s return ret; err: - printk(KERN_ERR PFX "reconnect failed (%d), removing target port.\n", ret); + printk(KERN_ERR PFX "restore failed (%d), removing target port.\n", ret); - /* - * We couldn't reconnect, so kill our target port off. 
- * However, we have to defer the real removal because we might - * be in the context of the SCSI error handler now, which - * would deadlock if we call scsi_remove_host(). - */ - spin_lock_irq(target->scsi_host->host_lock); - if (target->state == SRP_TARGET_CONNECTING) { - target->state = SRP_TARGET_DEAD; - INIT_WORK(&target->work, srp_remove_work, target); - schedule_work(&target->work); - } - spin_unlock_irq(target->scsi_host->host_lock); + srp_remove_target_port(target); return ret; } +static int srp_reconnect_target(struct srp_target_port *target) +{ + int ret = _srp_remove_target(target); + if (ret && ret != -ENOTCONN) + return ret; + + return _srp_restore_target(target); +} + static int srp_map_fmr(struct srp_device *dev, struct scatterlist *scat, int sg_cnt, struct srp_request *req, struct srp_direct_buf *buf) @@ -933,6 +1003,13 @@ static int __srp_post_send(struct srp_ta return ret; } +static int srp_target_is_not_connected(struct srp_target_port *target) +{ + return (1 << target->state) & + ((1 << SRP_TARGET_CONNECTING) | (1 << SRP_TARGET_DISCONNECTING) | + (1 << SRP_TARGET_DISCONNECTED)); +} + static int srp_queuecommand(struct scsi_cmnd *scmnd, void (*done)(struct scsi_cmnd *)) { @@ -942,7 +1019,7 @@ static int srp_queuecommand(struct scsi_ struct srp_cmd *cmd; int len; - if (target->state == SRP_TARGET_CONNECTING) + if (unlikely(srp_target_is_not_connected(target))) goto err; if (target->state == SRP_TARGET_DEAD || @@ -1292,6 +1369,9 @@ static int srp_abort(struct scsi_cmnd *s printk(KERN_ERR "SRP abort called\n"); + if (srp_target_is_not_connected(target)) + return FAILED; + if (srp_find_req(target, scmnd, &req)) return FAILED; if (srp_send_tsk_mgmt(target, req, SRP_TSK_ABORT_TASK)) @@ -1320,6 +1400,9 @@ static int srp_reset_device(struct scsi_ printk(KERN_ERR "SRP reset_device called\n"); + if (srp_target_is_not_connected(target)) + return FAILED; + if (srp_find_req(target, scmnd, &req)) return FAILED; if (srp_send_tsk_mgmt(target, req, 
SRP_TSK_LUN_RESET)) @@ -1914,8 +2000,10 @@ static void srp_remove_one(struct ib_dev list_for_each_entry_safe(target, tmp_target, &host->target_list, list) { scsi_remove_host(target->scsi_host); - srp_disconnect_target(target); - ib_destroy_cm_id(target->cm_id); + if (target->cm_id) { + srp_disconnect_target(target); + ib_destroy_cm_id(target->cm_id); + } srp_free_target_ib(target); scsi_host_put(target->scsi_host); } Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.h =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.h 2006-06-04 10:02:47.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.h 2006-06-04 10:03:25.000000000 +0300 @@ -75,6 +75,8 @@ enum { enum srp_target_state { SRP_TARGET_LIVE, SRP_TARGET_CONNECTING, + SRP_TARGET_DISCONNECTED, + SRP_TARGET_DISCONNECTING, SRP_TARGET_DEAD, SRP_TARGET_REMOVED }; -- Ishai Rabinovitz From ishai at mellanox.co.il Mon Jun 5 08:34:33 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Mon, 5 Jun 2006 18:34:33 +0300 Subject: [openib-general] SRP [PATCH 2/4] remove target Message-ID: <20060605153433.GC7472@mellanox.co.il> Add support to remove_target from sysfs. 
Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-05 16:46:55.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-05 17:11:11.000000000 +0300 @@ -1516,6 +1516,10 @@ return sprintf(buf, "%d\n", target->zero_req_lim); } +static ssize_t srp_remove_target(struct class_device *cdev, + const char *buf, size_t count); + +static CLASS_DEVICE_ATTR(remove_target, S_IWUSR, NULL, srp_remove_target); static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); @@ -1524,6 +1528,7 @@ static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); static struct class_device_attribute *srp_host_attrs[] = { + &class_device_attr_remove_target, &class_device_attr_id_ext, &class_device_attr_ioc_guid, &class_device_attr_service_id, @@ -1814,6 +1819,23 @@ static CLASS_DEVICE_ATTR(add_target, S_IWUSR, NULL, srp_create_target); +static ssize_t srp_remove_target(struct class_device *cdev, + const char *buf, size_t count) +{ + int ret; + static const char remove_str[] = "remove"; + + if (strncmp(buf, remove_str, sizeof(remove_str) - 1)) + return -EINVAL; + + ret = _srp_remove_target(host_to_target(class_to_shost(cdev))); + + if (ret) + return ret; + + return count; +} + static ssize_t show_ibdev(struct class_device *class_dev, char *buf) { struct srp_host *host = -- Ishai Rabinovitz From ishai at mellanox.co.il Mon Jun 5 08:35:17 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Mon, 5 Jun 2006 18:35:17 +0300 Subject: [openib-general] SRP [PATCH 3/4] restore target Message-ID: <20060605153517.GD7472@mellanox.co.il> Add support to restore_target from sysfs.
Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 10:01:50.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 10:02:27.000000000 +0300 @@ -1551,16 +1551,21 @@ static ssize_t show_zero_req_lim(struct static ssize_t srp_remove_target(struct class_device *cdev, const char *buf, size_t count); -static CLASS_DEVICE_ATTR(remove_target, S_IWUSR, NULL, srp_remove_target); -static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); -static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); -static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); -static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); -static CLASS_DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); -static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); +static ssize_t srp_restore_target(struct class_device *cdev, + const char *buf, size_t count); + +static CLASS_DEVICE_ATTR(remove_target, S_IWUSR, NULL, srp_remove_target); +static CLASS_DEVICE_ATTR(restore_target, S_IWUSR, NULL, srp_restore_target); +static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); +static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); +static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); +static CLASS_DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); +static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); static struct class_device_attribute *srp_host_attrs[] = { &class_device_attr_remove_target, + &class_device_attr_restore_target, &class_device_attr_id_ext, &class_device_attr_ioc_guid, &class_device_attr_service_id, @@ -1861,6 +1866,17 @@ static ssize_t srp_remove_target(struct return count; } +static ssize_t srp_restore_target(struct class_device *cdev, + const char *buf, size_t count) 
+{ + int ret = _srp_restore_target(host_to_target(class_to_shost(cdev))); + + if (ret) + return ret; + + return count; +} + static ssize_t show_ibdev(struct class_device *class_dev, char *buf) { struct srp_host *host = -- Ishai Rabinovitz From ishai at mellanox.co.il Mon Jun 5 08:36:06 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Mon, 5 Jun 2006 18:36:06 +0300 Subject: [openib-general] SRP [PATCH 4/4] show_srp_state Message-ID: <20060605153606.GE7472@mellanox.co.il> Add query for srp_state in sysfs. Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-05-31 18:52:14.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 14:21:52.000000000 +0300 @@ -1362,6 +1362,26 @@ static int srp_reset_host(struct scsi_cm return ret; } +static ssize_t show_srp_state(struct class_device *cdev, char *buf) +{ + struct srp_target_port *target = host_to_target(class_to_shost(cdev)); + enum srp_target_state target_state = target->state; + + static const char *state_name[] = { + [SRP_TARGET_LIVE] = "LIVE", + [SRP_TARGET_CONNECTING] = "CONNECTING", + [SRP_TARGET_DISCONNECTING] = "DISCONNECTING", + [SRP_TARGET_DISCONNECTED] = "DISCONNECTED", + [SRP_TARGET_DEAD] = "DEAD", + [SRP_TARGET_REMOVED] = "REMOVED", + }; + + if (target_state >= 0 && target_state < ARRAY_SIZE(state_name)) + return sprintf(buf, "%s\n", state_name[target_state]); + + return sprintf(buf, "UNKNOWN\n"); +} + static ssize_t show_id_ext(struct class_device *cdev, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(cdev)); @@ -1439,6 +1459,7 @@ static ssize_t show_zero_req_lim(struct static CLASS_DEVICE_ATTR(remove_target, S_IWUSR, NULL, srp_remove_target); static CLASS_DEVICE_ATTR(restore_target, S_IWUSR, NULL, srp_restore_target); +static CLASS_DEVICE_ATTR(srp_state, S_IRUGO, show_srp_state, NULL); 
static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); @@ -1447,6 +1468,7 @@ static CLASS_DEVICE_ATTR(dgid, S_IRUGO, static struct class_device_attribute *srp_host_attrs[] = { &class_device_attr_remove_target, &class_device_attr_restore_target, + &class_device_attr_srp_state, &class_device_attr_id_ext, &class_device_attr_ioc_guid, &class_device_attr_service_id, -- Ishai Rabinovitz From mst at mellanox.co.il Mon Jun 5 09:40:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 5 Jun 2006 19:40:14 +0300 Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: References: Message-ID: <20060605164014.GA32268@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic > > Shirley> I will apply this patch. This patch would reduce the > Shirley> race, not address the problem. > > Does anyone know what the problem really is? I sure don't. Not me :). I suspect Shirley is seeing results of memory corruption as a result of interface getting restarted - the problem fixed by Eli's patch. -- MST From wombat2 at us.ibm.com Mon Jun 5 09:53:02 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Mon, 5 Jun 2006 12:53:02 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> Message-ID: > Thomas Talpey said: > At 11:38 AM 6/5/2006, hbchen wrote: > >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very > low. > >>> IPoIB=420MB/sec > >>> bandwidth utilization= 420/1024 = 41.01% > > > Helen, have you measured the CPU utilizations during these runs? > Perhaps you are out of CPU. > > Outrageous opinion follows. 
> > Frankly, an IB HCA running Ethernet emulation is approximately the > world's worst 10GbE adapter (not to put too fine of a point on it :-) ) > There is no hardware checksumming, nor large-send offloading, both > of which force overhead onto software. And, as you just discovered > it isn't even 10Gb! > > In general, network emulation layers are always going to perform more > poorly than native implementations. But this is only a generality learned > from years of experience with them. > > Tom. Hold on here.... Who said anything about Ethernet emulation? Hal said he is running straight netperf over IB, not Ethernet emulation. I don't think that any IB HCAs today support offloaded checksum and large send. You are comparing apples and oranges. The only appropriate comparison is to use the IBM HCA compared to the mthca adapters. I think Hal's point is actually comparing "any" IB adapter against GigE and Myrinet. Both the mthca and IBM HCAs should get similar IPoIB performance using identical OpenIB stacks. Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future."
William Shatner openib-general-re quest at openib.org Sent by: To openib-general-bo openib-general at openib.org unces at openib.org cc Subject 06/05/2006 12:11 openib-general Digest, Vol 24, PM Issue 22 Please respond to openib-general at op enib.org Send openib-general mailing list submissions to openib-general at openib.org To subscribe or unsubscribe via the World Wide Web, visit http://openib.org/mailman/listinfo/openib-general or, via email, send a message with subject or body 'help' to openib-general-request at openib.org You can reach the person managing the list at openib-general-owner at openib.org When replying, please edit your Subject line so it is more specific than "Re: Contents of openib-general digest..." Today's Topics: 1. Re: Question about the IPoIB bandwidth performance ? (hbchen) 2. Re: [PATCH] osm: trivial missing header files fix (Hal Rosenstock) 3. Re: [PATCH] osm: trivial missing cast in osmt_service call for memcmp (Hal Rosenstock) 4. Re: Question about the IPoIB bandwidth performance ? (Bernard King-Smith) 5. Re: Re: [PATCH]Repost: IPoIB skb panic (Shirley Ma) 6. Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute (Roland Dreier) 7. Re: Question about the IPoIB bandwidth performance ? (Talpey, Thomas) 8. Re: Question about the IPoIB bandwidth performance ? (hbchen) ----- Message from "hbchen" on Mon, 05 Jun 2006 09:38:24 -0600 ----- To: "Hal Rosenstock" cc: "OPENIB" Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? Hal Rosenstock wrote: On Mon, 2006-06-05 at 11:12, hbchen wrote: Hi, I have a question about the IPoIB bandwidth performance. I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G ethernet card, and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). 
NIC (Jumbo enabled)       Line bandwidth (LB)      IP-over-NIC      Bandwidth utilization (IPoNIC/LB)
---------------------     --------------------     -------------    ---------------------------------
Single Gigabit NIC        1Gb/sec = 125MB/sec      120MB/sec        96% (PCI-X interface)
Myrinet D card            250MB/sec                240~245MB/sec    96% ~ 98% (PCI-X interface)
Myrinet 10G Ethernet      10Gb/sec = 1280MB/sec    980MB/sec        76.6% (my testing, Linux 2.6.14.6)
  (PCI-Express)                                    1225MB/sec       95.7% (data from Myrinet website)
IB HCA4X (PCI-Express)    10Gb/sec = 1280MB/sec    420MB/sec        32.8% (my testing, Linux 2.6.14.6)
                                                   474MB/sec        37% (best from OpenIB mailing list, 2.6.12-rc5 patch 1)
Why is the bandwidth utilization of IPoIB so low compared to the other NICs? One thing to note is that the max utilization of 10G IB (4x) is 8G due to the signalling being included in this rate (unlike Ethernet, whose rate represents the data rate and does not include the signalling overhead). Hal, Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >> IPoIB=420MB/sec >> bandwidth utilization= 420/1024 = 41.01% HB -- Hal There must be a lot of room to improve the IPoIB software to reach 75%+ bandwidth utilization. HB Chen Los Alamos National Lab hbchen at lanl.gov _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ----- Message from "Hal Rosenstock" on 05 Jun 2006 11:34:50 -0400 ----- To: "Eitan Zahavi" cc: "OPENIB" Subject: [openib-general] Re: [PATCH] osm: trivial missing header files fix On Mon, 2006-06-05 at 08:51, Eitan Zahavi wrote: > Hi Hal > > While cleaning up compilation warnings I found missing includes in > various sources. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only.
-- Hal ----- Message from "Hal Rosenstock" on 05 Jun 2006 11:45:28 -0400 ----- To: "Eitan Zahavi" cc: "OPENIB" Subject [openib-general] Re: [PATCH] osm: trivial missing cast in : osmt_service call for memcmp Hi Eitan, On Mon, 2006-06-05 at 08:59, Eitan Zahavi wrote: > Hi Hal > > Last one of my cleaning up compilation warnings I found a missing > cast in osmtest service name compare. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. -- Hal ----- Message from "Bernard King-Smith" on Mon, 5 Jun 2006 11:54:42 -0400 ----- To: openib-general at openib.org Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? Hal Rosenstock wrote: > On Mon, 2006-06-05 at 11:12, hbchen wrote: > > Hi, > > I have a question about the IPoIB bandwidth performance. > > I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G > > ethernet card, > > and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). > > > > > > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization > > (IPoNIC/LB) > > --------------------- ---------------- -------------- > > ---------------------------------- > > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PIC-X interface) > > Myrinet D card : 250MB/sec 240~-245MB/sec 96% ~ 98% (PCI-X interface) > > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing > > > using Linux 2.6.14.6) > > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) > > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing > > using Linux 2.6.14.6) > > 474MB/sec 37% (the best from OpenIB mailing list) > > (2.6.12-rc5 patch 1) > > > > Why the bandwidth utilization of IPoIB is so low compared to the others > > NICs? > > One thing to note is that the max utilization of 10G IB (4x) is 8G due > to the signalling being included in this rate (unlike ethernet whose > rate represents the data rate and does not include the signalling > overhead). 
> > -- Hal > You also have larger IP packets when you use GigE (especially in large send/offload) and Myrinet. I think Myrinet uses a 60K MTU and for GigE, without large send you get a 9000 MTU. With large send you get a 64K buffer to the adapter, so fragmentation to 1500/9000 IP packets is offloaded in the adapter. Currently with IPoIB using UD mode, you have to generate lots of 2K packets. With serialized IPoIB drivers you end up bottlenecking on a single CPU. There is an IPoIB-CM IETF spec out which should significantly improve IPoIB performance if implemented. > > There must be a lot of room to improve the IPoIB software to reach 75%+ > > bandwidth utilization. > > > > > > HB Chen > > Los Alamos National Lab > > hbchen at lanl.gov > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner ----- Message from "Shirley Ma" on Mon, 5 Jun 2006 09:02:36 -0700 ----- To: "Michael S. Tsirkin" cc: "Roland Dreier" , mashirle at us.ibm.com, openib-general at openib.org Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic Michael, I will apply this patch. This patch would reduce the race, not address the problem.
Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 ----- Message from "Roland Dreier" on Mon, 05 Jun 2006 09:01:14 -0700 ----- To: "Or Gerlitz" cc: openib-general at openib.org Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute > Yes it makes sense, but the check should be > > if (!(dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)) > > instead of > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) Yep, you're right, I got it backwards. > also, what about the other patch which changes fmr_pool.c to query the > device, have you got (reviewed/accepted) it? I have modified it to > allocate the device attr struct on the heap as you have asked. It looks fine. I was just reviewing everything together. - R. ----- Message from "Talpey, Thomas" on Mon, 05 Jun 2006 11:52:03 -0400 ----- To: "hbchen" cc: openib-general at openib.org Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? At 11:38 AM 6/5/2006, hbchen wrote: >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >>> IPoIB=420MB/sec >>> bandwidth utilization= 420/1024 = 41.01% Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom.
----- Message from "hbchen" on Mon, 05 Jun 2006 10:11:30 -0600 ----- To: "Talpey, Thomas" cc: openib-general at openib.org Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? Talpey, Thomas wrote: At 11:38 AM 6/5/2006, hbchen wrote: Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. IPoIB=420MB/sec bandwidth utilization= 420/1024 = 41.01% Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Tom, I am HB Chen from LANL, not the Helen Chen from SNL. I didn't run out of CPU. It is about 70-80% CPU utilization. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) The IP over Myrinet (Ethernet emulation) can reach up to 96%-98% bandwidth utilization; why not IPoIB? HB Chen hbchen at lanl.gov There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general From hycsw at ca.sandia.gov Mon Jun 5 09:55:11 2006 From: hycsw at ca.sandia.gov (Helen Chen) Date: Mon, 5 Jun 2006 09:55:11 -0700 (PDT) Subject: [openib-general] Question about the IPoIB bandwidth performance ? Message-ID: <200606051655.JAA18854@ca.sandia.gov> Tom, We are in the process of measuring the CPU utilization on our NFS/RDMA experiments in contrast with regular NFS; we also intend to include netperf numbers and will keep you posted with our results as soon as possible.
Helen ----- original Message ----- >From openib-general-bounces at openib.org Mon Jun 5 09:03:56 2006 Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Thomas.Talpey at netapp.com Mon Jun 5 10:08:17 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 13:08:17 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: References: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> Message-ID: <7.0.1.0.2.20060605130607.086feab0@netapp.com> >Who said anything about Ethernnet emulation. Hal said he is running >straight Netperf over IB not ethernet emulation. I don't think that any IB >HCAs today support offloaded checksum and large send. You are comparing >apples and oranges. I consider IPoIB to be Ethernet emulation. As for apples and oranges, my point exactly. Tom. At 12:53 PM 6/5/2006, Bernard King-Smith wrote: >> Thomas Talpey said: >> At 11:38 AM 6/5/2006, hbchen wrote: >> >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth >utilization is still very > low. >> >>> IPoIB=420MB/sec >> >>> bandwidth utilization= 420/1024 = 41.01% >> >> >> Helen, have you measured the CPU utilizations during these runs? 
>> Perhaps you are out of CPU. >> >> Outrageous opinion follows. >> >> Frankly, an IB HCA running Ethernet emulation is approximately the >> world's worst 10GbE adapter (not to put too fine of a point on it :-) ) >> There is no hardware checksumming, nor large-send offloading, both >> of which force overhead onto software. And, as you just discovered >> it isn't even 10Gb! >> >> In general, network emulation layers are always going to perform more >> poorly than native implementations. But this is only a generality learned >> from years of experience with them >> >> Tom. > >Hold on here.... > >Who said anything about Ethernnet emulation. Hal said he is running >straight Netperf over IB not ethernet emulation. I don't think that any IB >HCAs today support offloaded checksum and large send. You are comparing >apples and oranges. The only appropriate comparison is to use the IBM HCA >compared to the mthca adapters. I think Hal's point is actually comparing >"any" IB adapter against GigE and Myrinet. Both the mthca and IBM HCA's >should get similar IPoIB performance using identical OpenIB stacks. > > >Bernie King-Smith >IBM Corporation >Server Group >Cluster System Performance >wombat2 at us.ibm.com (845)433-8483 >Tie. 293-8483 or wombat2 on NOTES > >"We are not responsible for the world we are born into, only for the world >we leave when we die. >So we have to accept what has gone before us and work to change the only >thing we can, >-- The Future." 
William Shatner > > > > openib-general-re > quest at openib.org > Sent by: To > openib-general-bo openib-general at openib.org > unces at openib.org cc > > Subject > 06/05/2006 12:11 openib-general Digest, Vol 24, > PM Issue 22 > > > Please respond to > openib-general at op > enib.org > > > > > > >Send openib-general mailing list submissions to > openib-general at openib.org > >To subscribe or unsubscribe via the World Wide Web, visit > http://openib.org/mailman/listinfo/openib-general >or, via email, send a message with subject or body 'help' to > openib-general-request at openib.org > >You can reach the person managing the list at > openib-general-owner at openib.org > >When replying, please edit your Subject line so it is more specific >than "Re: Contents of openib-general digest..." >Today's Topics: > > 1. Re: Question about the IPoIB bandwidth performance ? >(hbchen) > 2. Re: [PATCH] osm: trivial missing header files fix (Hal Rosenstock) > 3. Re: [PATCH] osm: trivial missing cast in osmt_service call > for memcmp (Hal Rosenstock) > 4. Re: Question about the IPoIB bandwidth performance ? > (Bernard King-Smith) > 5. Re: Re: [PATCH]Repost: IPoIB skb panic (Shirley Ma) > 6. Re: [PATCHv2 1/2] resend: mthca support for >max_map_per_fmr > device attribute (Roland Dreier) > 7. Re: Question about the IPoIB bandwidth performance ? > (Talpey, Thomas) > 8. Re: Question about the IPoIB bandwidth performance ? (hbchen) > >----- Message from "hbchen" on Mon, 05 Jun 2006 09:38:24 >-0600 ----- > > To: "Hal Rosenstock" > > cc: "OPENIB" > > Subject: Re: [openib-general] Question about the IPoIB bandwidth > performance ? > > >Hal Rosenstock wrote: > On Mon, 2006-06-05 at 11:12, hbchen wrote: > > Hi, > I have a question about the IPoIB bandwidth performance. > I did netperf testing using Single GiGE, Myrinet D card, > Myrinet 10G > ethernet card, > and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). 
> > > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth > utilization > (IPoNIC/LB) > --------------------- ---------------- -------------- > ---------------------------------- > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PCI-X > interface) > Myrinet D card : 250MB/sec 240~245MB/sec 96% ~ 98% (PCI-X > interface) > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My > testing > using Linux 2.6.14.6) > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My > testing > using Linux 2.6.14.6) > 474MB/sec 37% (the best from OpenIB mailing list) > (2.6.12-rc5 patch 1) > > Why is the bandwidth utilization of IPoIB so low compared to > the other > NICs? > > > One thing to note is that the max utilization of 10G IB (4x) is 8G > due > to the signalling being included in this rate (unlike ethernet whose > rate represents the data rate and does not include the signalling > overhead). > >Hal, >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth >utilization is still very low. >>> IPoIB=420MB/sec >>> bandwidth utilization= 420/1024 = 41.01% > > >HB > > > > > -- Hal > > > There must be a lot of room to improve the IPoIB software to > reach 75%+ > bandwidth utilization. > > > HB Chen > Los Alamos National Lab > hbchen at lanl.gov > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > > >----- Message from "Hal Rosenstock" on 05 Jun 2006 >11:34:50 -0400 ----- > > To: "Eitan Zahavi" > > cc: "OPENIB" > > Subject: [openib-general] Re: [PATCH] osm: trivial missing header files > fix > > >On Mon, 2006-06-05 at 08:51, Eitan Zahavi wrote: >> Hi Hal >> >> Cleaning up compilation warnings I found some missing includes in >> various sources. 
>> >> Eitan >> >> Signed-off-by: Eitan Zahavi > >Thanks. Applied to trunk only. > >-- Hal > > > >----- Message from "Hal Rosenstock" on 05 Jun 2006 >11:45:28 -0400 ----- > > To: "Eitan Zahavi" > > cc: "OPENIB" > > Subject: [openib-general] Re: [PATCH] osm: trivial missing cast in > osmt_service call for memcmp > > >Hi Eitan, > >On Mon, 2006-06-05 at 08:59, Eitan Zahavi wrote: >> Hi Hal >> >> In the last of my compilation warning cleanups I found a missing >> cast in the osmtest service name compare. >> >> Eitan >> >> Signed-off-by: Eitan Zahavi > >Thanks. Applied to trunk only. > >-- Hal > > > >----- Message from "Bernard King-Smith" on Mon, 5 Jun >2006 11:54:42 -0400 ----- > > To: openib-general at openib.org > > Subject: Re: [openib-general] Question about the IPoIB bandwidth > performance ? > > >Hal Rosenstock wrote: > >> On Mon, 2006-06-05 at 11:12, hbchen wrote: >> > Hi, >> > I have a question about the IPoIB bandwidth performance. >> > I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G >> > ethernet card, >> > and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). >> > >> > >> > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization >> > (IPoNIC/LB) >> > --------------------- ---------------- -------------- >> > ---------------------------------- >> > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PCI-X interface) >> > Myrinet D card : 250MB/sec 240~245MB/sec 96% ~ 98% (PCI-X interface) >> > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing >> > > using Linux 2.6.14.6) >> > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) >> > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing >> > using Linux 2.6.14.6) >> > 474MB/sec 37% (the best from OpenIB mailing list) >> > (2.6.12-rc5 patch 1) >> > >> > Why is the bandwidth utilization of IPoIB so low compared to the other >> > NICs? 
>> >> One thing to note is that the max utilization of 10G IB (4x) is 8G due >> to the signalling being included in this rate (unlike ethernet whose >> rate represents the data rate and does not include the signalling >> overhead). >> >> -- Hal >> > >You also have larger IP packets when you use GigE ( especially in large >send/offload ) and Myrinet. I think Myrinet uses a 60K MTU and for GigE, >without large send you get a 9000 MTU. With large send you get a 64K buffer >to the adapter so fragmentation to 1500/9000 IP packets is offloaded in the >adapter. > >Currently with IPoIB using UD mode, you have to generate lots of 2K >packets. With serialized IPoIB drivers you end up bottlenecking on a single >CPU. There is an IPoIB-CM IETF spec out which should significantly improve >IPoIB performance if implemented. > >> > There must be a lot of room to improve the IPoIB software to reach 75%+ >> > bandwidth utilization. >> > >> > >> > HB Chen >> > Los Alamos National Lab >> > hbchen at lanl.gov >> > >> > _______________________________________________ >> > openib-general mailing list >> > openib-general at openib.org >> > http://openib.org/mailman/listinfo/openib-general >> > >> > To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general >> > > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > > >Bernie King-Smith >IBM Corporation >Server Group >Cluster System Performance >wombat2 at us.ibm.com (845)433-8483 >Tie. 293-8483 or wombat2 on NOTES > >"We are not responsible for the world we are born into, only for the world >we leave when we die. >So we have to accept what has gone before us and work to change the only >thing we can, >-- The Future." William Shatner > > > > >----- Message from "Shirley Ma" on Mon, 5 Jun 2006 >09:02:36 -0700 ----- > > To: "Michael S. 
Tsirkin" > > cc: "Roland Dreier" , mashirle at us.ibm.com, > openib-general at openib.org > > Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic > > >Michael, > >I will apply this patch. This patch would reduce the race, not address the >problem. > >Thanks >Shirley Ma >IBM Linux Technology Center >15300 SW Koll Parkway >Beaverton, OR 97006-6063 >Phone(Fax): (503) 578-7638 >----- Message from "Roland Dreier" on Mon, 05 Jun 2006 >09:01:14 -0700 ----- > > To: "Or Gerlitz" > > cc: openib-general at openib.org > > Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for > max_map_per_fmr device attribute > > > > Yes it makes sense, but the check should be > > > > if (!(dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)) > > > > instead of > > > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) > >Yep, you're right, I got it backwards. > > > Also, what about the other patch which changes fmr_pool.c to query the > > device, have you got (reviewed/accepted) it? I have modified it to > > allocate the device attr struct on the heap as you have asked. > >It looks fine. I was just reviewing everything together. > > - R. > > >----- Message from "Talpey, Thomas" on Mon, 05 >Jun 2006 11:52:03 -0400 ----- > > To: "hbchen" > > cc: openib-general at openib.org > > Subject: Re: [openib-general] Question about the IPoIB bandwidth > performance ? > > >At 11:38 AM 6/5/2006, hbchen wrote: >>Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth >utilization is still very low. >>>> IPoIB=420MB/sec >>>> bandwidth utilization= 420/1024 = 41.01% > > >Helen, have you measured the CPU utilizations during these runs? >Perhaps you are out of CPU. > >Outrageous opinion follows. > >Frankly, an IB HCA running Ethernet emulation is approximately the >world's worst 10GbE adapter (not to put too fine of a point on it :-) ) >There is no hardware checksumming, nor large-send offloading, both >of which force overhead onto software. 
And, as you just discovered >it isn't even 10Gb! > >In general, network emulation layers are always going to perform more >poorly than native implementations. But this is only a generality learned >from years of experience with them. > >Tom. > > > >----- Message from "hbchen" on Mon, 05 Jun 2006 10:11:30 >-0600 ----- > > To: "Talpey, Thomas" > > cc: openib-general at openib.org > > Subject: Re: [openib-general] Question about the IPoIB bandwidth > performance ? > > >Talpey, Thomas wrote: > At 11:38 AM 6/5/2006, hbchen wrote: > > Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB > bandwidth utilization is still very low. > > IPoIB=420MB/sec > bandwidth utilization= 420/1024 = 41.01% > > > > Helen, have you measured the CPU utilizations during these runs? > Perhaps you are out of CPU. > > >Tom, >I am HB Chen from LANL, not the Helen Chen from SNL. >I didn't run out of CPU. It is about 70-80% of CPU utilization. > > Outrageous opinion follows. > > Frankly, an IB HCA running Ethernet emulation is approximately the > world's worst 10GbE adapter (not to put too fine of a point on it :-) > ) > >The IP over Myrinet (Ethernet emulation) can reach up to 96%-98% bandwidth >utilization; why not IPoIB? > >HB Chen >hbchen at lanl.gov > There is no hardware checksumming, nor large-send offloading, both > of which force overhead onto software. And, as you just discovered > it isn't even 10Gb! > > In general, network emulation layers are always going to perform more > poorly than native implementations. But this is only a generality > learned > from years of experience with them. > > Tom. 
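The rate arithmetic being traded back and forth in this thread is easy to sanity-check. The sketch below follows the thread's own conventions (IB gigabits converted at 128 MB per Gb, and an 8b/10b encoding factor for 4X SDR links, which is why Hal's 10 Gb/s becomes 8 Gb/s of data); the function name is illustrative, not from any OpenIB tool.

```python
# Sanity check of the utilization figures quoted above.
# 4X SDR InfiniBand signals at 10 Gb/s, but 8b/10b encoding means only
# 8 of every 10 bits on the wire carry data, so the usable rate is 8 Gb/s.
MB_PER_GB = 128          # conversion used in the table above (8 Gb/s = 1024 MB/s)
ENCODING_8B10B = 0.8     # data bits per signalled bit on SDR links

def ipoib_utilization(measured_mb_s, signal_rate_gb_s=10):
    """Percent of the post-encoding data rate that IPoIB actually delivers."""
    data_rate_mb_s = signal_rate_gb_s * ENCODING_8B10B * MB_PER_GB
    return 100.0 * measured_mb_s / data_rate_mb_s

# 420 MB/s measured over a nominal 10 Gb/s (8 Gb/s data) 4X link:
print(round(ipoib_utilization(420), 2))   # ~41.02, matching the ~41% above
```

The same function applied to the 474 MB/s "best from the OpenIB mailing list" figure gives about 46% of the post-encoding rate, still well short of the 96%+ that GigE and Myrinet D reach in the table.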
> > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Mon Jun 5 10:16:13 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 10:16:13 -0700 Subject: [openib-general] Re: [PATCHv2 2/2] resend: port the fmr pool to use the max_map_per_fmr device attribute In-Reply-To: (Or Gerlitz's message of "Tue, 30 May 2006 09:23:41 +0300 (IDT)") References: Message-ID: Thanks, applied both patches. From rdreier at cisco.com Mon Jun 5 10:21:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 10:21:33 -0700 Subject: [openib-general] [git pull] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This has a one-line bug fix: Eli Cohen: IPoIB: Fix AH leak at interface down drivers/infiniband/ulp/ipoib/ipoib_ib.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index a54da42..8406839 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -275,6 +275,7 @@ static void ipoib_ib_handle_wc(struct ne spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (netif_queue_stopped(dev) && + test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) && priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags); From somenath at veritas.com Mon Jun 5 10:50:55 2006 From: somenath at veritas.com (somenath) Date: Mon, 05 Jun 2006 10:50:55 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060605081948.044849d0@netapp.com> References: 
<7.0.1.0.2.20060605081948.044849d0@netapp.com> Message-ID: <44846EFF.2020705@veritas.com> Talpey, Thomas wrote: >At 10:03 AM 6/3/2006, Rimmer, Todd wrote: > > >>>Yes, the limit of outstanding RDMAs is not related to the send queue >>>depth. Of course you can post many more than 4 RDMAs to a send queue >>>-- the HCA just won't have more than 4 requests outstanding at a time. >>> >>> >>To further clarify, this parameter only affects the number of concurrent >>outstanding RDMA Reads which the HCA will process. Once it hits this >>limit, the send Q will stall waiting for issued reads to complete prior >>to initiating new reads. >> >> > >It's worse than that - the send queue must stall for *all* operations. >Otherwise the hardware has to track in-progress operations which are >queued after stalled ones. It really breaks the initiation model. > > The possibility of stalling is scary! Is there any way one can figure out: 1. the number of outstanding sends at a given point of time in the send Q? 2. the maximum number of outstanding sends ever posted (during the lifetime of the Q)? It's possible to measure those in ULPs, but then that may not match exactly what is seen in the real Q... so, is there any low-level tool to measure this? thanks, som. >Semantically, the provider is not required to provide any such flow control >behavior by the way. The Mellanox one apparently does, but it is not >a requirement of the verbs, it's a requirement on the upper layer. If more >RDMA Reads are posted than the remote peer supports, the connection >may break. > > > >>The number of outstanding RDMA Reads is negotiated by the CM during >>connection establishment and the QP which is sending the RDMA Read must >>have a value configured for this parameter which is <= the remote end's >>capability. >> >> > >In other words, we're probably stuck at 4. :-) I don't think there is any >Mellanox-based implementation that has ever supported > 4. 
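The head-of-line stall Todd and Tom describe can be modeled in a few lines. The sketch below is a toy model, not verbs code (the class and method names are made up): it caps in-flight RDMA Reads at the negotiated limit of 4 and shows that a later send stalls behind queued reads even though the send itself needs no read resources.

```python
from collections import deque

class SendQueue:
    """Toy model of an in-order send queue with a cap on in-flight RDMA Reads."""

    def __init__(self, max_rd_atomic=4):
        self.max_rd_atomic = max_rd_atomic
        self.inflight_reads = 0
        self.pending = deque()   # posted but not yet issued to the wire
        self.issued = []         # work requests issued, in posting order

    def post(self, wr):
        """ULP posts a work request ('rdma_read' or 'send')."""
        self.pending.append(wr)
        self._issue()

    def complete_read(self):
        """A read response came back, freeing one read slot."""
        self.inflight_reads -= 1
        self._issue()

    def _issue(self):
        # The queue is processed strictly in order, so a stalled read at
        # the head blocks *everything* behind it -- the behavior Tom notes.
        while self.pending:
            wr = self.pending[0]
            if wr == "rdma_read":
                if self.inflight_reads >= self.max_rd_atomic:
                    return
                self.inflight_reads += 1
            self.issued.append(self.pending.popleft())

q = SendQueue(max_rd_atomic=4)
for _ in range(6):
    q.post("rdma_read")
q.post("send")                     # queued behind the 5th and 6th reads
assert len(q.issued) == 4          # only 4 reads made it to the wire
assert q.pending[-1] == "send"     # the send is stuck, needing no read slot
q.complete_read()                  # one read completes -> the 5th read issues
```

Counting this way in a ULP (increment at post, decrement at completion) is exactly the approximation som describes; it tracks what was handed to the HCA, not what the HCA has actually issued, which is why it can differ from the real queue state.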
> > > >>In previous testing by Mellanox on SDR HCAs they indicated values beyond >>2-4 did not improve performance (and in fact required more RDMA >>resources be allocated for the corresponding QP or HCA). Hence I >>suspect a very large value like 128 would offer no improvement over >>values in the 2-8 range. >> >> > >I am not so sure of that. For one thing, it's dependent on VERY small >latencies. The presence of a switch, or link extenders will make a huge >difference. Second, heavy multi-QP firmware loads will increase the >latencies. Third, constants are pretty much never a good idea in >networking. > >The NFS/RDMA client tries to set the maximum IRD value it can obtain. >RDMA Read is used quite heavily by the server to fetch client data >segments for NFS writes. > >Tom. > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From parks at lanl.gov Mon Jun 5 11:16:31 2006 From: parks at lanl.gov (Parks Fields) Date: Mon, 05 Jun 2006 12:16:31 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <7.0.1.0.2.20060605130607.086feab0@netapp.com> References: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> <7.0.1.0.2.20060605130607.086feab0@netapp.com> Message-ID: <7.0.1.0.2.20060605120638.025f6270@lanl.gov> > >I consider IPoIB to be Ethernet emulation. > >As for apples and oranges, my point exactly. It is not really about comparisons. Here at LANL we have an environment where all our new Clusters have to mount our global parallel file system Panasas. It is ethernet and will be for a while. Cluster interconnect is IB and the compute nodes do NOT have ethernet, so we created i-o nodes to "bridge " IB to ethernet. 
Compute node----IB---i/o node---10gig---ethernet switch ---- panasas We like to match / balance the network bandwidth to storage bandwidth plus try to achieve 1GB/sec per TF of the machine. EX: 50TF machine = 50 GB/sec of storage bandwidth needed. So if IPoIB would give us ~700 MB/sec and came out the other side with 10gigE at ~800 that would be nice. Hope this helps. We are now trying to find out if SDP will work end-to-end. thanks parks ***** Correspondence ***** This email contains no programmatic content that requires independent ADC review From swise at opengridcomputing.com Mon Jun 5 11:18:17 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 05 Jun 2006 13:18:17 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: <1149285832.11187.33.camel@stevo-desktop> Message-ID: <1149531497.2766.12.camel@stevo-desktop> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > Hi Steve, > We are trying the new iwarp branch on ammasso adapters. The installation > has gone fine. However, on running rping there is an error during > the disconnect phase. > > $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > driver search path: /usr/local/lib/infiniband > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > driver search path: /usr/local/lib/infiniband > ping data: rdm > ping data: rdm > ping data: rdm > ping data: rdm > cq completion failed status 5 > DISCONNECT EVENT... > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > Aborted > > There are no apparent errors showing up in dmesg. Is this error > currently expected? > > Thanks, > --Sundeep. > The cq completion failure is expected (rping doesn't try to gracefully close down). But the glibc error sounds like a bug. Can you try this on an IB transport? Also, why are you getting "no driver found" errors for uverbs0 and uverbs1? Are these amso devices? 
Boyd, can you please try and reproduce this here? Steve. From swise at opengridcomputing.com Mon Jun 5 11:32:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 05 Jun 2006 13:32:53 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: <1149285832.11187.33.camel@stevo-desktop> Message-ID: <1149532374.15071.1.camel@stevo-desktop> By the way, I assume you configured, rebuilt and reinstalled libibverbs, librdmacm, and libamso? I do not see this on my systems using a 2.6.16.5 kernel on a SUSE 9.2 distro. What distro/kernel versions? Thanx, Steve. On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > Hi Steve, > We are trying the new iwarp branch on ammasso adapters. The installation > has gone fine. However, on running rping there is an error during > the disconnect phase. > > $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > driver search path: /usr/local/lib/infiniband > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > driver search path: /usr/local/lib/infiniband > ping data: rdm > ping data: rdm > ping data: rdm > ping data: rdm > cq completion failed status 5 > DISCONNECT EVENT... > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > Aborted > > There are no apparent errors showing up in dmesg. Is this error > currently expected? > > Thanks, > --Sundeep. > > On Fri, 2 Jun 2006, Steve Wise wrote: > > > Hello, > > > > The gen2 iwarp branch has been merged up to the main trunk revision > > 7626. The iwarp branch can be found at gen2/branches/iwarp and > > contains the Ammasso 1100 and Chelsio T3 drivers and user libs. > > > > If you are working on iwarp, please test out this new branch and lemme > > know if there are any problems. > > > > > > Thanks, > > > > Steve. 
> > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From Thomas.Talpey at netapp.com Mon Jun 5 11:36:27 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 14:36:27 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <7.0.1.0.2.20060605120638.025f6270@lanl.gov> References: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> <7.0.1.0.2.20060605130607.086feab0@netapp.com> <7.0.1.0.2.20060605120638.025f6270@lanl.gov> Message-ID: <7.0.1.0.2.20060605143006.086feab0@netapp.com> Thanks Parks, this is a very interesting perspective. I will avoid going into my rant about edge devices for now, however. :-) I am not sure what you mean about using SDP "end to end". I assume you would perhaps use SDP to these edge nodes, but this would require terminating the SDP connection and re-issuing the stream over TCP to the Panasas box, wouldn't it? Would this bridging be done in-kernel, like your IPoIB/Ethernet solution today, or would you implement a daemon? It will be a difficult challenge, I predict. Tom. At 02:16 PM 6/5/2006, Parks Fields wrote: > >> >>I consider IPoIB to be Ethernet emulation. >> >>As for apples and oranges, my point exactly. > > >It is not really about comparisons. Here at LANL we have an >environment where all our new Clusters have to mount our global >parallel file system Panasas. It is ethernet and will be for a while. > >Cluster interconnect is IB and the compute nodes do NOT have >ethernet, so we created i-o nodes to "bridge" IB to ethernet. > >Compute node----IB---i/o node---10gig---ethernet switch ---- panasas > >We like to match / balance the network bandwidth to storage >bandwidth plus try to achieve 1GB/sec per TF of the machine. 
EX: >50TF machine = 50 GB/sec of storage bandwidth needed. > >So if IPoIB would give us ~700 MB/sec and came out the other side >with 10gigE at ~800 that would be nice. >Hope this helps. We are now trying to find out if SDP will work end-to-end. > >thanks >parks > > > > ***** Correspondence ***** > >This email contains no programmatic content that requires independent >ADC review > > > From parks at lanl.gov Mon Jun 5 11:54:04 2006 From: parks at lanl.gov (Parks Fields) Date: Mon, 05 Jun 2006 12:54:04 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <7.0.1.0.2.20060605143006.086feab0@netapp.com> References: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> <7.0.1.0.2.20060605130607.086feab0@netapp.com> <7.0.1.0.2.20060605120638.025f6270@lanl.gov> <7.0.1.0.2.20060605143006.086feab0@netapp.com> Message-ID: <7.0.1.0.2.20060605124604.02601c00@lanl.gov> At 12:36 PM 6/5/2006, Talpey, Thomas wrote: >Thanks Parks, this is a very interesting perspective. >I will avoid going into my rant about edge devices for >now, however. :-) Cool, you can send it direct if you want. >I am not sure what you mean about using SDP "end to end". >I assume you would perhaps use SDP to these edge nodes, >but this would require terminating the SDP connection and >re-issuing the stream over TCP to the Panasas box, wouldn't it? Yes, it would probably have to work that way. Another problem would be that SDP is not routeable. >Would this bridging be done in-kernel, like your IPoIB/Ethernet >solution today, or would you implement a daemon? It will be >a difficult challenge, I predict. We are just starting to think about things like this, and trying to keep an open mind to all possibilities. We have no solutions to do this yet. There might be better ways. So you are correct; we haven't thought it all the way through and have no alternative plan other than IPoIB at the moment. My next step will be testing 4x-DDR IPoIB before doing anything else. 
parks ***** Correspondence ***** This email contains no programmatic content that requires independent ADC review From rkuchimanchi at silverstorm.com Mon Jun 5 11:56:52 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Tue, 06 Jun 2006 00:26:52 +0530 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: <1149171133.7588.45.camel@Prawra.gs-lab.com> References: <1149171133.7588.45.camel@Prawra.gs-lab.com> Message-ID: <44847E74.3000409@silverstorm.com> Hi Roland, Did you get a chance to look at the modified SRP patches that I sent last week ? Regards, Ram Ramachandra K wrote: > On Mon, 2006-05-29 at 10:07 -0700, Roland Dreier wrote: >> Overall seems OK. Some comments: > > I am resending the patch with the modifications you suggested. > >> > +#define SRP_REV10_IO_CLASS 0xFF00 >> > +#define SRP_REV16A_IO_CLASS 0x0100 >> >> I think these should be in an enum in <scsi/srp.h>, since they're >> generic constants from the SRP spec. >> > I have defined the IO class values as an enum in <scsi/srp.h>. I am > sending this as a separate patch. I am not sure if those changes > are to be submitted here, since srp.h is not in the Open Fabrics > code base. But both the patches have to be applied together for > the SRP code to compile. 
> > > Signed-off-by: Ramachandra K > > Index: infiniband/ulp/srp/ib_srp.c > =================================================================== > --- infiniband/ulp/srp/ib_srp.c (revision 7615) > +++ infiniband/ulp/srp/ib_srp.c (working copy) > @@ -321,8 +321,33 @@ > req->priv.req_it_iu_len = cpu_to_be32(srp_max_iu_len); > req->priv.req_buf_fmt = cpu_to_be16(SRP_BUF_FORMAT_DIRECT | > SRP_BUF_FORMAT_INDIRECT); > - memcpy(req->priv.initiator_port_id, target->srp_host->initiator_port_id, 16); > /* > + * Older targets conforming to Rev 10 of the SRP specification > + * use the port identifier format which is > + * > + * lower 8 bytes : GUID > + * upper 8 bytes : extension > + * > + * Where as according to the new SRP specification (Rev 16a), the > + * port identifier format is > + * > + * lower 8 bytes : extension > + * upper 8 bytes : GUID > + * > + * So check the IO class of the target to decide which format to use. > + */ > + > + /* If its Rev 10, flip the initiator port id fields */ > + if (target->io_class == SRP_REV10_IO_CLASS) { > + memcpy(req->priv.initiator_port_id, > + target->srp_host->initiator_port_id + 8 , 8); > + memcpy(req->priv.initiator_port_id + 8, > + target->srp_host->initiator_port_id, 8); > + } else { > + memcpy(req->priv.initiator_port_id, > + target->srp_host->initiator_port_id, 16); > + } > + /* > * Topspin/Cisco SRP targets will reject our login unless we > * zero out the first 8 bytes of our initiator port ID. 
The > * second 8 bytes must be our local node GUID, but we always > @@ -334,8 +359,13 @@ > (unsigned long long) be64_to_cpu(target->ioc_guid)); > memset(req->priv.initiator_port_id, 0, 8); > } > - memcpy(req->priv.target_port_id, &target->id_ext, 8); > - memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); > + if (target->io_class == SRP_REV10_IO_CLASS) { > + memcpy(req->priv.target_port_id, &target->ioc_guid, 8); > + memcpy(req->priv.target_port_id + 8, &target->id_ext, 8); > + } else { > + memcpy(req->priv.target_port_id, &target->id_ext, 8); > + memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); > + } > > status = ib_send_cm_req(target->cm_id, &req->param); > > @@ -1513,6 +1543,7 @@ > SRP_OPT_SERVICE_ID = 1 << 4, > SRP_OPT_MAX_SECT = 1 << 5, > SRP_OPT_MAX_CMD_PER_LUN = 1 << 6, > + SRP_OPT_IO_CLASS = 1 << 7, > SRP_OPT_ALL = (SRP_OPT_ID_EXT | > SRP_OPT_IOC_GUID | > SRP_OPT_DGID | > @@ -1528,6 +1559,7 @@ > { SRP_OPT_SERVICE_ID, "service_id=%s" }, > { SRP_OPT_MAX_SECT, "max_sect=%d" }, > { SRP_OPT_MAX_CMD_PER_LUN, "max_cmd_per_lun=%d" }, > + { SRP_OPT_IO_CLASS, "io_class=%x" }, > { SRP_OPT_ERR, NULL } > }; > > @@ -1611,7 +1643,19 @@ > } > target->scsi_host->cmd_per_lun = min(token, SRP_SQ_SIZE); > break; > - > + case SRP_OPT_IO_CLASS: > + if (match_hex(args, &token)) { > + printk(KERN_WARNING PFX "bad IO class parameter '%s' \n", p); > + goto out; > + } > + if (token == SRP_REV10_IO_CLASS || token == SRP_REV16A_IO_CLASS) > + target->io_class = token; > + else > + printk(KERN_WARNING PFX "unknown IO class parameter value" > + " %x specified. Use %x or %x. 
Defaulting to IO class %x\n", > + token, SRP_REV10_IO_CLASS, SRP_REV16A_IO_CLASS, > + SRP_REV16A_IO_CLASS); > + break; > default: > printk(KERN_WARNING PFX "unknown parameter or missing value " > "'%s' in target creation request\n", p); > @@ -1654,6 +1698,7 @@ > target = host_to_target(target_host); > memset(target, 0, sizeof *target); > > + target->io_class = SRP_REV16A_IO_CLASS; > target->scsi_host = target_host; > target->srp_host = host; > > Index: infiniband/ulp/srp/ib_srp.h > =================================================================== > --- infiniband/ulp/srp/ib_srp.h (revision 7615) > +++ infiniband/ulp/srp/ib_srp.h (working copy) > @@ -122,6 +122,7 @@ > __be64 id_ext; > __be64 ioc_guid; > __be64 service_id; > + __be16 io_class; > struct srp_host *srp_host; > struct Scsi_Host *scsi_host; > char target_name[32]; > > From narravul at cse.ohio-state.edu Mon Jun 5 11:50:47 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 5 Jun 2006 14:50:47 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149531497.2766.12.camel@stevo-desktop> Message-ID: > The cq completion failure is expected (rping doesn't try to gracefully > close down). But the glibc error sounds like a bug. OK. > Can you try this on an IB transport? Also, why are you getting "no > driver found" errors for uverbs0 and uverbs1? Are these amso devices? I will try this on the IB transport. I am not sure about the "no driver found" warnings. btw, only the amso devices are installed on the nodes. Is there some tool to check which device uverbs0 is connected to? --Sundeep. > > Boyd, can you please try and reproduce this here? > > Steve. 
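On the question of mapping uverbs nodes to adapters: kernels of this vintage export the underlying RDMA device name through sysfs as an `ibdev` attribute under /sys/class/infiniband_verbs/uverbsN/. That layout is an assumption here (check your kernel's sysfs tree); the short sketch below just walks it.

```python
import os

def uverbs_to_ibdev(base="/sys/class/infiniband_verbs"):
    """Map each uverbsN node to its RDMA device name (e.g. mthca0).

    Assumes the sysfs layout where every uverbs entry carries an
    `ibdev` attribute; returns an empty dict if the tree is absent.
    """
    mapping = {}
    if not os.path.isdir(base):
        return mapping
    for entry in sorted(os.listdir(base)):
        attr = os.path.join(base, entry, "ibdev")
        if os.path.isfile(attr):
            with open(attr) as f:
                mapping[entry] = f.read().strip()
    return mapping

if __name__ == "__main__":
    for uverbs, ibdev in uverbs_to_ibdev().items():
        print(uverbs, "->", ibdev)
```

If the amso devices show up here but libibverbs still warns "no userspace device-specific driver found", the userspace driver (libamso) is likely not in the driver search path the warning prints.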
> > From swise at opengridcomputing.com Mon Jun 5 12:01:44 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 05 Jun 2006 14:01:44 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: Message-ID: <1149534104.15071.4.camel@stevo-desktop> On Mon, 2006-06-05 at 14:50 -0400, Sundeep Narravula wrote: > > The cq completion failure is expected (rping doesn't try to gracefully > > close down). But the glibc error sounds like a bug. > > OK. > > > Can you try this on an IB transport? Also, why are you getting "no > > driver found" errors for uverbs0 and uverbs1? Are these amso devices? > > I will try this on the IB transport. > > I am not sure about the "no driver found" warnings. btw, only the amso > devices are installed on the nodes. Is there some tool to check which > device uverbs0 is connected to? > I'm not sure how to map these. But if you have an mthca adapter installed, and the libmthca driver isn't installed, you'll see these types of warnings. I'm guessing the glibc error is finding some rping bug. Maybe you have a later version of libc than my suse 9.2 distro? Stevo. From rdreier at cisco.com Mon Jun 5 12:01:27 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 12:01:27 -0700 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: <44847E74.3000409@silverstorm.com> (Ramachandra K.'s message of "Tue, 06 Jun 2006 00:26:52 +0530") References: <1149171133.7588.45.camel@Prawra.gs-lab.com> <44847E74.3000409@silverstorm.com> Message-ID: Ramachandra> Hi Roland, Did you get a chance to look at the Ramachandra> modified SRP patches that I sent last week ? Yes, I will fix them up and apply them. - R. 
From narravul at cse.ohio-state.edu Mon Jun 5 11:58:48 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 5 Jun 2006 14:58:48 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149532374.15071.1.camel@stevo-desktop> Message-ID: > By the way, I assume you configured, rebuilt and reinstalled libibverbs, > librdmacm, and libamso? Yes. I have done these. > > I do not see this on my systems using a 2.6.16.5 kernel on a SUSE 9.2 > distro. What distro/kernel verions? The kernel used is 2.6.16 on a RH-AS4. --Sundeep. > > Thanx, > > > Steve. > > > On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > > Hi Steve, > > We are trying the new iwarp branch on ammasso adapters. The installation > > has gone fine. However, on running rping there is a error during > > disconnect phase. > > > > $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > > driver search path: /usr/local/lib/infiniband > > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > > driver search path: /usr/local/lib/infiniband > > ping data: rdm > > ping data: rdm > > ping data: rdm > > ping data: rdm > > cq completion failed status 5 > > DISCONNECT EVENT... > > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > > Aborted > > > > There are no apparent errors showing up in dmesg. Is this error > > currently expected? > > > > Thanks, > > --Sundeep. > > > > On Fri, 2 Jun 2006, Steve Wise wrote: > > > > > Hello, > > > > > > The gen2 iwarp branch has been merged up to the main trunk revision > > > 7626. The iwarp branch can be found at gen2/branches/iwarp and > > > contains the Ammasso 1100 and Chelsio T3 drivers and user libs. > > > > > > If you are working on iwarp, please test out this new branch and lemme > > > know if there are any problems. > > > > > > > > > Thanks, > > > > > > Steve. 
> > > > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From tziporet at mellanox.co.il Mon Jun 5 12:09:52 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 5 Jun 2006 22:09:52 +0300 Subject: [openib-general] Fix some suspicious ppc64 code in dapl Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7122@mtlexch01.mtl.com> Hi James, Is it important to take this patch to the OFED release? Thanks Tziporet -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of James Lentini Sent: Monday, June 05, 2006 4:39 PM To: Anton Blanchard Cc: openib-general at openib.org Subject: Re: [openib-general] Fix some suspicious ppc64 code in dapl > Index: dapl/udapl/linux/dapl_osd.h > =================================================================== > --- dapl/udapl/linux/dapl_osd.h (revision 7621) > +++ dapl/udapl/linux/dapl_osd.h (working copy) > @@ -238,14 +238,13 @@ > #endif /* __ia64__ */ > #elif defined(__PPC64__) > __asm__ __volatile__ ( > - EIEIO_ON_SMP -"1: lwarx %0,0,%2 # __cmpxchg_u64\n\ - cmpd 0,%0,%3\n\ +" lwsync\n\ +1: lwarx %0,0,%2 # __cmpxchg_u32\n\ + cmpw 0,%0,%3\n\ bne- 2f\n\ stwcx. %4,0,%2\n\ - bne- 1b" - ISYNC_ON_SMP - "\n\ + bne- 1b\n\ + isync\n\ 2:" : "=&r" (current_value), "=m" (*v) : "r" (v), "r" (match_value), "r" (new_value), "m" (*v) Thank you Anton. Could you reply with a Signed-off-by line? I'll properly attribute this fix to you in the commit log. 
_______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Mon Jun 5 12:17:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 12:17:59 -0700 Subject: [openib-general] RFC: Add I/O class enum values to Message-ID: Does anyone have an objection to me merging the trivial patch below through my git tree? This will be used by the IB SRP initiator to work with SilverStorm targets, which still implement rev. 10 of the SRP spec. I could just make these values private to the IB initiator, but I figured that things directly from the SRP spec belong in rather than in a particular driver's private header. Thanks, Roland diff-tree a13ac0e9f99636a043d197f3349a67303ce4a701 (from bb61dd1fbf59f2291295986bed1f99b48f513fa4) Author: Ramachandra K Date: Mon Jun 5 12:13:52 2006 -0700 [SCSI] srp.h: Add I/O Class values Add enum values for I/O Class values from rev. 10 and rev. 16a SRP drafts. The values are used to detect targets that implement obsolete revisions of SRP, so that the initiator can use the old format for port identifier when connecting to them. 
Signed-off-by: Ramachandra K Signed-off-by: Roland Dreier diff --git a/include/scsi/srp.h b/include/scsi/srp.h index 637f77e..ad178fa 100644 --- a/include/scsi/srp.h +++ b/include/scsi/srp.h @@ -87,6 +87,11 @@ enum srp_login_rej_reason { SRP_LOGIN_REJ_CHANNEL_LIMIT_REACHED = 0x00010006 }; +enum { + SRP_REV10_IB_IO_CLASS = 0xff00, + SRP_REV16A_IB_IO_CLASS = 0x0100 +}; + struct srp_direct_buf { __be64 va; __be32 key; From narravul at cse.ohio-state.edu Mon Jun 5 12:23:53 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 5 Jun 2006 15:23:53 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149534104.15071.4.camel@stevo-desktop> Message-ID: > I'm not sure how to map these. But if you have mthca adapter installed, > and the libmthca driver isn't installed, you'll see these types of > warnings. This is the case. We have the adapters for mthca installed but not the drivers. > I'm guessing the glibc error is finding some rping bug. Maybe you have > a later version of libc than my suse 9.2 distro? The glibc version we are using is 2.3.4 --Sundeep. > > > Stevo. > > From James.Bottomley at SteelEye.com Mon Jun 5 12:40:57 2006 From: James.Bottomley at SteelEye.com (James Bottomley) Date: Mon, 05 Jun 2006 14:40:57 -0500 Subject: [openib-general] Re: RFC: Add I/O class enum values to In-Reply-To: References: Message-ID: <1149536457.3479.2.camel@mulgrave.il.steeleye.com> On Mon, 2006-06-05 at 12:17 -0700, Roland Dreier wrote: > Does anyone have an objection to me merging the trivial patch below > through my git tree? This will be used by the IB SRP initiator to > work with SilverStorm targets, which still implement rev. 10 of the > SRP spec. I could just make these values private to the IB initiator, > but I figured that things directly from the SRP spec belong in > rather than in a particular driver's private header. No objection here ... 
but if you do, it will entangle our git trees even more nastily, since the srp.h file is created in the scsi-misc-2.6 tree. James From rdreier at cisco.com Mon Jun 5 12:55:43 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 12:55:43 -0700 Subject: [openib-general] Re: RFC: Add I/O class enum values to In-Reply-To: <1149536457.3479.2.camel@mulgrave.il.steeleye.com> (James Bottomley's message of "Mon, 05 Jun 2006 14:40:57 -0500") References: <1149536457.3479.2.camel@mulgrave.il.steeleye.com> Message-ID: James> No objection here ... but if you do, it will entangle our James> git trees even more nastily, since the srp.h file is James> created in the scsi-misc-2.6 tree. No, I think we're OK. srp.h is already in Linus's tree (it went in as part of the original IB SRP initiator merge), and scsi-misc doesn't have any changes after ec448a0a36 (which is already upstream) in it. So putting the IO Class change in my tree actually reduces the dependency between our trees, since I can put the IB SRP changes in my tree without worrying about you merging the srp.h change first. - R. From swise at opengridcomputing.com Mon Jun 5 13:07:46 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 05 Jun 2006 15:07:46 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: Message-ID: <1149538066.15071.16.camel@stevo-desktop> On Mon, 2006-06-05 at 15:23 -0400, Sundeep Narravula wrote: > > I'm not sure how to map these. But if you have mthca adapter installed, > > and the libmthca driver isn't installed, you'll see these types of > > warnings. > > This is the case. We have the adapters for mthca installed but not the > drivers. > > > I'm guessing the glibc error is finding some rping bug. Maybe you have > > a later version of libc than my suse 9.2 distro? > > The glibc version we are using is 2.3.4 > My systems are 2.3.3-118 (that's the version in the rpm name). 
From faulkner at opengridcomputing.com Mon Jun 5 13:12:57 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Mon, 5 Jun 2006 15:12:57 -0500 Subject: [openib-general] Serialization in ib_uverbs Message-ID: <200606051512.57653.faulkner@opengridcomputing.com> I have a question about the intent of the mutex lock ib_uverbs_idr_mutex used in kernel interface from user libraries. It appears to be a lock on the idr linked lists but as a great many, if not all, of the ib_uverbs commands grab the mutex at the start of the function and hold it to the end, it acts to serialize all the library accesses to the kernel. Is this intended? If a driver, say, waits for all references to an object to be removed before the close completes, all accesses stop while that occurs and if the command to make that happen needs that mutex, everything stops. I have seen this in practice. Any insight would be appreciated. Thanks, Boyd -- Boyd R. Faulkner Open Grid Computing, Inc. Phone: 512-343-9196 x109 Fax: 512-343-5450 From rdreier at cisco.com Mon Jun 5 13:36:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 13:36:03 -0700 Subject: [openib-general] Serialization in ib_uverbs In-Reply-To: <200606051512.57653.faulkner@opengridcomputing.com> (Boyd R. Faulkner's message of "Mon, 5 Jun 2006 15:12:57 -0500") References: <200606051512.57653.faulkner@opengridcomputing.com> Message-ID: Boyd> I have a question about the intent of the mutex lock Boyd> ib_uverbs_idr_mutex used in kernel interface from user Boyd> libraries. It appears to be a lock on the idr linked lists Boyd> but as a great many, if not all, of the ib_uverbs commands Boyd> grab the mutex at the start of the function and hold it to Boyd> the end, it acts to serialize all the library accesses to Boyd> the kernel. Is this intended? Yes, when I first implemented things it made things a lot easier to serialize things. 
For example holding the mutex during the entire process of creating a QP prevents the associated CQs from being destroyed in the middle of the operation. It does seem to be a scalability problem for some devices/workloads, and Robert Walsh from qlogic was planning on looking at this (replacing the mutex with a reference counting scheme). I don't know if he's started on this or not. - R. From faulkner at opengridcomputing.com Mon Jun 5 13:38:34 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Mon, 5 Jun 2006 15:38:34 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149534104.15071.4.camel@stevo-desktop> References: <1149534104.15071.4.camel@stevo-desktop> Message-ID: <200606051538.35084.faulkner@opengridcomputing.com> On Mon June 5 2006 14:01, Steve Wise wrote: > On Mon, 2006-06-05 at 14:50 -0400, Sundeep Narravula wrote: > > > The cq completion failure is expected (rping doesn't try to gracefully > > > close down). But the glibc error sounds like a bug. > > > > OK. > > > > > Can you try this on an IB transport? Also, why are you getting "no > > > driver found" errors for uverbs0 and uverbs1? Are these amso devices? > > > > I will try this on the IB transport. > > > > I am not sure about the "no driver found" warnings. btw, only the amso > > devises are installed on the nodes. Is there some tool to check which > > device uverbs0 is connected to? > > I'm not sure how to map these. But if you have mthca adapter installed, > and the libmthca driver isn't installed, you'll see these types of > warnings. You will also get this warning on the latest CM if you have not updated the library to use ibv_driver_init vs. openib_driver_init. This drop for libamso happened last Friday, Jun 2. Check and see if you have that. > > I'm guessing the glibc error is finding some rping bug. Maybe you have > a later version of libc than my suse 9.2 distro? > > > Stevo. -- Boyd R. Faulkner Open Grid Computing, Inc. 
Phone: 512-343-9196 x109 Fax: 512-343-5450 From rjwalsh at pathscale.com Mon Jun 5 13:50:11 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 05 Jun 2006 13:50:11 -0700 Subject: [openib-general] Serialization in ib_uverbs In-Reply-To: References: <200606051512.57653.faulkner@opengridcomputing.com> Message-ID: <1149540611.15423.6.camel@hematite.internal.keyresearch.com> > It does seem to be a scalability problem for some devices/workloads, > and Robert Walsh from qlogic was planning on looking at this > (replacing the mutex with a reference counting scheme). I don't know > if he's started on this or not. Not yet, but I'll be starting this as soon as I'm done with some other release-related work. Probably in the next few days I'll be starting. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From manpreet at gmail.com Mon Jun 5 14:15:36 2006 From: manpreet at gmail.com (Manpreet Singh) Date: Mon, 5 Jun 2006 14:15:36 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: References: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> Message-ID: <67897d690606051415o3675207o549ce7e084d618b8@mail.gmail.com> We have seen this happen over an IB analyzer. Recompiling the mthca driver with a high value like 64 or 128 works around this problem. When the condition hits, the HCA receiving the 4+ RDMAs generates an invalid request error. Any ideas as to when this patch might enter the mainline sources? Thanks, Manpreet. On 6/2/06, Roland Dreier wrote: > > Manpreet> Mellanox HCA can handle has been configured at 4 > Manpreet> (mthca_main.c: default_profile: rdb_per_qp). 
And the > Manpreet> HCAs can support a much higher value (128 I think). > > Manpreet> Could we move this value higher or atleast make it > Manpreet> configurable? > > Leonid Arsh has a patch that I will integrate soon that makes this > configurable. > > However, I'm curious. Do you have a workload where this actually > makes a measurable difference? It seems that having 4 RDMA requests > outstanding on the wire should be enough to get things to pipeline > pretty well. > > If you haven't tested this, right now you can of course edit > mthca_main.c to change the default value and recompile. > > - R. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From faulkner at opengridcomputing.com Mon Jun 5 14:16:34 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Mon, 5 Jun 2006 16:16:34 -0500 Subject: [openib-general] Serialization in ib_uverbs In-Reply-To: References: <200606051512.57653.faulkner@opengridcomputing.com> Message-ID: <200606051616.34617.faulkner@opengridcomputing.com> On Mon June 5 2006 15:36, Roland Dreier wrote: > Boyd> I have a question about the intent of the mutex lock > Boyd> ib_uverbs_idr_mutex used in kernel interface from user > Boyd> libraries. It appears to be a lock on the idr linked lists > Boyd> but as a great many, if not all, of the ib_uverbs commands > Boyd> grab the mutex at the start of the function and hold it to > Boyd> the end, it acts to serialize all the library accesses to > Boyd> the kernel. Is this intended? > > Yes, when I first implemented things it made things a lot easier to > serialize things. For example holding the mutex during the entire > process of creating a QP prevents the associated CQs from being > destroyed in the middle of the operation. > > It does seem to be a scalability problem for some devices/workloads, > and Robert Walsh from qlogic was planning on looking at this > (replacing the mutex with a reference counting scheme). 
I don't know > if he's started on this or not. > > - R. It is in the pipe. Sweet. Thanks, Boyd -- Boyd R. Faulkner Open Grid Computing, Inc. Phone: 512-343-9196 x109 Fax: 512-343-5450 From sean.hefty at intel.com Mon Jun 5 17:05:25 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 5 Jun 2006 17:05:25 -0700 Subject: [openib-general] [PATCH 1/2] RDMA CM: allow user to set IB CM timeout and retries Message-ID: Allow users to override the default number of retries and timeout used by the RDMA CM when connecting over Infiniband. Some applications, like MPI, are unable to connect within the default timeout value when scaling up. Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_user_cm.h =================================================================== --- include/rdma/rdma_user_cm.h (revision 7619) +++ include/rdma/rdma_user_cm.h (working copy) @@ -203,6 +203,7 @@ enum { /* IB specific option names for get/set. */ enum { IB_PATH_OPTIONS = 1, + IB_CM_REQ_OPTIONS = 2, }; struct rdma_ucm_get_option_resp { Index: include/rdma/rdma_cm_ib.h =================================================================== --- include/rdma/rdma_cm_ib.h (revision 7619) +++ include/rdma/rdma_cm_ib.h (working copy) @@ -44,4 +44,26 @@ int rdma_set_ib_paths(struct rdma_cm_id *id, struct ib_sa_path_rec *path_rec, int num_paths); +struct ib_cm_req_opt { + u8 remote_cm_response_timeout; + u8 local_cm_response_timeout; + u8 max_cm_retries; +}; + +/** + * rdma_get_ib_req_info - Retrieves the current IB CM REQ / SIDR REQ values + * that will be used when connection, or performing service ID resolution. + * @id: Connection identifier associated with the request. + * @info: Current values for CM REQ messages. + */ +int rdma_get_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info); + +/** + * rdma_set_ib_req_info - Sets the current IB CM REQ / SIDR REQ values + * that will be used when connection, or performing service ID resolution. 
+ * @id: Connection identifier associated with the request. + * @info: New values for CM REQ messages. + */ +int rdma_set_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info); + #endif /* RDMA_CM_IB_H */ Index: core/ucma_ib.c =================================================================== --- core/ucma_ib.c (revision 7619) +++ core/ucma_ib.c (working copy) @@ -81,12 +81,37 @@ static int ucma_get_paths(struct rdma_cm return ret; } +static int ucma_get_req_opt(struct rdma_cm_id *id, void __user *opt, + int *optlen) +{ + struct ib_cm_req_opt req_opt; + int ret = 0; + + if (!opt) + goto out; + + if (*optlen < sizeof req_opt) { + ret = -ENOMEM; + goto out; + } + + ret = rdma_get_ib_req_info(id, &req_opt); + if (!ret) + if (copy_to_user(opt, &req_opt, sizeof req_opt)) + ret = -EFAULT; +out: + *optlen = sizeof req_opt; + return ret; +} + int ucma_get_ib_option(struct rdma_cm_id *id, int optname, void *optval, int *optlen) { switch (optname) { case IB_PATH_OPTIONS: return ucma_get_paths(id, optval, optlen); + case IB_CM_REQ_OPTIONS: + return ucma_get_req_opt(id, optval, optlen); default: return -EINVAL; } @@ -132,12 +157,27 @@ out: return ret; } +static int ucma_set_req_opt(struct rdma_cm_id *id, void __user *opt, int optlen) +{ + struct ib_cm_req_opt req_opt; + + if (optlen != sizeof req_opt) + return -EINVAL; + + if (copy_from_user(&req_opt, opt, sizeof req_opt)) + return -EFAULT; + + return rdma_set_ib_req_info(id, &req_opt); +} + int ucma_set_ib_option(struct rdma_cm_id *id, int optname, void *optval, int optlen) { switch (optname) { case IB_PATH_OPTIONS: return ucma_set_paths(id, optval, optlen); + case IB_CM_REQ_OPTIONS: + return ucma_set_req_opt(id, optval, optlen); default: return -EINVAL; } Index: core/cma.c =================================================================== --- core/cma.c (revision 7619) +++ core/cma.c (working copy) @@ -126,6 +126,10 @@ struct rdma_id_private { struct ib_cm_id *ib; } cm_id; + union { + struct ib_cm_req_opt *req; + 
} options; + u32 seq_num; u32 qp_num; enum ib_qp_type qp_type; @@ -710,6 +714,7 @@ void rdma_destroy_id(struct rdma_cm_id * wait_for_completion(&id_priv->comp); kfree(id_priv->id.route.path_rec); + kfree(id_priv->options.req); kfree(id_priv); } EXPORT_SYMBOL(rdma_destroy_id); @@ -1240,6 +1245,65 @@ err: } EXPORT_SYMBOL(rdma_set_ib_paths); +static inline u8 cma_get_ib_remote_timeout(struct rdma_id_private *id_priv) +{ + return id_priv->options.req ? + id_priv->options.req->remote_cm_response_timeout : + CMA_CM_RESPONSE_TIMEOUT; +} + +static inline u8 cma_get_ib_local_timeout(struct rdma_id_private *id_priv) +{ + return id_priv->options.req ? + id_priv->options.req->local_cm_response_timeout : + CMA_CM_RESPONSE_TIMEOUT; +} + +static inline u8 cma_get_ib_cm_retries(struct rdma_id_private *id_priv) +{ + return id_priv->options.req ? + id_priv->options.req->max_cm_retries : CMA_MAX_CM_RETRIES; +} + +int rdma_get_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info) +{ + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_ROUTE_RESOLVED)) + return -EINVAL; + + info->remote_cm_response_timeout = cma_get_ib_remote_timeout(id_priv); + info->local_cm_response_timeout = cma_get_ib_local_timeout(id_priv); + info->max_cm_retries = cma_get_ib_cm_retries(id_priv); + return 0; +} +EXPORT_SYMBOL(rdma_get_ib_req_info); + +int rdma_set_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info) +{ + struct rdma_id_private *id_priv; + + if (info->remote_cm_response_timeout > 0x1F || + info->local_cm_response_timeout > 0x1F || + info->max_cm_retries > 0xF) + return -EINVAL; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_ROUTE_RESOLVED)) + return -EINVAL; + + if (!id_priv->options.req) { + id_priv->options.req = kmalloc(sizeof *info, GFP_KERNEL); + if (!id_priv->options.req) + return -ENOMEM; + } + + *id_priv->options.req = *info; + return 0; +} 
+EXPORT_SYMBOL(rdma_set_ib_req_info); + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1646,9 +1710,9 @@ static int cma_connect_ib(struct rdma_id req.flow_control = conn_param->flow_control; req.retry_count = conn_param->retry_count; req.rnr_retry_count = conn_param->rnr_retry_count; - req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; - req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; - req.max_cm_retries = CMA_MAX_CM_RETRIES; + req.remote_cm_response_timeout = cma_get_ib_remote_timeout(id_priv); + req.local_cm_response_timeout = cma_get_ib_local_timeout(id_priv); + req.max_cm_retries = cma_get_ib_cm_retries(id_priv); req.srq = id_priv->srq ? 1 : 0; ret = ib_send_cm_req(id_priv->cm_id.ib, &req); @@ -1707,7 +1771,7 @@ static int cma_accept_ib(struct rdma_id_ rep.private_data_len = conn_param->private_data_len; rep.responder_resources = conn_param->responder_resources; rep.initiator_depth = conn_param->initiator_depth; - rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT; + rep.target_ack_delay = cma_get_ib_local_timeout(id_priv); rep.failover_accepted = 0; rep.flow_control = conn_param->flow_control; rep.rnr_retry_count = conn_param->rnr_retry_count; From sean.hefty at intel.com Mon Jun 5 17:11:49 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 5 Jun 2006 17:11:49 -0700 Subject: [openib-general] [PATCH 2/2] librdmacm: allow user to set IB CM timeout and retries Message-ID: Userspace support to allow overriding the default timeout and retry used by the RDMA CM when connecting over Infiniband. This patch moves the Infiniband specific options into their own header file. Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_cma_ib.h =================================================================== --- include/rdma/rdma_cma_ib.h (revision 0) +++ include/rdma/rdma_cma_ib.h (revision 0) @@ -0,0 +1,47 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. 
+ * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ + +#if !defined(RDMA_CMA_IB_H) +#define RDMA_CMA_IB_H + +#include + + +/* IB specific option names for get/set. */ +enum { + IB_PATH_OPTIONS = 1, /* struct ibv_kern_path_rec */ + IB_CM_REQ_OPTIONS = 2 /* struct ib_cm_req_opt */ +}; + +struct ib_cm_req_opt { + uint8_t remote_cm_response_timeout; + uint8_t local_cm_response_timeout; + uint8_t max_cm_retries; +}; + +#endif /* RDMA_CMA_IB_H */ Index: include/rdma/rdma_cma.h =================================================================== --- include/rdma/rdma_cma.h (revision 7636) +++ include/rdma/rdma_cma.h (working copy) @@ -60,11 +60,6 @@ enum { RDMA_PROTO_IB = 1, }; -/* IB specific option names for get/set. 
*/ -enum { - IB_PATH_OPTIONS = 1, -}; - struct ib_addr { union ibv_gid sgid; union ibv_gid dgid; Index: Makefile.am =================================================================== --- Makefile.am (revision 7636) +++ Makefile.am (working copy) @@ -27,10 +27,12 @@ examples_rping_LDADD = $(top_builddir)/s librdmacmincludedir = $(includedir)/rdma librdmacminclude_HEADERS = include/rdma/rdma_cma_abi.h \ - include/rdma/rdma_cma.h + include/rdma/rdma_cma.h \ + include/rdma/rdma_cma_ib.h EXTRA_DIST = include/rdma/rdma_cma_abi.h \ include/rdma/rdma_cma.h \ + include/rdma/rdma_cma_ib.h \ src/librdmacm.map \ librdmacm.spec.in From arlin.r.davis at intel.com Mon Jun 5 17:16:31 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 5 Jun 2006 17:16:31 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: James, Here is a patch to the openib-cma provider that uses the new set_option feature of the uCMA to adjust connect request timeout and retry values. The defaults are a little quick for some consumers. They are now bumped up from 3 retries to 15 and are tunable with uDAPL environment variables. Also, included a fix to disallow any event after a disconnect event. You need to sync up the commit with Sean's patch for the uCMA get/set IB_CM_REQ_OPTIONS. I would like to get this in OFED RC6 if possible. 
Thanks, -arlin Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 7694) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -264,7 +264,15 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_N /* set inline max with env or default, get local lid and gid 0 */ hca_ptr->ib_trans.max_inline_send = dapl_os_get_env_val("DAPL_MAX_INLINE", INLINE_SEND_DEFAULT); - + + /* set CM timer defaults */ + hca_ptr->ib_trans.max_cm_timeout = + dapl_os_get_env_val("DAPL_MAX_CM_RESPONSE_TIME", + IB_CM_RESPONSE_TIMEOUT); + hca_ptr->ib_trans.max_cm_retries = + dapl_os_get_env_val("DAPL_MAX_CM_RETRIES", + IB_CM_RETRIES); + /* EVD events without direct CQ channels, non-blocking */ hca_ptr->ib_trans.ib_cq = ibv_create_comp_channel(hca_ptr->ib_hca_handle); Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 7694) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -58,6 +58,7 @@ #include "dapl_ib_util.h" #include #include +#include extern struct rdma_event_channel *g_cm_events; @@ -85,7 +86,6 @@ static inline uint64_t cpu_to_be64(uint6 (unsigned short)((SID % IB_PORT_MOD) + IB_PORT_BASE) :\ (unsigned short)SID) - static void dapli_addr_resolve(struct dapl_cm_id *conn) { int ret; @@ -114,6 +114,8 @@ static void dapli_addr_resolve(struct da static void dapli_route_resolve(struct dapl_cm_id *conn) { int ret; + size_t optlen = sizeof(struct ib_cm_req_opt); + struct ib_cm_req_opt req_opt; #ifdef DAPL_DBG struct rdma_addr *ipaddr = &conn->cm_id->route.addr; struct ib_addr *ibaddr = &conn->cm_id->route.addr.addr.ibaddr; @@ -143,13 +145,43 @@ static void dapli_route_resolve(struct d cpu_to_be64(ibaddr->dgid.global.interface_id)); dapl_dbg_log(DAPL_DBG_TYPE_CM, - " rdma_connect: cm_id %p pdata %p plen %d rr %d ind %d\n", + " route_resolve: cm_id %p pdata %p plen %d 
rr %d ind %d\n", conn->cm_id, conn->params.private_data, conn->params.private_data_len, conn->params.responder_resources, conn->params.initiator_depth ); + /* Get default connect request timeout values, and adjust */ + ret = rdma_get_option(conn->cm_id, RDMA_PROTO_IB, IB_CM_REQ_OPTIONS, + (void*)&req_opt, &optlen); + if (ret) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_get_option failed: %s\n", + strerror(errno)); + goto bail; + } + + dapl_dbg_log(DAPL_DBG_TYPE_CM, " route_resolve: " + "Set CR times - response %d to %d, retry %d to %d\n", + req_opt.remote_cm_response_timeout, + conn->hca->ib_trans.max_cm_timeout, + req_opt.max_cm_retries, + conn->hca->ib_trans.max_cm_retries); + + /* Use hca response time setting for connect requests */ + req_opt.max_cm_retries = conn->hca->ib_trans.max_cm_retries; + req_opt.remote_cm_response_timeout = + conn->hca->ib_trans.max_cm_timeout; + req_opt.local_cm_response_timeout = + req_opt.remote_cm_response_timeout; + ret = rdma_set_option(conn->cm_id, RDMA_PROTO_IB, IB_CM_REQ_OPTIONS, + (void*)&req_opt, optlen); + if (ret) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_set_option failed: %s\n", + strerror(errno)); + goto bail; + } + ret = rdma_connect(conn->cm_id, &conn->params); if (ret) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_connect failed: %s\n", @@ -273,14 +305,37 @@ static void dapli_cm_active_cb(struct da } dapl_os_unlock(&conn->lock); + /* There is a chance that we can get events after + * the consumer calls disconnect in a pending state + * since the IB CM and uDAPL states are not shared. + * In some cases, IB CM could generate either a DCONN + * or CONN_ERR after the consumer returned from + * dapl_ep_disconnect with a DISCONNECTED event + * already queued. Check state here and bail to + * avoid any events after a disconnect. 
+ */ + if (DAPL_BAD_HANDLE(conn->ep, DAPL_MAGIC_EP)) + return; + + dapl_os_lock(&conn->ep->header.lock); + if (conn->ep->param.ep_state == DAT_EP_STATE_DISCONNECTED) { + dapl_os_unlock(&conn->ep->header.lock); + return; + } + if (event->event == RDMA_CM_EVENT_DISCONNECTED) + conn->ep->param.ep_state = DAT_EP_STATE_DISCONNECTED; + + dapl_os_unlock(&conn->ep->header.lock); + switch (event->event) { case RDMA_CM_EVENT_UNREACHABLE: case RDMA_CM_EVENT_CONNECT_ERROR: - dapl_dbg_log( - DAPL_DBG_TYPE_WARN, - " dapli_cm_active_handler: CONN_ERR " - " event=0x%x status=%d\n", - event->event, event->status); + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_active_handler: CONN_ERR " + " event=0x%x status=%d %s\n", + event->event, event->status, + (event->status == -110)?"TIMEOUT":"" ); dapl_evd_connection_callback(conn, IB_CME_DESTINATION_UNREACHABLE, @@ -368,25 +423,23 @@ static void dapli_cm_passive_cb(struct d event->private_data, new_conn->sp); break; case RDMA_CM_EVENT_UNREACHABLE: - dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, - NULL, conn->sp); - case RDMA_CM_EVENT_CONNECT_ERROR: dapl_dbg_log( - DAPL_DBG_TYPE_WARN, - " dapli_cm_passive: CONN_ERR " - " event=0x%x status=%d", - " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", - event->event, event->status, - ntohl(((struct sockaddr_in *) - &ipaddr->src_addr)->sin_addr.s_addr), - ntohs(((struct sockaddr_in *) - &ipaddr->src_addr)->sin_port), - ntohl(((struct sockaddr_in *) - &ipaddr->dst_addr)->sin_addr.s_addr), - ntohs(((struct sockaddr_in *) - &ipaddr->dst_addr)->sin_port)); + DAPL_DBG_TYPE_WARN, + " dapli_cm_passive: CONN_ERR " + " event=0x%x status=%d %s" + " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", + event->event, event->status, + (event->status == -110)?"TIMEOUT":"", + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_port), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + 
&ipaddr->dst_addr)->sin_port)); dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, NULL, conn->sp); Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 7694) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -67,8 +67,8 @@ typedef ib_hca_handle_t dapl_ibal_ca_t; #define IB_RC_RETRY_COUNT 7 #define IB_RNR_RETRY_COUNT 7 -#define IB_CM_RESPONSE_TIMEOUT 18 /* 1 sec */ -#define IB_MAX_CM_RETRIES 7 +#define IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */ +#define IB_CM_RETRIES 15 #define IB_REQ_MRA_TIMEOUT 27 /* a little over 9 minutes */ #define IB_MAX_AT_RETRY 3 #define IB_TARGET_MAX 4 /* max_qp_ous_rd_atom */ @@ -252,6 +252,8 @@ typedef struct _ib_hca_transport ib_async_cq_handler_t async_cq_error; ib_async_dto_handler_t async_cq; ib_async_qp_handler_t async_qp_error; + uint8_t max_cm_timeout; + uint8_t max_cm_retries; } ib_hca_transport_t; From rdreier at cisco.com Mon Jun 5 21:42:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 21:42:22 -0700 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: <1149171133.7588.45.camel@Prawra.gs-lab.com> (Ramchandra K.'s message of "Thu, 01 Jun 2006 19:42:13 +0530") References: <1149171133.7588.45.camel@Prawra.gs-lab.com> Message-ID: Thanks, I applied this. From k_mahesh85 at yahoo.co.in Mon Jun 5 21:51:43 2006 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Tue, 6 Jun 2006 05:51:43 +0100 (BST) Subject: [openib-general] repost-problem with memory registration-RDMA kernel utility Message-ID: <20060606045143.81301.qmail@web8327.mail.in.yahoo.com> Can anybody suggest the correct way to register a buffer for doing RDMA operations? I have already posted my code in the previous thread, but it is not working. 
It is a kernel utility and I have obtained the buffer using kmalloc; how can I register it in order to perform RDMA operations over it? -Mahesh -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Tue Jun 6 00:08:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 10:08:15 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <67897d690606051415o3675207o549ce7e084d618b8@mail.gmail.com> References: <67897d690606051415o3675207o549ce7e084d618b8@mail.gmail.com> Message-ID: <20060606070814.GA2432@mellanox.co.il> Quoting r. Manpreet Singh : > Subject: Re: Mellanox HCAs: outstanding RDMAs > > We have seen this happen over an IB analyzer. Recompiling the mthca driver with a high value like 64 or 128 works around this problem. > When the condition hits, the HCA receiving the 4+ RDMAs generates an invalid request error. Posting more read work requests than might be outstanding simultaneously on the wire is not an error. I think the fact you are getting an error means you are configuring max_rd_atomic/max_dest_rd_atomic on the local versus remote side incorrectly (these represent the Number of responder resources for RDMA Read/atomic ops and Number of Outstanding RDMA Read/atomic ops at destination, respectively). If so, this is a bug in the ULP; working around it by increasing the number of credits on both sides does not seem like the right thing to do. See 12.7.29 RESPONDER RESOURCES, and 12.7.30 INITIATOR DEPTH. -- MST From mst at mellanox.co.il Tue Jun 6 00:09:54 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 6 Jun 2006 10:09:54 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <44846EFF.2020705@veritas.com> References: <44846EFF.2020705@veritas.com> Message-ID: <20060606070954.GB2432@mellanox.co.il> Quoting r. somenath : > possibility of stalling is scary! You might want to review chapter 9.5 TRANSACTION ORDERING for info on when ordering rules will cause the IB QP to stall. -- MST From mst at mellanox.co.il Tue Jun 6 00:43:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 10:43:14 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060605081948.044849d0@netapp.com> References: <7.0.1.0.2.20060605081948.044849d0@netapp.com> Message-ID: <20060606074314.GC2432@mellanox.co.il> Quoting r. Talpey, Thomas : > Semantically, the provider is not required to provide any such flow control > behavior by the way. The Mellanox one apparently does, but it is not > a requirement of the verbs, it's a requirement on the upper layer. If more > RDMA Reads are posted than the remote peer supports, the connection > may break. This does not sound right. Isn't this the meaning of this field: "Initiator Depth: Number of RDMA Reads & atomic operations outstanding at any time"? Shouldn't any provider enforce this limit? -- MST From Thomas.Talpey at netapp.com Tue Jun 6 05:24:23 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 08:24:23 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <20060606074314.GC2432@mellanox.co.il> References: <7.0.1.0.2.20060605081948.044849d0@netapp.com> <20060606074314.GC2432@mellanox.co.il> Message-ID: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> At 03:43 AM 6/6/2006, Michael S. Tsirkin wrote: >Quoting r. Talpey, Thomas : >> Semantically, the provider is not required to provide any such flow control >> behavior by the way.
The Mellanox one apparently does, but it is not >> a requirement of the verbs, it's a requirement on the upper layer. If more >> RDMA Reads are posted than the remote peer supports, the connection >> may break. > >This does not sound right. Isn't this the meaning of this field: >"Initiator Depth: Number of RDMA Reads & atomic operations >outstanding at any time"? Shouldn't any provider enforce this limit? The core spec does not require it. An implementation *may* enforce it, but is not *required* to do so. And as pointed out in the other message, there are repercussions of doing so. I believe the silent queue stalling is a bit of a time bomb for upper layers, whose implementers are quite likely unaware of the danger. I greatly prefer an implementation which simply sends the RDMA Read request, resulting in a failed (but unblocked!) connection. Silence is a very dangerous thing, no matter how helpful the intent. Tom. From Thomas.Talpey at netapp.com Tue Jun 6 05:13:32 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 08:13:32 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <20060606070954.GB2432@mellanox.co.il> References: <44846EFF.2020705@veritas.com> <20060606070954.GB2432@mellanox.co.il> Message-ID: <7.0.1.0.2.20060606080728.086feab0@netapp.com> At 03:09 AM 6/6/2006, Michael S. Tsirkin wrote: >Quoting r. somenath : >> possibility of stalling is scary! > >You might want to review chapter 9.5 TRANSACTION ORDERING for info on when will >ordering rules cause the IB QP to stall. MST, are you disagreeing that RDMA Reads can stall the queue? Section 9.5, C9-25 lays it right out as the first requirement: >> C9-25: A requester shall transmit request messages in the order that the >> Work Queue Elements (WQEs) were posted. Therefore, a provider which implements flow control on RDMA Reads cannot transmit new sends until the prior RDMA Reads can be initiated. 
Of course, they may complete in a somewhat different order... It's all about flow control - which is not mandatory. It's a convenient, but very risky thing. Upper layers are often unaware of its ramifications. Tom. From mst at mellanox.co.il Tue Jun 6 05:44:26 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 15:44:26 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606080728.086feab0@netapp.com> References: <7.0.1.0.2.20060606080728.086feab0@netapp.com> Message-ID: <20060606124426.GH2432@mellanox.co.il> Quoting r. Talpey, Thomas : > Subject: Re: Mellanox HCAs: outstanding RDMAs > > At 03:09 AM 6/6/2006, Michael S. Tsirkin wrote: > >Quoting r. somenath : > >> possibility of stalling is scary! > > > >You might want to review chapter 9.5 TRANSACTION ORDERING for info on when > >will ordering rules cause the IB QP to stall. > > MST, are you disagreeing that RDMA Reads can stall the queue? I don't disagree with this of course. I was simply suggesting to ULP designers to read the chapter 9.5 and become aware of the rules, taking them into account at early stages of protocol design. -- MST From Thomas.Talpey at netapp.com Tue Jun 6 05:52:04 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 08:52:04 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <20060606124426.GH2432@mellanox.co.il> References: <7.0.1.0.2.20060606080728.086feab0@netapp.com> <20060606124426.GH2432@mellanox.co.il> Message-ID: <7.0.1.0.2.20060606084959.0469bcc8@netapp.com> At 08:44 AM 6/6/2006, Michael S. Tsirkin wrote: >> MST, are you disagreeing that RDMA Reads can stall the queue? > >I don't disagree with this of course. I was simply suggesting to ULP designers >to read the chapter 9.5 and become aware of the rules, taking them >into account at early stages of protocol design. :-) RTFM? I still think flow control is wrong and dangerous thing for RDMA Read. 
If it never happened, and the connections just failed, we'd never have the issue. Also, I'm certain we'll see upper layers that work on one provider, only to fail on another. Sigh. Tom. From mst at mellanox.co.il Tue Jun 6 05:56:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 15:56:34 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> Message-ID: <20060606125634.GI2432@mellanox.co.il> Quoting r. Talpey, Thomas : > Subject: Re: Mellanox HCAs: outstanding RDMAs > > At 03:43 AM 6/6/2006, Michael S. Tsirkin wrote: > >Quoting r. Talpey, Thomas : > >> Semantically, the provider is not required to provide any such flow control > >> behavior by the way. The Mellanox one apparently does, but it is not > >> a requirement of the verbs, it's a requirement on the upper layer. If more > >> RDMA Reads are posted than the remote peer supports, the connection > >> may break. > > > >This does not sound right. Isn't this the meaning of this field: > >"Initiator Depth: Number of RDMA Reads & atomic operations > >outstanding at any time"? Shouldn't any provider enforce this limit? > > The core spec does not require it. An implementation *may* enforce it, > but is not *required* to do so. And as pointed out in the other message, > there are repercussions of doing so. Interesting, I wasn't aware of such interpretation of the spec. When QP is modified to RTS, the initiator depth is passed to it, which suggests that the provider must obey, not ignore this parameter. No? 
-- MST From Thomas.Talpey at netapp.com Tue Jun 6 06:42:15 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 09:42:15 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <20060606125634.GI2432@mellanox.co.il> References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> <20060606125634.GI2432@mellanox.co.il> Message-ID: <7.0.1.0.2.20060606093959.086feab0@netapp.com> At 08:56 AM 6/6/2006, Michael S. Tsirkin wrote: >> The core spec does not require it. An implementation *may* enforce it, >> but is not *required* to do so. And as pointed out in the other message, >> there are repercussions of doing so. > >Interesting, I wasn't aware of such interpretation of the spec. >When QP is modified to RTS, the initiator depth is passed to it, which >suggests that the provider must obey, not ignore this parameter. No? This is the difference between "may" and "must". The value is provided, but I don't see anything in the spec that makes a requirement on its enforcement. Table 107 says the consumer can query it, that's about as close as it comes. There's some discussion about CM exchange too. Don't forget about iWARP, btw. Tom. From jlentini at netapp.com Tue Jun 6 06:44:51 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 6 Jun 2006 09:44:51 -0400 (EDT) Subject: [openib-general] Fix some suspicious ppc64 code in dapl In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7122@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7122@mtlexch01.mtl.com> Message-ID: On Mon, 5 Jun 2006, Tziporet Koren wrote: > Is it important to take this patch to the OFED release? 
It may fix http://openib.org/bugzilla/show_bug.cgi?id=48 From halr at voltaire.com Tue Jun 6 06:45:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jun 2006 09:45:00 -0400 Subject: [openib-general] [PATCH] [MINOR] OpenSM: Minor improvement to a couple of SA error paths Message-ID: <1149601493.4510.243499.camel@hal.voltaire.com> OpenSM: Minor improvement to a couple of SA error paths Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 7718) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -158,15 +158,6 @@ __osm_sa_slvl_create( OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_slvl_create ); - if (p_physp->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH) - { - lid = osm_physp_get_port_info_ptr( p_physp )->base_lid; - } - else - { - lid = osm_node_get_base_lid( p_physp->p_node, 0 ); - } - p_rec_item = (osm_slvl_item_t*)cl_qlock_pool_get( &p_rcv->pool ); if( p_rec_item == NULL ) { @@ -177,6 +168,15 @@ __osm_sa_slvl_create( goto Exit; } + if (p_physp->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH) + { + lid = osm_physp_get_port_info_ptr( p_physp )->base_lid; + } + else + { + lid = osm_node_get_base_lid( p_physp->p_node, 0 ); + } + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 7718) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -158,15 +158,6 @@ __osm_sa_vl_arb_create( OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_vl_arb_create ); - if (p_physp->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH) - { - lid = osm_physp_get_port_info_ptr( p_physp )->base_lid; - } - else - { - lid = osm_node_get_base_lid( p_physp->p_node, 0 ); - } - p_rec_item = (osm_vl_arb_item_t*)cl_qlock_pool_get( &p_rcv->pool ); if( p_rec_item == NULL ) { @@ -177,6 +168,15 @@ 
__osm_sa_vl_arb_create( goto Exit; } + if (p_physp->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH) + { + lid = osm_physp_get_port_info_ptr( p_physp )->base_lid; + } + else + { + lid = osm_node_get_base_lid( p_physp->p_node, 0 ); + } + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, From rdreier at cisco.com Tue Jun 6 07:40:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 07:40:26 -0700 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606093959.086feab0@netapp.com> (Thomas Talpey's message of "Tue, 06 Jun 2006 09:42:15 -0400") References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> <20060606125634.GI2432@mellanox.co.il> <7.0.1.0.2.20060606093959.086feab0@netapp.com> Message-ID: Thomas> This is the difference between "may" and "must". The value Thomas> is provided, but I don't see anything in the spec that Thomas> makes a requirement on its enforcement. Table 107 says the Thomas> consumer can query it, that's about as close as it Thomas> comes. There's some discussion about CM exchange too. This seems like a very strained interpretation of the spec. For example, there's no explicit language in the IB spec that requires an HCA to use the destination LID passed via a modify QP operation, but I don't think anyone would seriously argue that an implementation that sent messages to some other random destination was compliant. In the same way, if I pass a limit for the number of outstanding RDMA/atomic operations in to a modify QP operation, I would expect the HCA to use that limit. - R. 
From Thomas.Talpey at netapp.com Tue Jun 6 07:49:08 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 10:49:08 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> <20060606125634.GI2432@mellanox.co.il> <7.0.1.0.2.20060606093959.086feab0@netapp.com> Message-ID: <7.0.1.0.2.20060606104534.086feab0@netapp.com> At 10:40 AM 6/6/2006, Roland Dreier wrote: > Thomas> This is the difference between "may" and "must". The value > Thomas> is provided, but I don't see anything in the spec that > Thomas> makes a requirement on its enforcement. Table 107 says the > Thomas> consumer can query it, that's about as close as it > Thomas> comes. There's some discussion about CM exchange too. > >This seems like a very strained interpretation of the spec. For I don't see how strained has anything to do with it. It's not saying anything either way. So, a legal implementation can make either choice. We're talking about the spec! But, it really doesn't matter. The point is, an upper layer should be paying attention to the number of RDMA Reads it posts, or else suffer either the queue-stalling or connection-failing consequences. Bad stuff either way. Tom. >example, there's no explicit language in the IB spec that requires an >HCA to use the destination LID passed via a modify QP operation, but I >don't think anyone would seriously argue that an implementation that >sent messages to some other random destination was compliant. > >In the same way, if I pass a limit for the number of outstanding >RDMA/atomic operations in to a modify QP operation, I would expect the >HCA to use that limit. > > - R. 
From rdreier at cisco.com Tue Jun 6 08:00:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 08:00:16 -0700 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606104534.086feab0@netapp.com> (Thomas Talpey's message of "Tue, 06 Jun 2006 10:49:08 -0400") References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> <20060606125634.GI2432@mellanox.co.il> <7.0.1.0.2.20060606093959.086feab0@netapp.com> <7.0.1.0.2.20060606104534.086feab0@netapp.com> Message-ID: Thomas> I don't see how strained has anything to do with it. It's Thomas> not saying anything either way. So, a legal implementation Thomas> can make either choice. We're talking about the spec! I guess the reason I say it is strained is because the spec does have the following compliance statement for the modify QP verb: C11-8 Upon invocation of this Verb, the CI shall modify the attributes for the specified QP... So what should I expect to happen if I modify the number of outstanding RDMA Read/atomic operations? That the HCA will ignore that attribute? To me the only sensible interpretation of the spec is that setting a limit on outstanding operations will limit the number of outstanding operations. If the attribute doesn't do anything, then why would the spec include it? - R. From trimmer at silverstorm.com Tue Jun 6 09:43:23 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Tue, 6 Jun 2006 12:43:23 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs Message-ID: > Talpey, Thomas > Sent: Tuesday, June 06, 2006 10:49 AM > > At 10:40 AM 6/6/2006, Roland Dreier wrote: > > Thomas> This is the difference between "may" and "must". The value > > Thomas> is provided, but I don't see anything in the spec that > > Thomas> makes a requirement on its enforcement. Table 107 says the > > Thomas> consumer can query it, that's about as close as it > > Thomas> comes. There's some discussion about CM exchange too. 
> > > >This seems like a very strained interpretation of the spec. For > > I don't see how strained has anything to do with it. It's not saying > anything > either way. So, a legal implementation can make either choice. We're > talking about the spec! > > But, it really doesn't matter. The point is, an upper layer should be > paying > attention to the number of RDMA Reads it posts, or else suffer either the > queue-stalling or connection-failing consequences. Bad stuff either way. > > Tom. Somewhere beneath this discussion is a bug in the application or IB stack. I'm not sure which "may" in the spec you are referring to, but the "may"s I have found all are for cases where the responder might support only 1 outstanding request. In all cases the negotiation protocol must be followed and the requestor is not allowed to exceed the negotiated limit. The mechanism should be: client queries its local HCA and determines responder resources (eg. number of concurrent outstanding RDMA reads on the wire from the remote end where this end will respond with the read data) and initiator depth (eg. number of concurrent outstanding RDMA reads which this end can initiate as the requestor). client puts the above information in the CM REQ. server similarly gets its information from its local CA and negotiates down the values to the MIN of each side (REP.InitiatorDepth = MIN(REQ.ResponderResources, server's local CAs Initiator depth); REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CAs responder resources). If server does not support RDMA Reads, it can REJ. If client decided the negotiated values are insufficient to meet its goals, it can disconnect. Each side sets its QP parameters via modify QP appropriately. 
Note they too will be mirror images of each other: client: QP.Max RDMA Reads as Initiator = REP.ResponderResources QP.Max RDMA reads as responder = REP.InitiatorDepth server: QP.Max RDMA Reads as responder = REP.ResponderResources QP.Max RDMA reads as initiator = REP.InitiatorDepth We have done a lot of high stress RDMA Read traffic with Mellanox HCAs and provided the above negotiation is followed, we have seen no issues. Note however that by default a Mellanox HCA typically reports a large InitiatorDepth (128) and a modest ResponderResources (4-8). Hence when I hear that Responder Resources must be grown to 128 for some application to reliably work, it implies the negotiation I outlined above is not being followed. Note that the ordering rules in table 76 of IBTA 1.2 show how reads and writes on a send queue are ordered. There are many cases where an op can pass an outstanding RDMA read, hence it is not always bad to queue extra RDMA reads. If needed, the Fence can be sent to force order. For many apps, it's going to be better to get the items onto the queue and let the QP handle the outstanding reads cases rather than have the app add a level of queuing for this purpose. Letting the HCA do the queuing will allow for a more rapid initiation of subsequent reads. Todd Rimmer From sean.hefty at intel.com Tue Jun 6 09:55:11 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 09:55:11 -0700 Subject: [openib-general] multicast questions Message-ID: Does anyone know if the following multicast configurations have been tested? 1. Receiving messages on the same port that they were sent, but on a different QP. 2. Receiving messages on multiple QPs on the same port. - Sean From mst at mellanox.co.il Tue Jun 6 10:23:46 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 6 Jun 2006 20:23:46 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606104534.086feab0@netapp.com> References: <7.0.1.0.2.20060606104534.086feab0@netapp.com> Message-ID: <20060606172345.GB4397@mellanox.co.il> Quoting r. Talpey, Thomas : > But, it really doesn't matter. The point is, an upper layer should be paying > attention to the number of RDMA Reads it posts, or else suffer either the > queue-stalling or connection-failing consequences. Bad stuff either way. Queue-stalling is not necessarily bad, for example if the ULP needs to perform multiple RDMA reads anyway. You can use multiple QPs if you do not require ordering between operations. Connection-failing *is* bad stuff, IMO it might be compliant but it's clearly broken in the same way that a NIC that drops all packets might be compliant but is broken. -- MST From halr at voltaire.com Tue Jun 6 10:27:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jun 2006 13:27:07 -0400 Subject: [openib-general] [PATCH][MINOR] OpenSM: Fix inconsistent use of osm_log level Message-ID: <1149614823.4510.248559.camel@hal.voltaire.com> OpenSM: Fix inconsistent use of osm_log level Also, some other cosmetic changes Signed-off-by: Hal Rosenstock Index: opensm/osm_pkey_rcv.c =================================================================== --- opensm/osm_pkey_rcv.c (revision 7733) +++ opensm/osm_pkey_rcv.c (working copy) @@ -200,13 +200,10 @@ osm_pkey_rcv_process( */ if( !osm_physp_is_valid( p_physp ) ) { - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rcv_process: ERR 4807: " - "Got invalid port number 0x%X\n", - port_num ); - } + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_pkey_rcv_process: ERR 4807: " + "Got invalid port number 0x%X\n", + port_num ); goto Exit; } Index: opensm/osm_sa_guidinfo_record.c =================================================================== ---
opensm/osm_sa_guidinfo_record.c (revision 7733) +++ opensm/osm_sa_guidinfo_record.c (working copy) @@ -171,7 +171,7 @@ __osm_gir_rcv_new_gir( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_gir_rcv_new_gir: " "New GUIDInfoRecord: lid 0x%X, block num %d\n", cl_ntoh16( match_lid ), block_num ); @@ -220,7 +220,7 @@ __osm_sa_gir_create_gir( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_gir_create_gir: " "Looking for GUIDRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", cl_ntoh16( match_lid ), @@ -282,7 +282,7 @@ __osm_sa_gir_create_gir( */ if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_gir_create_gir: " "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", cl_ntoh16( base_lid_ho ), @@ -495,7 +495,7 @@ osm_gir_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_gir_rcv_process: " + "osm_gir_rcv_process: ERR 5103: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 7733) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -179,7 +179,7 @@ __osm_sa_vl_arb_create( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_vl_arb_create: " "New VLArbitration for: port 0x%016" PRIx64 ", lid 0x%X, port# 0x%X Block:%u\n", @@ -416,7 +416,7 @@ osm_vlarb_rec_rcv_process( else { /* port out of range */ osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: " + "osm_vlarb_rec_rcv_process: ERR 2A01: " "Given LID (%u) is out of range:%u\n", 
cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); } @@ -444,7 +444,7 @@ osm_vlarb_rec_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: " + "osm_vlarb_rec_rcv_process: ERR 2A08: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_sa_multipath_record.c =================================================================== --- opensm/osm_sa_multipath_record.c (revision 7733) +++ opensm/osm_sa_multipath_record.c (working copy) @@ -1281,7 +1281,8 @@ __osm_mpr_rcv_process_pairs( max_paths - total_paths, comp_mask, p_list ); total_paths += num_paths; - osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_mpr_rcv_process_pairs: " + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mpr_rcv_process_pairs: " "%d paths %d total paths %d max paths\n", num_paths, total_paths, max_paths ); /* Just take first NumbPaths found */ @@ -1468,7 +1469,8 @@ osm_mpr_rcv_process( if ( sa_status != IB_SA_MAD_STATUS_SUCCESS || !nsrc || !ndest ) { if ( sa_status == IB_SA_MAD_STATUS_SUCCESS && ( !nsrc || !ndest ) ) - osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_mpr_rcv_process_cb: ERR 4512: " + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_mpr_rcv_process_cb: ERR 4512: " "__osm_mpr_rcv_get_end_points failed, not enough GIDs " "(nsrc %d ndest %d)\n", nsrc, ndest); Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 7733) +++ opensm/osm_subnet.c (working copy) @@ -250,7 +250,7 @@ osm_get_gid_by_mad_addr( if ( p_gid == NULL ) { osm_log( p_log, OSM_LOG_ERROR, - "osm_get_gid_by_mad_addr: ERR 7505 " + "osm_get_gid_by_mad_addr: ERR 7505: " "Provided output GID is NULL\n"); return(IB_INVALID_PARAMETER); } @@ -281,7 +281,7 @@ osm_get_gid_by_mad_addr( { /* The dest_lid is not in the subnet table - this is an error */ osm_log( p_log, OSM_LOG_ERROR, - 
"osm_get_gid_by_mad_addr: ERR 7501 " + "osm_get_gid_by_mad_addr: ERR 7501: " "LID is out of range: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); @@ -316,7 +316,7 @@ osm_get_physp_by_mad_addr( { /* The port is not in the port_lid table - this is an error */ osm_log( p_log, OSM_LOG_ERROR, - "osm_get_physp_by_mad_addr: ERR 7502 " + "osm_get_physp_by_mad_addr: ERR 7502: " "Cannot locate port object by lid: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); @@ -329,7 +329,7 @@ osm_get_physp_by_mad_addr( { /* The dest_lid is not in the subnet table - this is an error */ osm_log( p_log, OSM_LOG_ERROR, - "osm_get_physp_by_mad_addr: ERR 7503 " + "osm_get_physp_by_mad_addr: ERR 7503: " "Lid is out of range: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); @@ -365,7 +365,7 @@ osm_get_port_by_mad_addr( { /* The dest_lid is not in the subnet table - this is an error */ osm_log( p_log, OSM_LOG_ERROR, - "osm_get_port_by_mad_addr: ERR 7504 " + "osm_get_port_by_mad_addr: ERR 7504: " "Lid is out of range: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); Index: opensm/osm_sa_lft_record.c =================================================================== --- opensm/osm_sa_lft_record.c (revision 7733) +++ opensm/osm_sa_lft_record.c (working copy) @@ -510,7 +510,7 @@ osm_lftr_rcv_process( { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "osm_lftr_rcv_process: ERR 4411: " - "osm_vendor_send. 
status = %s\n", + "osm_vendor_send status = %s\n", ib_get_err_str(status)); goto Exit; } Index: opensm/osm_pkey_rcv_ctrl.c =================================================================== --- opensm/osm_pkey_rcv_ctrl.c (revision 7733) +++ opensm/osm_pkey_rcv_ctrl.c (working copy) @@ -110,7 +110,7 @@ osm_pkey_rcv_ctrl_init( { osm_log( p_log, OSM_LOG_ERROR, "osm_pkey_rcv_ctrl_init: ERR 4901: " - "Dispatcher registration failed.\n" ); + "Dispatcher registration failed\n" ); status = IB_INSUFFICIENT_RESOURCES; goto Exit; } Index: opensm/osm_sa_service_record.c =================================================================== --- opensm/osm_sa_service_record.c (revision 7733) +++ opensm/osm_sa_service_record.c (working copy) @@ -1115,7 +1115,7 @@ osm_sr_rcv_process( default: osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "osm_sr_rcv_process: " - "Bad Method (%s)\n", ib_get_sa_method_str( p_sa_mad->method )); + "Bad Method (%s)\n", ib_get_sa_method_str( p_sa_mad->method ) ); osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status ); break; } Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 7733) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -168,7 +168,7 @@ __osm_pir_rcv_new_pir( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_pir_rcv_new_pir: " "New PortInfoRecord: port 0x%016" PRIx64 ", lid 0x%X, port# 0x%X\n", @@ -678,7 +678,7 @@ osm_pir_rcv_process( else { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: " + "osm_pir_rcv_process: ERR 2101: " "Given LID (%u) is out of range:%u\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); } @@ -694,7 +694,7 @@ osm_pir_rcv_process( else { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: " + "osm_pir_rcv_process: ERR 2103: " "Given LID (%u) is out of range:%u\n", cl_ntoh16(p_pi->base_lid), 
cl_ptr_vector_get_size(p_tbl)); } @@ -721,7 +721,7 @@ osm_pir_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: " + "osm_pir_rcv_process: ERR 2108: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, @@ -852,7 +852,7 @@ osm_pir_rcv_process( { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "osm_pir_rcv_process: ERR 2107: " - "osm_vendor_send. status = %s\n", + "osm_vendor_send status = %s\n", ib_get_err_str(status)); goto Exit; } Index: opensm/osm_sa_pkey_record.c =================================================================== --- opensm/osm_sa_pkey_record.c (revision 7733) +++ opensm/osm_sa_pkey_record.c (working copy) @@ -169,7 +169,7 @@ __osm_sa_pkey_create( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_pkey_create: " "New P_Key table for: port 0x%016" PRIx64 ", lid 0x%X, port# 0x%X Block:%u\n", @@ -432,7 +432,7 @@ osm_pkey_rec_rcv_process( else { /* port out of range */ osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: " + "osm_pkey_rec_rcv_process: ERR 4609: " "Given LID (%u) is out of range:%u\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); } @@ -460,7 +460,7 @@ osm_pkey_rec_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: " + "osm_pkey_rec_rcv_process: ERR 460A: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_inform.c =================================================================== --- opensm/osm_inform.c (revision 7733) +++ opensm/osm_inform.c (working copy) @@ -283,7 +283,7 @@ osm_infr_insert_to_db( "Inserting a new InformInfo Record into Database\n"); osm_log( p_log, OSM_LOG_DEBUG, "osm_infr_insert_to_db: " - "Dump 
before insertion (size : %d) : \n", + "Dump before insertion (size : %d)\n", cl_qlist_count(&p_subn->sa_infr_list) ); __dump_all_informs(p_subn, p_log); @@ -295,7 +295,7 @@ osm_infr_insert_to_db( osm_log( p_log, OSM_LOG_DEBUG, "osm_infr_insert_to_db: " - "Dump after insertion (size : %d) : \n", + "Dump after insertion (size : %d)\n", cl_qlist_count(&p_subn->sa_infr_list) ); __dump_all_informs(p_subn, p_log); OSM_LOG_EXIT( p_log ); Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 7733) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -179,7 +179,7 @@ __osm_sa_slvl_create( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_slvl_create: " "New SLtoVL Map for: OUT port 0x%016" PRIx64 ", lid 0x%X, port# 0x%X to In Port:%u\n", @@ -395,7 +395,7 @@ osm_slvl_rec_rcv_process( else { /* port out of range */ osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: " + "osm_slvl_rec_rcv_process: ERR 2601: " "Given LID (%u) is out of range:%u\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); } @@ -423,7 +423,7 @@ osm_slvl_rec_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: " + "osm_slvl_rec_rcv_process: ERR 2607: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_mcast_mgr.c =================================================================== --- opensm/osm_mcast_mgr.c (revision 7738) +++ opensm/osm_mcast_mgr.c (working copy) @@ -1130,7 +1130,7 @@ osm_mcast_mgr_process_single( p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl; mlid_ho = cl_ntoh16( mlid ); - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, 
"osm_mcast_mgr_process_single: " @@ -1249,7 +1249,7 @@ osm_mcast_mgr_process_single( { if( join_state & IB_JOIN_STATE_SEND_ONLY ) { - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "osm_mcast_mgr_process_single: " @@ -1269,7 +1269,7 @@ osm_mcast_mgr_process_single( } else { - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "osm_mcast_mgr_process_single: " Index: opensm/osm_trap_rcv.c =================================================================== --- opensm/osm_trap_rcv.c (revision 7733) +++ opensm/osm_trap_rcv.c (working copy) @@ -678,7 +678,7 @@ __osm_trap_rcv_process_sm( OSM_LOG_ENTER( p_rcv->p_log, __osm_trap_rcv_process_sm ); osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_trap_rcv_process_sm: " + "__osm_trap_rcv_process_sm: ERR 3807: " "This function is not supported yet\n"); OSM_LOG_EXIT( p_rcv->p_log ); @@ -696,7 +696,7 @@ __osm_trap_rcv_process_response( OSM_LOG_ENTER( p_rcv->p_log, __osm_trap_rcv_process_response ); osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_trap_rcv_process_response: " + "__osm_trap_rcv_process_response: ERR 3808: " "This function is not supported yet\n"); OSM_LOG_EXIT( p_rcv->p_log ); Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 7733) +++ opensm/osm_sa_informinfo.c (working copy) @@ -357,14 +357,15 @@ osm_infr_rcv_process_set_method( p_recvd_inform_info = (ib_inform_info_t*)ib_sa_mad_get_payload_ptr( p_sa_mad ); - /* the dump routine is not defined yet - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) - { - osm_dump_inform_info_record( p_rcv->p_log, - p_recvd_service_rec, - OSM_LOG_DEBUG ); - } - */ +#if 0 + /* the dump routine is not implemented yet */ + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + 
osm_dump_inform_info_record( p_rcv->p_log, + p_recvd_inform_info, + OSM_LOG_DEBUG ); + } +#endif /* Grab the lock */ cl_plock_excl_acquire( p_rcv->p_lock ); Index: opensm/osm_ucast_updn.c =================================================================== --- opensm/osm_ucast_updn.c (revision 7733) +++ opensm/osm_ucast_updn.c (working copy) @@ -879,7 +879,7 @@ osm_subn_calc_up_down_min_hop_table( if (num_guids == 0) { osm_log(&(osm.log), OSM_LOG_ERROR, - "osm_subn_calc_up_down_min_hop_table: " + "osm_subn_calc_up_down_min_hop_table: ERR AA0A: " "No guids were given or number of guids is 0\n"); return 1; } Index: opensm/osm_sa_node_record.c =================================================================== --- opensm/osm_sa_node_record.c (revision 7733) +++ opensm/osm_sa_node_record.c (working copy) @@ -161,7 +161,7 @@ __osm_nr_rcv_new_nr( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_nr_rcv_new_nr: " "New NodeRecord: node 0x%016" PRIx64 "\n\t\t\t\tport 0x%016" PRIx64 ", lid 0x%X\n", @@ -211,7 +211,7 @@ __osm_nr_rcv_create_nr( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_nr_rcv_create_nr: " "Looking for NodeRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", cl_ntoh16( match_lid ), @@ -257,7 +257,7 @@ __osm_nr_rcv_create_nr( */ if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_nr_rcv_create_nr: " "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", cl_ntoh16( base_lid_ho ), @@ -326,7 +326,7 @@ __osm_nr_rcv_by_comp_mask( */ if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_nr_rcv_by_comp_mask: " "Looking for node 0x%016" PRIx64 ", found 0x%016" PRIx64 "\n", @@ -493,7 +493,7 @@ osm_nr_rcv_process( */ if ( 
(p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_nr_rcv_process: " + "osm_nr_rcv_process: ERR 1D03: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_sa_link_record.c =================================================================== --- opensm/osm_sa_link_record.c (revision 7733) +++ opensm/osm_sa_link_record.c (working copy) @@ -312,7 +312,7 @@ __osm_lr_rcv_get_physp_link( { osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_lr_rcv_get_physp_link: " - "Acquiring link record.\n" + "Acquiring link record\n" "\t\t\t\tsrc port 0x%" PRIx64 " (port 0x%X)" ", dest port 0x%" PRIx64 " (port 0x%X)\n", cl_ntoh64( osm_physp_get_port_guid( p_src_physp ) ), @@ -606,7 +606,7 @@ __osm_lr_rcv_respond( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_lr_rcv_respond: " + "__osm_lr_rcv_respond: ERR 1806: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_slvl_map_rcv.c =================================================================== --- opensm/osm_slvl_map_rcv.c (revision 7733) +++ opensm/osm_slvl_map_rcv.c (working copy) @@ -211,13 +211,10 @@ osm_slvl_rcv_process( */ if( !osm_physp_is_valid( p_physp ) ) { - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rcv_process: " - "Got invalid port number 0x%X\n", - out_port_num ); - } + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_slvl_rcv_process: " + "Got invalid port number 0x%X\n", + out_port_num ); goto Exit; } Index: opensm/osm_sa_link_record_ctrl.c =================================================================== --- opensm/osm_sa_link_record_ctrl.c (revision 7733) +++ opensm/osm_sa_link_record_ctrl.c (working copy) @@ -116,7 +116,7 @@ osm_lr_rcv_ctrl_init( { osm_log( p_log, OSM_LOG_ERROR, 
"osm_lr_rcv_ctrl_init: ERR 1901: " - "Dispatcher registration failed.\n" ); + "Dispatcher registration failed\n" ); status = IB_INSUFFICIENT_RESOURCES; goto Exit; } Index: opensm/osm_qos.c =================================================================== --- opensm/osm_qos.c (revision 7733) +++ opensm/osm_qos.c (working copy) @@ -279,7 +279,8 @@ static ib_api_status_t qos_physp_setup(o /* setup vl high limit */ status = vl_high_limit_update(p_req, p, qcfg); if (status != IB_SUCCESS) { - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " + osm_log(p_log, OSM_LOG_ERROR, + "qos_physp_setup: ERR 6201 : " "failed to update VLHighLimit " "for port %" PRIx64 " #%d\n", cl_ntoh64(p->port_guid), port_num); @@ -289,7 +290,8 @@ static ib_api_status_t qos_physp_setup(o /* setup VLArbitration */ status = vlarb_update(p_req, p, port_num, qcfg); if (status != IB_SUCCESS) { - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " + osm_log(p_log, OSM_LOG_ERROR, + "qos_physp_setup: ERR 6202 : " "failed to update VLArbitration tables " "for port %" PRIx64 " #%d\n", cl_ntoh64(p->port_guid), port_num); @@ -299,7 +301,8 @@ static ib_api_status_t qos_physp_setup(o /* setup Sl2VL tables */ status = sl2vl_update(p_req, p, port_num, qcfg); if (status != IB_SUCCESS) { - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " + osm_log(p_log, OSM_LOG_ERROR, + "qos_physp_setup: ERR 6203 : " "failed to update SL2VLMapping tables " "for port %" PRIx64 " #%d\n", cl_ntoh64(p->port_guid), port_num); Index: opensm/osm_sa_mcmember_record.c =================================================================== --- opensm/osm_sa_mcmember_record.c (revision 7733) +++ opensm/osm_sa_mcmember_record.c (working copy) @@ -2286,7 +2286,7 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_mcmr_query_mgrp: ERR 1B17: " - "osm_vendor_send. 
status = %s\n", + "osm_vendor_send status = %s\n", ib_get_err_str(status) ); goto Exit; } Index: opensm/osm_drop_mgr.c =================================================================== --- opensm/osm_drop_mgr.c (revision 7733) +++ opensm/osm_drop_mgr.c (working copy) @@ -512,7 +512,7 @@ __osm_drop_mgr_check_node( if ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_drop_mgr_check_node: ERR 0107: " "Node 0x%016" PRIx64 " is not a switch node\n", cl_ntoh64( node_guid ) ); Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 7733) +++ opensm/osm_lid_mgr.c (working copy) @@ -637,7 +637,7 @@ __osm_lid_mgr_init_sweep( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " "final free lid range [0x%x:0x%x]\n", - p_range->min_lid, p_range->max_lid); + p_range->min_lid, p_range->max_lid ); OSM_LOG_EXIT( p_mgr->p_log ); return status; @@ -757,7 +757,7 @@ __osm_lid_mgr_find_free_lid_range( /* if we run out of lids, give an error and abort! 
*/ osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_lid_mgr_find_free_lid_range: ERR 0307: " - "OPENSM RAN OUT OF LIDS!!!\n"); + "OPENSM RAN OUT OF LIDS!!!\n" ); CL_ASSERT( 0 ); } @@ -827,7 +827,7 @@ __osm_lid_mgr_get_port_lid( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64" matches its known lid:0x%04x\n", - guid, min_lid); + guid, min_lid ); goto Exit; } else @@ -848,7 +848,7 @@ __osm_lid_mgr_get_port_lid( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64" has no persistent lid assigned\n", - guid); + guid ); } /* if the port info carries a lid it must be lmc aligned and not mapped @@ -872,7 +872,7 @@ __osm_lid_mgr_get_port_lid( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64" lid range:[0x%x-0x%x] is free\n", - guid, *p_min_lid, *p_max_lid); + guid, *p_min_lid, *p_max_lid ); goto NewLidSet; } else @@ -881,7 +881,7 @@ __osm_lid_mgr_get_port_lid( "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64 " existing lid range:[0x%x:0x%x] is not free\n", - guid, min_lid, min_lid + num_lids - 1); + guid, min_lid, min_lid + num_lids - 1 ); } } else @@ -890,7 +890,7 @@ __osm_lid_mgr_get_port_lid( "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64 " existing lid range:[0x%x:0x%x] is not lmc aligned\n", - guid, min_lid, min_lid + num_lids - 1); + guid, min_lid, min_lid + num_lids - 1 ); } } @@ -902,7 +902,7 @@ __osm_lid_mgr_get_port_lid( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64" assigned a new lid range:[0x%x-0x%x]\n", - guid, *p_min_lid, *p_max_lid); + guid, *p_min_lid, *p_max_lid ); lid_changed = 1; NewLidSet: @@ -1339,9 +1339,9 @@ osm_lid_mgr_process_sm( { osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, "osm_lid_mgr_process_sm: " - "Invoking UI function pfn_ui_pre_lid_assign\n"); + "Invoking UI function pfn_ui_pre_lid_assign\n" ); p_mgr->p_subn->opt.pfn_ui_pre_lid_assign( - p_mgr->p_subn->opt.ui_pre_lid_assign_ctx); + 
p_mgr->p_subn->opt.ui_pre_lid_assign_ctx ); } /* Set the send_set_reqs of the p_mgr to FALSE, and Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 7733) +++ opensm/osm_pkey_mgr.c (working copy) @@ -245,7 +245,7 @@ pkey_mgr_update_peer_port( if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: " + "pkey_mgr_update_peer_port: ERR 0502: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -272,7 +272,7 @@ pkey_mgr_update_peer_port( ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: " + "pkey_mgr_update_peer_port: ERR 0503: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -332,7 +332,7 @@ static boolean_t pkey_mgr_update_port( ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: " + "pkey_mgr_update_port: ERR 0504: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", block_index, @@ -393,7 +393,8 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { - osm_log( &p_osm->log, OSM_LOG_ERROR, "osm_pkey_mgr_process: " + osm_log( &p_osm->log, OSM_LOG_ERROR, + "osm_pkey_mgr_process: ERR 0505: " "osm_prtn_make_partitions() failed\n" ); goto _err; } Index: opensm/osm_vl_arb_rcv.c =================================================================== --- opensm/osm_vl_arb_rcv.c (revision 7733) +++ opensm/osm_vl_arb_rcv.c (working copy) @@ -211,13 +211,10 @@ osm_vla_rcv_process( */ if( !osm_physp_is_valid( p_physp ) ) { - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vla_rcv_process: " - "Got invalid port number 0x%X\n", - port_num ); - } + osm_log( 
p_rcv->p_log, OSM_LOG_ERROR, + "osm_vla_rcv_process: " + "Got invalid port number 0x%X\n", + port_num ); goto Exit; } From mst at mellanox.co.il Tue Jun 6 11:11:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 21:11:32 +0300 Subject: [openib-general] RFC: ib_cache_event problems Message-ID: <20060606181132.GA4701@mellanox.co.il> Hello! We are seeing the following problems in ib_cache_event: 1. If a GFP_ATOMIC allocation fails, it seems that the cache won't be updated. 2. Since the cache isn't updated immediately, but by queueing a work request, it is possible for a ULP, e.g. IP over IB, to query the cache as a result of the event and get a stale value. Consider for example ipoib - in this case ipoib_pkey_dev_check_presence returns an incorrect value. We are actually seeing this happening in stress testing. Since the SM will not retry the MAD, the event won't be regenerated, so the values the ULP gets from the cache may never get updated. Suggestions: 1. The cache should create ib_update_work objects statically upon a hotplug event. 2. We need a mechanism for the cache to consume events which trigger cache updates, and delay reporting them to ULPs until after the cache is updated. Opinions? -- MST From rkuchimanchi at silverstorm.com Tue Jun 6 11:17:34 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Tue, 06 Jun 2006 23:47:34 +0530 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: References: <1149171133.7588.45.camel@Prawra.gs-lab.com> (Ramchandra K.'s message of "Thu, 01 Jun 2006 19:42:13 +0530") Message-ID: <44861416.2864.88C6C5@rkuchimanchi.silverstorm.com> > Thanks, I applied this. Thanks a lot Roland. But there was also a patch for ibsrpdm to display the IO class of the target. I am including it below for your convenience.
Regards, Ram Signed-off-by: Ramachandra K Index: userspace/srptools/src/srp-dm.c =================================================================== --- userspace/srptools/src/srp-dm.c (revision 7738) +++ userspace/srptools/src/srp-dm.c (working copy) @@ -399,6 +399,7 @@ pr_human(" vendor ID: %06x\n", ntohl(ioc_prof.vendor_id) >> 8); pr_human(" device ID: %06x\n", ntohl(ioc_prof.device_id)); pr_human(" ID: %s\n", ioc_prof.id); + pr_human(" IO class : %hx\n", ntohs(ioc_prof.io_class)); pr_human(" service entries: %d\n", ioc_prof.service_entries); for (j = 0; j < ioc_prof.service_entries; j += 4) { @@ -429,11 +430,13 @@ "ioc_guid=%016llx," "dgid=%016llx%016llx," "pkey=ffff," + "io_class=%hx," "service_id=%016llx\n", id_ext, (unsigned long long) ntohll(ioc_prof.guid), (unsigned long long) subnet_prefix, (unsigned long long) guid, + ntohs(ioc_prof.io_class), (unsigned long long) ntohll(svc_entries.service[k].id)); } } From mshefty at ichips.intel.com Tue Jun 6 11:28:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Jun 2006 11:28:48 -0700 Subject: [openib-general] multicast questions In-Reply-To: References: Message-ID: <4485C960.50802@ichips.intel.com> Sean Hefty wrote: > Does anyone know if the following multicast configurations have been tested? > > 1. Receiving messages on the same port that they were sent, but on a different > QP. > > 2. Receiving messages on multiple QPs on the same port. These are the situations that I'm seeing. (Note that this is with new code, so it's entirely possible that the errors are in the new code.) Works: Process A1 on node A starts and joins a group. Process B1 on node B starts and joins the same group. Process B1 sends messages. process A1 receives messages. Failure 1 (loopback): Process A1 on node A starts and joins a group. Process A2 on node A starts and joins the same group. Process A2 sends messages. Process A1 does not see messages. 
Failure 2 (multiple receivers): Process A1 on node A starts and joins a group. Process A2 on node A starts and joins the same group. Process B1 on node B starts and joins the same group. Process B1 sends messages. Process A1 receives messages. Process A2 does not see messages. It appears that the multicast groups are being created successfully, and the QPs are attaching to the groups. - Sean From mshefty at ichips.intel.com Tue Jun 6 11:43:53 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Jun 2006 11:43:53 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: Message-ID: <4485CCE9.7020807@ichips.intel.com> Arlin Davis wrote: > Here is a patch to the openib-cma provider that uses the new set_option > feature of the uCMA to adjust connect request timeout and retry values. The > defaults are a little quick for some consumers. They are now bumped up from 3 > retries to 15 and are tunable with uDAPL environment variables. Also, > included a fix to disallow any event after a disconnect event. > > You need to sync up the commit with Sean's patch for the uCMA get/set > IB_CM_REQ_OPTIONS. The RDMA CM changes for kernel and userspace have been committed to allow setting the timeout / retry values. - Sean From eitan at mellanox.co.il Tue Jun 6 11:50:10 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 6 Jun 2006 21:50:10 +0300 Subject: [openib-general] RE: [PATCH][MINOR] OpenSM: Fix inconsistent use of osm_log level Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023687E6@mtlexch01.mtl.com> Hi Hal, Thanks for cleaning this up. I see you also cleaned up missing ":" in errors etc. Good to go from my perspective Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, June 06, 2006 8:27 PM > To: openib-general at openib.org > Cc: Eitan Zahavi > Subject: [PATCH][MINOR] OpenSM: Fix inconsistent use of osm_log level > > OpenSM: Fix inconsistent use of osm_log level > Also, some other cosmetic changes > > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_pkey_rcv.c > =================================================================== > --- opensm/osm_pkey_rcv.c (revision 7733) > +++ opensm/osm_pkey_rcv.c (working copy) > @@ -200,13 +200,10 @@ osm_pkey_rcv_process( > */ > if( !osm_physp_is_valid( p_physp ) ) > { > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) > - { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pkey_rcv_process: ERR 4807: " > - "Got invalid port number 0x%X\n", > - port_num ); > - } > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_pkey_rcv_process: ERR 4807: " > + "Got invalid port number 0x%X\n", > + port_num ); > goto Exit; > } > > Index: opensm/osm_sa_guidinfo_record.c > =================================================================== > --- opensm/osm_sa_guidinfo_record.c (revision 7733) > +++ opensm/osm_sa_guidinfo_record.c (working copy) > @@ -171,7 +171,7 @@ __osm_gir_rcv_new_gir( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_gir_rcv_new_gir: " > "New GUIDInfoRecord: lid 0x%X, block num %d\n", > cl_ntoh16( match_lid ), block_num ); > @@ -220,7 +220,7 @@ __osm_sa_gir_create_gir( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_gir_create_gir: " > "Looking for GUIDRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", > cl_ntoh16( match_lid ), > @@ -282,7 +282,7 @@ __osm_sa_gir_create_gir( > */ > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { 
> - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_gir_create_gir: " > "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", > cl_ntoh16( base_lid_ho ), > @@ -495,7 +495,7 @@ osm_gir_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_gir_rcv_process: " > + "osm_gir_rcv_process: ERR 5103: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_sa_vlarb_record.c > =================================================================== > --- opensm/osm_sa_vlarb_record.c (revision 7733) > +++ opensm/osm_sa_vlarb_record.c (working copy) > @@ -179,7 +179,7 @@ __osm_sa_vl_arb_create( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_vl_arb_create: " > "New VLArbitration for: port 0x%016" PRIx64 > ", lid 0x%X, port# 0x%X Block:%u\n", > @@ -416,7 +416,7 @@ osm_vlarb_rec_rcv_process( > else > { /* port out of range */ > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_vlarb_rec_rcv_process: " > + "osm_vlarb_rec_rcv_process: ERR 2A01: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); > } > @@ -444,7 +444,7 @@ osm_vlarb_rec_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_vlarb_rec_rcv_process: " > + "osm_vlarb_rec_rcv_process: ERR 2A08: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_sa_multipath_record.c > =================================================================== > --- opensm/osm_sa_multipath_record.c (revision 7733) > +++ opensm/osm_sa_multipath_record.c (working copy) > @@ -1281,7 +1281,8 @@ __osm_mpr_rcv_process_pairs( > max_paths - total_paths, > 
comp_mask, p_list ); > total_paths += num_paths; > - osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_mpr_rcv_process_pairs: " > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_mpr_rcv_process_pairs: " > "%d paths %d total paths %d max paths\n", > num_paths, total_paths, max_paths ); > /* Just take first NumbPaths found */ > @@ -1468,7 +1469,8 @@ osm_mpr_rcv_process( > if ( sa_status != IB_SA_MAD_STATUS_SUCCESS || !nsrc || !ndest ) > { > if ( sa_status == IB_SA_MAD_STATUS_SUCCESS && ( !nsrc || !ndest ) ) > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_mpr_rcv_process_cb: ERR > 4512: " > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_mpr_rcv_process_cb: ERR 4512: " > "__osm_mpr_rcv_get_end_points failed, not enough GIDs " > "(nsrc %d ndest %d)\n", > nsrc, ndest); > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 7733) > +++ opensm/osm_subnet.c (working copy) > @@ -250,7 +250,7 @@ osm_get_gid_by_mad_addr( > if ( p_gid == NULL ) > { > osm_log( p_log, OSM_LOG_ERROR, > - "osm_get_gid_by_mad_addr: ERR 7505 " > + "osm_get_gid_by_mad_addr: ERR 7505: " > "Provided output GID is NULL\n"); > return(IB_INVALID_PARAMETER); > } > @@ -281,7 +281,7 @@ osm_get_gid_by_mad_addr( > { > /* The dest_lid is not in the subnet table - this is an error */ > osm_log( p_log, OSM_LOG_ERROR, > - "osm_get_gid_by_mad_addr: ERR 7501 " > + "osm_get_gid_by_mad_addr: ERR 7501: " > "LID is out of range: 0x%X\n", > cl_ntoh16(p_mad_addr->dest_lid) > ); > @@ -316,7 +316,7 @@ osm_get_physp_by_mad_addr( > { > /* The port is not in the port_lid table - this is an error */ > osm_log( p_log, OSM_LOG_ERROR, > - "osm_get_physp_by_mad_addr: ERR 7502 " > + "osm_get_physp_by_mad_addr: ERR 7502: " > "Cannot locate port object by lid: 0x%X\n", > cl_ntoh16(p_mad_addr->dest_lid) > ); > @@ -329,7 +329,7 @@ osm_get_physp_by_mad_addr( > { > /* The dest_lid is not in the subnet table - this is an error */ > osm_log( p_log, OSM_LOG_ERROR, 
> - "osm_get_physp_by_mad_addr: ERR 7503 " > + "osm_get_physp_by_mad_addr: ERR 7503: " > "Lid is out of range: 0x%X\n", > cl_ntoh16(p_mad_addr->dest_lid) > ); > @@ -365,7 +365,7 @@ osm_get_port_by_mad_addr( > { > /* The dest_lid is not in the subnet table - this is an error */ > osm_log( p_log, OSM_LOG_ERROR, > - "osm_get_port_by_mad_addr: ERR 7504 " > + "osm_get_port_by_mad_addr: ERR 7504: " > "Lid is out of range: 0x%X\n", > cl_ntoh16(p_mad_addr->dest_lid) > ); > Index: opensm/osm_sa_lft_record.c > =================================================================== > --- opensm/osm_sa_lft_record.c (revision 7733) > +++ opensm/osm_sa_lft_record.c (working copy) > @@ -510,7 +510,7 @@ osm_lftr_rcv_process( > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "osm_lftr_rcv_process: ERR 4411: " > - "osm_vendor_send. status = %s\n", > + "osm_vendor_send status = %s\n", > ib_get_err_str(status)); > goto Exit; > } > Index: opensm/osm_pkey_rcv_ctrl.c > =================================================================== > --- opensm/osm_pkey_rcv_ctrl.c (revision 7733) > +++ opensm/osm_pkey_rcv_ctrl.c (working copy) > @@ -110,7 +110,7 @@ osm_pkey_rcv_ctrl_init( > { > osm_log( p_log, OSM_LOG_ERROR, > "osm_pkey_rcv_ctrl_init: ERR 4901: " > - "Dispatcher registration failed.\n" ); > + "Dispatcher registration failed\n" ); > status = IB_INSUFFICIENT_RESOURCES; > goto Exit; > } > Index: opensm/osm_sa_service_record.c > =================================================================== > --- opensm/osm_sa_service_record.c (revision 7733) > +++ opensm/osm_sa_service_record.c (working copy) > @@ -1115,7 +1115,7 @@ osm_sr_rcv_process( > default: > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "osm_sr_rcv_process: " > - "Bad Method (%s)\n", ib_get_sa_method_str( p_sa_mad->method )); > + "Bad Method (%s)\n", ib_get_sa_method_str( p_sa_mad->method ) ); > osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status ); > break; > } > Index: opensm/osm_sa_portinfo_record.c > 
=================================================================== > --- opensm/osm_sa_portinfo_record.c (revision 7733) > +++ opensm/osm_sa_portinfo_record.c (working copy) > @@ -168,7 +168,7 @@ __osm_pir_rcv_new_pir( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_pir_rcv_new_pir: " > "New PortInfoRecord: port 0x%016" PRIx64 > ", lid 0x%X, port# 0x%X\n", > @@ -678,7 +678,7 @@ osm_pir_rcv_process( > else > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pir_rcv_process: " > + "osm_pir_rcv_process: ERR 2101: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); > } > @@ -694,7 +694,7 @@ osm_pir_rcv_process( > else > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pir_rcv_process: " > + "osm_pir_rcv_process: ERR 2103: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_pi->base_lid), cl_ptr_vector_get_size(p_tbl)); > } > @@ -721,7 +721,7 @@ osm_pir_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pir_rcv_process: " > + "osm_pir_rcv_process: ERR 2108: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > @@ -852,7 +852,7 @@ osm_pir_rcv_process( > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "osm_pir_rcv_process: ERR 2107: " > - "osm_vendor_send. 
status = %s\n", > + "osm_vendor_send status = %s\n", > ib_get_err_str(status)); > goto Exit; > } > Index: opensm/osm_sa_pkey_record.c > =================================================================== > --- opensm/osm_sa_pkey_record.c (revision 7733) > +++ opensm/osm_sa_pkey_record.c (working copy) > @@ -169,7 +169,7 @@ __osm_sa_pkey_create( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_pkey_create: " > "New P_Key table for: port 0x%016" PRIx64 > ", lid 0x%X, port# 0x%X Block:%u\n", > @@ -432,7 +432,7 @@ osm_pkey_rec_rcv_process( > else > { /* port out of range */ > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pkey_rec_rcv_process: " > + "osm_pkey_rec_rcv_process: ERR 4609: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); > } > @@ -460,7 +460,7 @@ osm_pkey_rec_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pkey_rec_rcv_process: " > + "osm_pkey_rec_rcv_process: ERR 460A: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_inform.c > =================================================================== > --- opensm/osm_inform.c (revision 7733) > +++ opensm/osm_inform.c (working copy) > @@ -283,7 +283,7 @@ osm_infr_insert_to_db( > "Inserting a new InformInfo Record into Database\n"); > osm_log( p_log, OSM_LOG_DEBUG, > "osm_infr_insert_to_db: " > - "Dump before insertion (size : %d) : \n", > + "Dump before insertion (size : %d)\n", > cl_qlist_count(&p_subn->sa_infr_list) ); > __dump_all_informs(p_subn, p_log); > > @@ -295,7 +295,7 @@ osm_infr_insert_to_db( > > osm_log( p_log, OSM_LOG_DEBUG, > "osm_infr_insert_to_db: " > - "Dump after insertion (size : %d) : \n", > + "Dump after insertion (size : %d)\n", > 
cl_qlist_count(&p_subn->sa_infr_list) ); > __dump_all_informs(p_subn, p_log); > OSM_LOG_EXIT( p_log ); > Index: opensm/osm_sa_slvl_record.c > =================================================================== > --- opensm/osm_sa_slvl_record.c (revision 7733) > +++ opensm/osm_sa_slvl_record.c (working copy) > @@ -179,7 +179,7 @@ __osm_sa_slvl_create( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_slvl_create: " > "New SLtoVL Map for: OUT port 0x%016" PRIx64 > ", lid 0x%X, port# 0x%X to In Port:%u\n", > @@ -395,7 +395,7 @@ osm_slvl_rec_rcv_process( > else > { /* port out of range */ > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_slvl_rec_rcv_process: " > + "osm_slvl_rec_rcv_process: ERR 2601: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); > } > @@ -423,7 +423,7 @@ osm_slvl_rec_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_slvl_rec_rcv_process: " > + "osm_slvl_rec_rcv_process: ERR 2607: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_mcast_mgr.c > =================================================================== > --- opensm/osm_mcast_mgr.c (revision 7738) > +++ opensm/osm_mcast_mgr.c (working copy) > @@ -1130,7 +1130,7 @@ osm_mcast_mgr_process_single( > p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl; > mlid_ho = cl_ntoh16( mlid ); > > - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) > + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > { > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "osm_mcast_mgr_process_single: " > @@ -1249,7 +1249,7 @@ osm_mcast_mgr_process_single( > { > if( join_state & IB_JOIN_STATE_SEND_ONLY ) > { > - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) > + if( osm_log_is_active( p_mgr->p_log, 
OSM_LOG_DEBUG ) ) > { > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "osm_mcast_mgr_process_single: " > @@ -1269,7 +1269,7 @@ osm_mcast_mgr_process_single( > } > else > { > - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) > + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > { > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "osm_mcast_mgr_process_single: " > Index: opensm/osm_trap_rcv.c > =================================================================== > --- opensm/osm_trap_rcv.c (revision 7733) > +++ opensm/osm_trap_rcv.c (working copy) > @@ -678,7 +678,7 @@ __osm_trap_rcv_process_sm( > OSM_LOG_ENTER( p_rcv->p_log, __osm_trap_rcv_process_sm ); > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "__osm_trap_rcv_process_sm: " > + "__osm_trap_rcv_process_sm: ERR 3807: " > "This function is not supported yet\n"); > > OSM_LOG_EXIT( p_rcv->p_log ); > @@ -696,7 +696,7 @@ __osm_trap_rcv_process_response( > OSM_LOG_ENTER( p_rcv->p_log, __osm_trap_rcv_process_response ); > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "__osm_trap_rcv_process_response: " > + "__osm_trap_rcv_process_response: ERR 3808: " > "This function is not supported yet\n"); > > OSM_LOG_EXIT( p_rcv->p_log ); > Index: opensm/osm_sa_informinfo.c > =================================================================== > --- opensm/osm_sa_informinfo.c (revision 7733) > +++ opensm/osm_sa_informinfo.c (working copy) > @@ -357,14 +357,15 @@ osm_infr_rcv_process_set_method( > p_recvd_inform_info = > (ib_inform_info_t*)ib_sa_mad_get_payload_ptr( p_sa_mad ); > > - /* the dump routine is not defined yet > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > - { > - osm_dump_inform_info_record( p_rcv->p_log, > - p_recvd_service_rec, > - OSM_LOG_DEBUG ); > - } > - */ > +#if 0 > + /* the dump routine is not implemented yet */ > + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > + { > + osm_dump_inform_info_record( p_rcv->p_log, > + p_recvd_inform_info, > + OSM_LOG_DEBUG ); > + } > +#endif > > /* Grab the 
lock */ > cl_plock_excl_acquire( p_rcv->p_lock ); > Index: opensm/osm_ucast_updn.c > =================================================================== > --- opensm/osm_ucast_updn.c (revision 7733) > +++ opensm/osm_ucast_updn.c (working copy) > @@ -879,7 +879,7 @@ osm_subn_calc_up_down_min_hop_table( > if (num_guids == 0) > { > osm_log(&(osm.log), OSM_LOG_ERROR, > - "osm_subn_calc_up_down_min_hop_table: " > + "osm_subn_calc_up_down_min_hop_table: ERR AA0A: " > "No guids were given or number of guids is 0\n"); > return 1; > } > Index: opensm/osm_sa_node_record.c > =================================================================== > --- opensm/osm_sa_node_record.c (revision 7733) > +++ opensm/osm_sa_node_record.c (working copy) > @@ -161,7 +161,7 @@ __osm_nr_rcv_new_nr( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_new_nr: " > "New NodeRecord: node 0x%016" PRIx64 > "\n\t\t\t\tport 0x%016" PRIx64 ", lid 0x%X\n", > @@ -211,7 +211,7 @@ __osm_nr_rcv_create_nr( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_create_nr: " > "Looking for NodeRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", > cl_ntoh16( match_lid ), > @@ -257,7 +257,7 @@ __osm_nr_rcv_create_nr( > */ > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_create_nr: " > "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", > cl_ntoh16( base_lid_ho ), > @@ -326,7 +326,7 @@ __osm_nr_rcv_by_comp_mask( > */ > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_by_comp_mask: " > "Looking for node 0x%016" PRIx64 > ", found 0x%016" PRIx64 "\n", > @@ -493,7 +493,7 @@ osm_nr_rcv_process( > */ > 
if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1) ) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_nr_rcv_process: " > + "osm_nr_rcv_process: ERR 1D03: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_sa_link_record.c > =================================================================== > --- opensm/osm_sa_link_record.c (revision 7733) > +++ opensm/osm_sa_link_record.c (working copy) > @@ -312,7 +312,7 @@ __osm_lr_rcv_get_physp_link( > { > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_lr_rcv_get_physp_link: " > - "Acquiring link record.\n" > + "Acquiring link record\n" > "\t\t\t\tsrc port 0x%" PRIx64 " (port 0x%X)" > ", dest port 0x%" PRIx64 " (port 0x%X)\n", > cl_ntoh64( osm_physp_get_port_guid( p_src_physp ) ), > @@ -606,7 +606,7 @@ __osm_lr_rcv_respond( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "__osm_lr_rcv_respond: " > + "__osm_lr_rcv_respond: ERR 1806: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_slvl_map_rcv.c > =================================================================== > --- opensm/osm_slvl_map_rcv.c (revision 7733) > +++ opensm/osm_slvl_map_rcv.c (working copy) > @@ -211,13 +211,10 @@ osm_slvl_rcv_process( > */ > if( !osm_physp_is_valid( p_physp ) ) > { > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) > - { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_slvl_rcv_process: " > - "Got invalid port number 0x%X\n", > - out_port_num ); > - } > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_slvl_rcv_process: " > + "Got invalid port number 0x%X\n", > + out_port_num ); > goto Exit; > } > > Index: opensm/osm_sa_link_record_ctrl.c > =================================================================== > --- opensm/osm_sa_link_record_ctrl.c (revision 7733) > +++ 
opensm/osm_sa_link_record_ctrl.c (working copy) > @@ -116,7 +116,7 @@ osm_lr_rcv_ctrl_init( > { > osm_log( p_log, OSM_LOG_ERROR, > "osm_lr_rcv_ctrl_init: ERR 1901: " > - "Dispatcher registration failed.\n" ); > + "Dispatcher registration failed\n" ); > status = IB_INSUFFICIENT_RESOURCES; > goto Exit; > } > Index: opensm/osm_qos.c > =================================================================== > --- opensm/osm_qos.c (revision 7733) > +++ opensm/osm_qos.c (working copy) > @@ -279,7 +279,8 @@ static ib_api_status_t qos_physp_setup(o > /* setup vl high limit */ > status = vl_high_limit_update(p_req, p, qcfg); > if (status != IB_SUCCESS) { > - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " > + osm_log(p_log, OSM_LOG_ERROR, > + "qos_physp_setup: ERR 6201 : " > "failed to update VLHighLimit " > "for port %" PRIx64 " #%d\n", > cl_ntoh64(p->port_guid), port_num); > @@ -289,7 +290,8 @@ static ib_api_status_t qos_physp_setup(o > /* setup VLArbitration */ > status = vlarb_update(p_req, p, port_num, qcfg); > if (status != IB_SUCCESS) { > - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " > + osm_log(p_log, OSM_LOG_ERROR, > + "qos_physp_setup: ERR 6202 : " > "failed to update VLArbitration tables " > "for port %" PRIx64 " #%d\n", > cl_ntoh64(p->port_guid), port_num); > @@ -299,7 +301,8 @@ static ib_api_status_t qos_physp_setup(o > /* setup Sl2VL tables */ > status = sl2vl_update(p_req, p, port_num, qcfg); > if (status != IB_SUCCESS) { > - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " > + osm_log(p_log, OSM_LOG_ERROR, > + "qos_physp_setup: ERR 6203 : " > "failed to update SL2VLMapping tables " > "for port %" PRIx64 " #%d\n", > cl_ntoh64(p->port_guid), port_num); > Index: opensm/osm_sa_mcmember_record.c > =================================================================== > --- opensm/osm_sa_mcmember_record.c (revision 7733) > +++ opensm/osm_sa_mcmember_record.c (working copy) > @@ -2286,7 +2286,7 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* > { > osm_log( 
p_rcv->p_log, OSM_LOG_ERROR, > "osm_mcmr_query_mgrp: ERR 1B17: " > - "osm_vendor_send. status = %s\n", > + "osm_vendor_send status = %s\n", > ib_get_err_str(status) ); > goto Exit; > } > Index: opensm/osm_drop_mgr.c > =================================================================== > --- opensm/osm_drop_mgr.c (revision 7733) > +++ opensm/osm_drop_mgr.c (working copy) > @@ -512,7 +512,7 @@ __osm_drop_mgr_check_node( > > if ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) > { > - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > + osm_log( p_mgr->p_log, OSM_LOG_ERROR, > "__osm_drop_mgr_check_node: ERR 0107: " > "Node 0x%016" PRIx64 " is not a switch node\n", > cl_ntoh64( node_guid ) ); > Index: opensm/osm_lid_mgr.c > =================================================================== > --- opensm/osm_lid_mgr.c (revision 7733) > +++ opensm/osm_lid_mgr.c (working copy) > @@ -637,7 +637,7 @@ __osm_lid_mgr_init_sweep( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_init_sweep: " > "final free lid range [0x%x:0x%x]\n", > - p_range->min_lid, p_range->max_lid); > + p_range->min_lid, p_range->max_lid ); > > OSM_LOG_EXIT( p_mgr->p_log ); > return status; > @@ -757,7 +757,7 @@ __osm_lid_mgr_find_free_lid_range( > /* if we run out of lids, give an error and abort! 
*/ > osm_log( p_mgr->p_log, OSM_LOG_ERROR, > "__osm_lid_mgr_find_free_lid_range: ERR 0307: " > - "OPENSM RAN OUT OF LIDS!!!\n"); > + "OPENSM RAN OUT OF LIDS!!!\n" ); > CL_ASSERT( 0 ); > } > > @@ -827,7 +827,7 @@ __osm_lid_mgr_get_port_lid( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64" matches its known lid:0x%04x\n", > - guid, min_lid); > + guid, min_lid ); > goto Exit; > } > else > @@ -848,7 +848,7 @@ __osm_lid_mgr_get_port_lid( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64" has no persistent lid assigned\n", > - guid); > + guid ); > } > > /* if the port info carries a lid it must be lmc aligned and not mapped > @@ -872,7 +872,7 @@ __osm_lid_mgr_get_port_lid( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64" lid range:[0x%x-0x%x] is free\n", > - guid, *p_min_lid, *p_max_lid); > + guid, *p_min_lid, *p_max_lid ); > goto NewLidSet; > } > else > @@ -881,7 +881,7 @@ __osm_lid_mgr_get_port_lid( > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64 > " existing lid range:[0x%x:0x%x] is not free\n", > - guid, min_lid, min_lid + num_lids - 1); > + guid, min_lid, min_lid + num_lids - 1 ); > } > } > else > @@ -890,7 +890,7 @@ __osm_lid_mgr_get_port_lid( > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64 > " existing lid range:[0x%x:0x%x] is not lmc aligned\n", > - guid, min_lid, min_lid + num_lids - 1); > + guid, min_lid, min_lid + num_lids - 1 ); > } > } > > @@ -902,7 +902,7 @@ __osm_lid_mgr_get_port_lid( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64" assigned a new lid range:[0x%x-0x%x]\n", > - guid, *p_min_lid, *p_max_lid); > + guid, *p_min_lid, *p_max_lid ); > lid_changed = 1; > > NewLidSet: > @@ -1339,9 +1339,9 @@ osm_lid_mgr_process_sm( > { > osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > "osm_lid_mgr_process_sm: " > - "Invoking UI function pfn_ui_pre_lid_assign\n"); > + "Invoking UI function 
pfn_ui_pre_lid_assign\n" ); > p_mgr->p_subn->opt.pfn_ui_pre_lid_assign( > - p_mgr->p_subn->opt.ui_pre_lid_assign_ctx); > + p_mgr->p_subn->opt.ui_pre_lid_assign_ctx ); > } > > /* Set the send_set_reqs of the p_mgr to FALSE, and > Index: opensm/osm_pkey_mgr.c > =================================================================== > --- opensm/osm_pkey_mgr.c (revision 7733) > +++ opensm/osm_pkey_mgr.c (working copy) > @@ -245,7 +245,7 @@ pkey_mgr_update_peer_port( > if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) > { > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: " > + "pkey_mgr_update_peer_port: ERR 0502: " > "pkey_mgr_enforce_partition() failed to update " > "node 0x%016" PRIx64 " port %u\n", > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > @@ -272,7 +272,7 @@ pkey_mgr_update_peer_port( > ret_val = TRUE; > else > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: " > + "pkey_mgr_update_peer_port: ERR 0503: " > "pkey_mgr_update_pkey_entry() failed to update " > "pkey table block %d for node 0x%016" PRIx64 > " port %u\n", > @@ -332,7 +332,7 @@ static boolean_t pkey_mgr_update_port( > ret_val = TRUE; > else > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_port: " > + "pkey_mgr_update_port: ERR 0504: " > "pkey_mgr_update_pkey_entry() failed to update " > "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > block_index, > @@ -393,7 +393,8 @@ osm_pkey_mgr_process( > > if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) > { > - osm_log( &p_osm->log, OSM_LOG_ERROR, "osm_pkey_mgr_process: " > + osm_log( &p_osm->log, OSM_LOG_ERROR, > + "osm_pkey_mgr_process: ERR 0505: " > "osm_prtn_make_partitions() failed\n" ); > goto _err; > } > Index: opensm/osm_vl_arb_rcv.c > =================================================================== > --- opensm/osm_vl_arb_rcv.c (revision 7733) > +++ opensm/osm_vl_arb_rcv.c (working copy) > @@ -211,13 +211,10 @@ osm_vla_rcv_process( > */ > if( 
!osm_physp_is_valid( p_physp ) ) > { > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) > - { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_vla_rcv_process: " > - "Got invalid port number 0x%X\n", > - port_num ); > - } > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_vla_rcv_process: " > + "Got invalid port number 0x%X\n", > + port_num ); > goto Exit; > } > > From Thomas.Talpey at netapp.com Tue Jun 6 12:07:34 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 15:07:34 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: References: Message-ID: <7.0.1.0.2.20060606131933.04267008@netapp.com> Todd, thanks for the set-up. I'm really glad we're having this discussion! Let me give an NFS/RDMA example to illustrate why this upper layer, at least, doesn't want the HCA doing its flow control, or resource management. NFS/RDMA is a credit-based protocol which allows many operations in progress at the server. Let's say the client is currently running with an RPC slot table of 100 requests (a typical value). Of these requests, some workload-specific percentage will be reads, writes, or metadata. All NFS operations consist of one send from client to server, some number of RDMA writes (for NFS reads) or RDMA reads (for NFS writes), then terminated with one send from server to client. The number of RDMA read or write operations per NFS op depends on the amount of data being read or written, and also the memory registration strategy in use on the client. The highest-performing such strategy is an all-physical one, which results in one RDMA-able segment per physical page. NFS r/w requests are, by default, 32KB, or 8 pages typical. So, typically 8 RDMA requests (read or write) are the result. To illustrate, let's say the client is processing a multi-threaded workload, with (say) 50% reads, 20% writes, and 30% metadata such as lookup and getattr. A kernel build, for example. 
Therefore, of our 100 active operations, 50 are reads for 32KB each, 20 are writes of 32KB, and 30 are metadata (non-RDMA). To the server, this results in 100 requests, 100 replies, 400 RDMA writes, and 160 RDMA Reads. Of course, these overlap heavily due to the widely differing latency of each op and the highly distributed arrival times. But, for the example this is a snapshot of current load.

The latency of the metadata operations is quite low, because lookup and getattr are acting on what is effectively cached data. The reads and writes, however, are much longer, because they reference the filesystem. When disk queues are deep, they can take many ms.

Imagine what happens if the client's IRD is 4 and the server ignores its local ORD. As soon as a write begins execution, the server posts 8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads are sent, the fifth stalls, and stalls the send queue! Even when three RDMA Reads complete, the queue remains stalled; it doesn't unblock until the fourth is done and all the RDMA Reads have been initiated.

But, what just happened to all the other server send traffic? All those metadata replies, and other reads which completed? They're stuck, waiting for that one write request. In my example, these number 99 NFS ops, i.e. 654 WRs! All for one NFS write! The client operation stream effectively became single threaded.

What good is the "rapid initiation of RDMA Reads" you describe in the face of this? Yes, there are many arcane and resource-intensive ways around it. But the simplest by far is to count the RDMA Reads outstanding, and for the *upper layer* to honor ORD, not the HCA. Then, the send queue never blocks, and the operation streams never lose parallelism. This is what our NFS server does.

As to the depth of IRD, this is a different calculation: it's a DelayxBandwidth of the RDMA Read stream. 4 is good for local, low latency connections.
But over a complicated switch infrastructure, or heaven forbid a dark fiber long link, I guarantee it will cause a bottleneck. This isn't an issue except for operations that care, but it is certainly detectable. I would like to see if a pure RDMA Read stream can fully utilize a typical IB fabric, and how much headroom an IRD of 4 provides. Not much, I predict. Closing the connection if IRD is "insufficient to meet goals" isn't a good answer, IMO. How does that benefit interoperability? Thanks for the opportunity to spout off again. Comments welcome! Tom. At 12:43 PM 6/6/2006, Rimmer, Todd wrote: > > >> Talpey, Thomas >> Sent: Tuesday, June 06, 2006 10:49 AM >> >> At 10:40 AM 6/6/2006, Roland Dreier wrote: >> > Thomas> This is the difference between "may" and "must". The >value >> > Thomas> is provided, but I don't see anything in the spec that >> > Thomas> makes a requirement on its enforcement. Table 107 says >the >> > Thomas> consumer can query it, that's about as close as it >> > Thomas> comes. There's some discussion about CM exchange too. >> > >> >This seems like a very strained interpretation of the spec. For >> >> I don't see how strained has anything to do with it. It's not saying >> anything >> either way. So, a legal implementation can make either choice. We're >> talking about the spec! >> >> But, it really doesn't matter. The point is, an upper layer should be >> paying >> attention to the number of RDMA Reads it posts, or else suffer either >the >> queue-stalling or connection-failing consequences. Bad stuff either >way. >> >> Tom. > >Somewhere beneath this discussion is a bug in the application or IB >stack. I'm not sure which "may" in the spec you are referring to, but >the "may"s I have found all are for cases where the responder might >support only 1 outstanding request. In all cases the negotiation >protocol must be followed and the requestor is not allowed to exceed the >negotiated limit. 
>
>The mechanism should be:
>client queries its local HCA and determines responder resources (eg.
>number of concurrent outstanding RDMA reads on the wire from the remote
>end where this end will respond with the read data) and initiator depth
>(eg. number of concurrent outstanding RDMA reads which this end can
>initiate as the requestor).
>
>client puts the above information in the CM REQ.
>
>server similarly gets its information from its local CA and negotiates
>down the values to the MIN of each side (REP.InitiatorDepth =
>MIN(REQ.ResponderResources, server's local CA's Initiator depth);
>REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's
>responder resources). If server does not support RDMA Reads, it can
>REJ.
>
>If client decides the negotiated values are insufficient to meet its
>goals, it can disconnect.
>
>Each side sets its QP parameters via modify QP appropriately. Note they
>too will be mirror images of each other:
>client:
>QP.Max RDMA Reads as Initiator = REP.ResponderResources
>QP.Max RDMA reads as responder = REP.InitiatorDepth
>
>server:
>QP.Max RDMA Reads as responder = REP.ResponderResources
>QP.Max RDMA reads as initiator = REP.InitiatorDepth
>
>We have done a lot of high stress RDMA Read traffic with Mellanox HCAs
>and provided the above negotiation is followed, we have seen no issues.
>Note however that by default a Mellanox HCA typically reports a large
>InitiatorDepth (128) and a modest ResponderResources (4-8). Hence when
>I hear that Responder Resources must be grown to 128 for some
>application to reliably work, it implies the negotiation I outlined
>above is not being followed.
>
>Note that the ordering rules in table 76 of IBTA 1.2 show how reads and
>writes on a send queue are ordered. There are many cases where an op can
>pass an outstanding RDMA read, hence it is not always bad to queue extra
>RDMA reads. If needed, the Fence can be sent to force order.
>For many apps, it's going to be better to get the items onto queue and
>let the QP handle the outstanding reads cases rather than have the app
>add a level of queuing for this purpose. Letting the HCA do the queuing
>will allow for a more rapid initiation of subsequent reads.
>
>Todd Rimmer

From rdreier at cisco.com Tue Jun 6 13:21:13 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:21:13 -0700
Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class
In-Reply-To: <44861416.2864.88C6C5@rkuchimanchi.silverstorm.com> ( Ramachandra K.'s message of "Tue, 06 Jun 2006 23:47:34 +0530")
References: <1149171133.7588.45.camel@Prawra.gs-lab.com> <44861416.2864.88C6C5@rkuchimanchi.silverstorm.com>
Message-ID: 

Ramachandra> Thanks a lot Roland. But there was also a patch for
Ramachandra> ibsrpdm to display the IO class of the target. I am
Ramachandra> including it below for your convenience.

Yes, I still have this in my queue. I need to rewrite this because I don't think ibsrpdm should generate io_class options for rev. 16a targets, since that will break old kernels for no reason. But I will apply it soon.

- R.

From rdreier at cisco.com Tue Jun 6 13:23:58 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:23:58 -0700
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: <20060606181132.GA4701@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 21:11:32 +0300")
References: <20060606181132.GA4701@mellanox.co.il>
Message-ID: 

> 1. Cache should create ib_update_work objects statically upon hotplug event.

Seems reasonable, since multiple pending cache update events can be safely coalesced into one.

> 2. Need a mechanism for cache to consume events which trigger cache updates,
> and delay reporting them to ULPs until after cache is updated.

This seems like overkill to me.
And I don't see how to avoid GFP_ATOMIC allocations that might fail, since the cache module would need to maintain an arbitrary-length queue of pending events. - R. From mst at mellanox.co.il Tue Jun 6 13:31:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:31:03 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606203103.GF4719@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > > 1. Cache should create ib_update_work objects statically upon hotplug event. > > Seems reasonable, since multiple pending cache update events can be > safely coalesced into one. > > > 2. Need a mechanism for cache to consume events which trigger cache updates, > > and delay reporting them to ULPs until after cache is updated. > > This seems like overkill to me. How then can we solve the problem of IPoIB querying the cache as a result of an event, and getting a stale value? Note we are actually seeing this in practice when changing pkeys. > And I don't see how to avoid > GFP_ATOMIC allocations that might fail, since the cache module would > need to maintain an arbitrary-length queue of pending events. IMO order of events is typically not important, so we only need to handle up to 6 different events in some kind of bitmask. -- MST From mst at mellanox.co.il Tue Jun 6 13:35:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:35:38 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606203538.GG4719@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > > 1. Cache should create ib_update_work objects statically upon hotplug event. > > Seems reasonable, since multiple pending cache update events can be > safely coalesced into one. > > > 2. 
Need a mechanism for cache to consume events which trigger cache updates,
> > and delay reporting them to ULPs until after cache is updated.
>
> This seems like overkill to me. And I don't see how to avoid
> GFP_ATOMIC allocations that might fail, since the cache module would
> need to maintain an arbitrary-length queue of pending events.

Hmm. Thinking about it some more - how about generating the events from a mad thread in core rather than from provider? Then this would be thread context so cache could simply perform updates inline in event handler.

--
MST

From rdreier at cisco.com Tue Jun 6 13:40:51 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:40:51 -0700
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: <20060606203103.GF4719@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 23:31:03 +0300")
References: <20060606203103.GF4719@mellanox.co.il>
Message-ID: 

Michael> How then can we solve the problem of IPoIB querying the
Michael> cache as a result of an event, and getting a stale value?
Michael> Note we are actually seeing this in practice when
Michael> changing pkeys.

It doesn't seem like a severe problem to me -- IPoIB will just check again in another second, right? The whole intention of the cache interface is that it should only be used when a stale value is not fatal. So if this isn't good enough for IPoIB, then it should just query the P_Key table directly. But of course even that could return stale values, since there's no guarantee that the P_Key table doesn't change immediately after the query operation.

Michael> IMO order of events is typically not important, so we
Michael> only need to handle up to 6 different events in some kind
Michael> of bitmask.

This seems like a strong statement -- certainly there's a big difference between the sequence "port active" then "port error" vs. "port error" then "port active".
Also coalescing events means that the sequences "port error", "port active", "port error" vs. just "port error", "port active" can't be distinguished. So I think the proposed cure may be worse than the disease here. - R. From rdreier at cisco.com Tue Jun 6 13:42:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 13:42:48 -0700 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: <20060606203538.GG4719@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 23:35:38 +0300") References: <20060606203538.GG4719@mellanox.co.il> Message-ID: Michael> Hmm. Thinking about it some more - how about generating Michael> the events from a mad thread in core rather than from Michael> povider? Then this would be thread context so cache Michael> could simply perform updates inline in event handler. You have the same problem with allocating storage in atomic context for an arbitrary-length queue of events. - R. From mst at mellanox.co.il Tue Jun 6 13:43:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:43:54 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606204354.GH4719@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > Michael> Hmm. Thinking about it some more - how about generating > Michael> the events from a mad thread in core rather than from > Michael> povider? Then this would be thread context so cache > Michael> could simply perform updates inline in event handler. > > You have the same problem with allocating storage in atomic context > for an arbitrary-length queue of events. No, on mad thread we can allocate with GFP_KERNEL I think. -- MST From mst at mellanox.co.il Tue Jun 6 13:48:11 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Tue, 6 Jun 2006 23:48:11 +0300
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To:
References:
Message-ID: <20060606204811.GA5472@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: RFC: ib_cache_event problems
>
> Michael> How then can we solve the problem of IPoIB querying the
> Michael> cache as a result of an event, and getting a stale value?
> Michael> Note we are actually seeing this in practice when
> Michael> changing pkeys.
>
> It doesn't seem like a severe problem to me -- IPoIB will just check
> again in another second, right?

That would solve the problem, but -

int ipoib_ib_dev_up(struct net_device *dev)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);

	ipoib_pkey_dev_check_presence(dev);

	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
		ipoib_dbg(priv, "PKEY is not assigned.\n");
		return 0;
	}

	set_bit(IPOIB_FLAG_OPER_UP, &priv->flags);

	return ipoib_mcast_start_thread(dev);
}

This doesn't seem to retry anything.

--
MST

From rdreier at cisco.com Tue Jun 6 13:49:05 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:49:05 -0700
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: <20060606204354.GH4719@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 23:43:54 +0300")
References: <20060606204354.GH4719@mellanox.co.il>
Message-ID: 

Michael> No, on mad thread we can allocate with GFP_KERNEL I think.

But how do you get into thread context? Events are generated in interrupt context, and if you want to defer the work to process context, then you have to store the information somewhere.

- R.

From rdreier at cisco.com Tue Jun 6 13:51:59 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:51:59 -0700
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: <20060606204811.GA5472@mellanox.co.il> (Michael S.
Tsirkin's message of "Tue, 6 Jun 2006 23:48:11 +0300") References: <20060606204811.GA5472@mellanox.co.il> Message-ID: Michael> That would solve the problem, but - Michael> int ipoib_ib_dev_up(struct net_device *dev) Michael> This doesn't seem to retry anything. ipoib_main calls ipoib_pkey_dev_delay_open() before it tries ipoib_ib_dev_up(). So it should be OK if the P_Key isn't assigned yet. - R. From mst at mellanox.co.il Tue Jun 6 13:54:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:54:00 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606205400.GB5472@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > Michael> That would solve the problem, but - > > Michael> int ipoib_ib_dev_up(struct net_device *dev) > > Michael> This doesn't seem to retry anything. > > ipoib_main calls ipoib_pkey_dev_delay_open() before it tries > ipoib_ib_dev_up(). So it should be OK if the P_Key isn't assigned > yet. But ipoib_ib_dev_flush doesn't? -- MST From mst at mellanox.co.il Tue Jun 6 13:56:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:56:34 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606205634.GC5472@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > Michael> No, on mad thread we can allocate with GFP_KERNEL I think. > > But how do you get into thread context? Events are generated in > interrupt context, and if you want to defer the work to process > context, then you have to store the information somewhere. We already process incoming MADs in thread context in core. So all events related to MADs could be generated there. But you might be right - it might be easier to fix ULPs. -- MST From mst at mellanox.co.il Tue Jun 6 13:57:25 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 6 Jun 2006 23:57:25 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606205725.GD5472@mellanox.co.il> Quoting r. Roland Dreier : > Michael> IMO order of events is typically not important, so we > Michael> only need to handle up to 6 different events in some kind > Michael> of bitmask. > > This seems like a strong statement Maybe add "cache updated" event for ULPs to listen on? -- MST From rdreier at cisco.com Tue Jun 6 13:59:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 13:59:57 -0700 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: <20060606205400.GB5472@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 23:54:00 +0300") References: <20060606205400.GB5472@mellanox.co.il> Message-ID: Michael> But ipoib_ib_dev_flush doesn't? Ah, that looks like the bug I guess. What's the situation? SM clears P_Key table and then later readds a P_Key? - R. From mst at mellanox.co.il Tue Jun 6 14:03:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 7 Jun 2006 00:03:36 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606210336.GE5472@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > Michael> But ipoib_ib_dev_flush doesn't? > > Ah, that looks like the bug I guess. What's the situation? SM clears > P_Key table and then later readds a P_Key? Yes, something like that. -- MST From mshefty at ichips.intel.com Tue Jun 6 14:26:15 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Jun 2006 14:26:15 -0700 Subject: [openib-general] [PATCH 1/3] verbs: add call to initialize ib_ah_attr from a work completion In-Reply-To: References: Message-ID: <4485F2F7.3020807@ichips.intel.com> Sean Hefty wrote: > Expose a new call to initialize address handle attributes from a work > completion. 
This functionality is duplicated by both verbs and the CM. Is there any objection to committing this patch set? 1. Export ib_init_ah_from_wc() call in verbs. 2. Add ib_init_ah_from_path() call to ib_sa. 3. Convert CM to use exported calls. This eliminates some duplicated code, and centralizes the initialization of ib_ah_attr to reduce the chance of users setting the global routing flag incorrectly. - Sean From rdreier at cisco.com Tue Jun 6 14:27:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 14:27:58 -0700 Subject: [openib-general] [PATCH 1/3] verbs: add call to initialize ib_ah_attr from a work completion In-Reply-To: <4485F2F7.3020807@ichips.intel.com> (Sean Hefty's message of "Tue, 06 Jun 2006 14:26:15 -0700") References: <4485F2F7.3020807@ichips.intel.com> Message-ID: Sean> Is there any objection to committing this patch set? I think it's fine. Should I queue it for 2.6.18 too? From sean.hefty at intel.com Tue Jun 6 14:31:40 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 14:31:40 -0700 Subject: [openib-general] [PATCH 1/3] verbs: add call to initialize ib_ah_attr from a work completion In-Reply-To: Message-ID: >I think it's fine. Should I queue it for 2.6.18 too? That probably makes sense. I'll send a couple of svn revs that should be safe to pull into 2.6.18 after committing this. - Sean From sweitzen at cisco.com Tue Jun 6 14:35:25 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 6 Jun 2006 14:35:25 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: Arlin, I'm having trouble running Intel MPI 2.0.1 and OFED 1.0 rc5 with Intel MPI Benchmark 2.3 on a 32-node PCI-X RHEL4 U3 i686 cluster. This thread caught my eye, can you look at my output and tell me if this is the same issue? If not, are there other things I can tune, or should I file a bug somewhere? 
$ .../intelmpi-2.0.1-`uname -m`/bin/mpiexec -genv I_MPI_DEBUG 3 -genv I_MPI_DEVICE rdssm -genv LD_LIBRARY_PATH .../intelmpi-2.0.1-`uname -m`/lib -n 32 .../IMB_2.3/src/IMB-MPI1 PingPong
I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so
I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma
[the two lines above repeat once per rank, 32 times in all]
aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(531): Initialization failed
MPID_Init(146): channel initialization failed
MPIDI_CH3_Init(937):
MPIDI_CH3_Progress(328): MPIDI_CH3I_RDMA_wait_connect failed in VC_post_connect
(unknown)(): (null)
[the error block above repeats for each aborting rank]
aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(531): Initialization failed
MPID_Init(146): channel initialization failed MPIDI_CH3_Init(937): MPIDI_CH3_Progress(328): MPIDI_CH3I_RDMA_wait_connect failed in VC_post_connect (unknown)(): (null) rank 10 in job 1 192.168.1.1_33715 caused collective abort of all ranks exit status of rank 10: killed by signal 9 rank 1 in job 1 192.168.1.1_33715 caused collective abort of all ranks exit status of rank 1: killed by signal 9 rank 0 in job 1 192.168.1.1_33715 caused collective abort of all ranks exit status of rank 0: killed by signal 9 [releng at svbu-qaclus-1 intel.intel]$ Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Arlin Davis > Sent: Monday, June 05, 2006 5:17 PM > To: Lentini, James > Cc: 'openib-general' > Subject: [openib-general] [PATCH] uDAPL openib-cma provider - > add support for IB_CM_REQ_OPTIONS > > James, > > Here is a patch to the openib-cma provider that uses the new > set_option feature of the uCMA to > adjust connect request timeout and retry values. The defaults > are a little quick for some consumers. > They are now bumped up from 3 retries to 15 and are tunable > with uDAPL environment variables. Also, > included a fix to disallow any event after a disconnect event. > > You need to sync up the commit with Sean's patch for the uCMA > get/set IB_CM_REQ_OPTIONS. > > I would like to get this in OFED RC6 if possible. 
> > Thanks, > > -arlin > > > > Signed-off by: Arlin Davis ardavis at ichips.intel.com > > Index: dapl/openib_cma/dapl_ib_util.c > =================================================================== > --- dapl/openib_cma/dapl_ib_util.c (revision 7694) > +++ dapl/openib_cma/dapl_ib_util.c (working copy) > @@ -264,7 +264,15 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_N > /* set inline max with env or default, get local lid > and gid 0 */ > hca_ptr->ib_trans.max_inline_send = > dapl_os_get_env_val("DAPL_MAX_INLINE", > INLINE_SEND_DEFAULT); > - > + > + /* set CM timer defaults */ > + hca_ptr->ib_trans.max_cm_timeout = > + dapl_os_get_env_val("DAPL_MAX_CM_RESPONSE_TIME", > + IB_CM_RESPONSE_TIMEOUT); > + hca_ptr->ib_trans.max_cm_retries = > + dapl_os_get_env_val("DAPL_MAX_CM_RETRIES", > + IB_CM_RETRIES); > + > /* EVD events without direct CQ channels, non-blocking */ > hca_ptr->ib_trans.ib_cq = > ibv_create_comp_channel(hca_ptr->ib_hca_handle); > Index: dapl/openib_cma/dapl_ib_cm.c > =================================================================== > --- dapl/openib_cma/dapl_ib_cm.c (revision 7694) > +++ dapl/openib_cma/dapl_ib_cm.c (working copy) > @@ -58,6 +58,7 @@ > #include "dapl_ib_util.h" > #include > #include > +#include > > extern struct rdma_event_channel *g_cm_events; > > @@ -85,7 +86,6 @@ static inline uint64_t cpu_to_be64(uint6 > (unsigned short)((SID % IB_PORT_MOD) + IB_PORT_BASE) :\ > (unsigned short)SID) > > - > static void dapli_addr_resolve(struct dapl_cm_id *conn) > { > int ret; > @@ -114,6 +114,8 @@ static void dapli_addr_resolve(struct da > static void dapli_route_resolve(struct dapl_cm_id *conn) > { > int ret; > + size_t optlen = sizeof(struct ib_cm_req_opt); > + struct ib_cm_req_opt req_opt; > #ifdef DAPL_DBG > struct rdma_addr *ipaddr = &conn->cm_id->route.addr; > struct ib_addr *ibaddr = &conn->cm_id->route.addr.addr.ibaddr; > @@ -143,13 +145,43 @@ static void dapli_route_resolve(struct d > cpu_to_be64(ibaddr->dgid.global.interface_id)); > > 
dapl_dbg_log(DAPL_DBG_TYPE_CM, > - " rdma_connect: cm_id %p pdata %p plen %d rr %d > ind %d\n", > + " route_resolve: cm_id %p pdata %p plen %d rr > %d ind %d\n", > conn->cm_id, > conn->params.private_data, > conn->params.private_data_len, > conn->params.responder_resources, > conn->params.initiator_depth ); > > + /* Get default connect request timeout values, and adjust */ > + ret = rdma_get_option(conn->cm_id, RDMA_PROTO_IB, > IB_CM_REQ_OPTIONS, > + (void*)&req_opt, &optlen); > + if (ret) { > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " > rdma_get_option failed: %s\n", > + strerror(errno)); > + goto bail; > + } > + > + dapl_dbg_log(DAPL_DBG_TYPE_CM, " route_resolve: " > + "Set CR times - response %d to %d, retry > %d to %d\n", > + req_opt.remote_cm_response_timeout, > + conn->hca->ib_trans.max_cm_timeout, > + req_opt.max_cm_retries, > + conn->hca->ib_trans.max_cm_retries); > + > + /* Use hca response time setting for connect requests */ > + req_opt.max_cm_retries = conn->hca->ib_trans.max_cm_retries; > + req_opt.remote_cm_response_timeout = > + conn->hca->ib_trans.max_cm_timeout; > + req_opt.local_cm_response_timeout = > + req_opt.remote_cm_response_timeout; > + ret = rdma_set_option(conn->cm_id, RDMA_PROTO_IB, > IB_CM_REQ_OPTIONS, > + (void*)&req_opt, optlen); > + if (ret) { > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " > rdma_set_option failed: %s\n", > + strerror(errno)); > + goto bail; > + } > + > ret = rdma_connect(conn->cm_id, &conn->params); > if (ret) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_connect > failed: %s\n", > @@ -273,14 +305,37 @@ static void dapli_cm_active_cb(struct da > } > dapl_os_unlock(&conn->lock); > > + /* There is a chance that we can get events after > + * the consumer calls disconnect in a pending state > + * since the IB CM and uDAPL states are not shared. > + * In some cases, IB CM could generate either a DCONN > + * or CONN_ERR after the consumer returned from > + * dapl_ep_disconnect with a DISCONNECTED event > + * already queued. 
Check state here and bail to > + * avoid any events after a disconnect. > + */ > + if (DAPL_BAD_HANDLE(conn->ep, DAPL_MAGIC_EP)) > + return; > + > + dapl_os_lock(&conn->ep->header.lock); > + if (conn->ep->param.ep_state == DAT_EP_STATE_DISCONNECTED) { > + dapl_os_unlock(&conn->ep->header.lock); > + return; > + } > + if (event->event == RDMA_CM_EVENT_DISCONNECTED) > + conn->ep->param.ep_state = DAT_EP_STATE_DISCONNECTED; > + > + dapl_os_unlock(&conn->ep->header.lock); > + > switch (event->event) { > case RDMA_CM_EVENT_UNREACHABLE: > case RDMA_CM_EVENT_CONNECT_ERROR: > - dapl_dbg_log( > - DAPL_DBG_TYPE_WARN, > - " dapli_cm_active_handler: CONN_ERR " > - " event=0x%x status=%d\n", > - event->event, event->status); > + dapl_dbg_log( > + DAPL_DBG_TYPE_WARN, > + " dapli_cm_active_handler: CONN_ERR " > + " event=0x%x status=%d %s\n", > + event->event, event->status, > + (event->status == -110)?"TIMEOUT":"" ); > > dapl_evd_connection_callback(conn, > > IB_CME_DESTINATION_UNREACHABLE, > @@ -368,25 +423,23 @@ static void dapli_cm_passive_cb(struct d > event->private_data, > new_conn->sp); > break; > case RDMA_CM_EVENT_UNREACHABLE: > - dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, > - NULL, conn->sp); > - > case RDMA_CM_EVENT_CONNECT_ERROR: > > dapl_dbg_log( > - DAPL_DBG_TYPE_WARN, > - " dapli_cm_passive: CONN_ERR " > - " event=0x%x status=%d", > - " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", > - event->event, event->status, > - ntohl(((struct sockaddr_in *) > - &ipaddr->src_addr)->sin_addr.s_addr), > - ntohs(((struct sockaddr_in *) > - &ipaddr->src_addr)->sin_port), > - ntohl(((struct sockaddr_in *) > - &ipaddr->dst_addr)->sin_addr.s_addr), > - ntohs(((struct sockaddr_in *) > - &ipaddr->dst_addr)->sin_port)); > + DAPL_DBG_TYPE_WARN, > + " dapli_cm_passive: CONN_ERR " > + " event=0x%x status=%d %s" > + " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", > + event->event, event->status, > + (event->status == -110)?"TIMEOUT":"", > + ntohl(((struct sockaddr_in *) > + 
&ipaddr->src_addr)->sin_addr.s_addr), > + ntohs(((struct sockaddr_in *) > + &ipaddr->src_addr)->sin_port), > + ntohl(((struct sockaddr_in *) > + &ipaddr->dst_addr)->sin_addr.s_addr), > + ntohs(((struct sockaddr_in *) > + &ipaddr->dst_addr)->sin_port)); > > dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, > NULL, conn->sp); > Index: dapl/openib_cma/dapl_ib_util.h > =================================================================== > --- dapl/openib_cma/dapl_ib_util.h (revision 7694) > +++ dapl/openib_cma/dapl_ib_util.h (working copy) > @@ -67,8 +67,8 @@ typedef ib_hca_handle_t dapl_ibal_ca_t; > > #define IB_RC_RETRY_COUNT 7 > #define IB_RNR_RETRY_COUNT 7 > -#define IB_CM_RESPONSE_TIMEOUT 18 /* 1 sec */ > -#define IB_MAX_CM_RETRIES 7 > +#define IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */ > +#define IB_CM_RETRIES 15 > #define IB_REQ_MRA_TIMEOUT 27 /* a little over 9 minutes */ > #define IB_MAX_AT_RETRY 3 > #define IB_TARGET_MAX 4 /* max_qp_ous_rd_atom */ > @@ -252,6 +252,8 @@ typedef struct _ib_hca_transport > ib_async_cq_handler_t async_cq_error; > ib_async_dto_handler_t async_cq; > ib_async_qp_handler_t async_qp_error; > + uint8_t max_cm_timeout; > + uint8_t max_cm_retries; > > } ib_hca_transport_t; > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From jlentini at netapp.com Tue Jun 6 14:51:17 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 6 Jun 2006 17:51:17 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: Message-ID: On Mon, 5 Jun 2006, Arlin Davis wrote: > Here is a patch to the openib-cma provider that uses the new > set_option feature of the uCMA to adjust connect request timeout and > retry values. The defaults are a little quick for some consumers. 
> They are now bumped up from 3 retries to 15 and are tunable with > uDAPL environment variables. Also, included a fix to disallow any > event after a disconnect event. Committed in revision 7755. > I would like to get this in OFED RC6 if possible. Who is the gatekeeper for OFED? One of us should bring this to their attention, but I'm not sure who to contact. From mshefty at ichips.intel.com Tue Jun 6 14:57:00 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Jun 2006 14:57:00 -0700 Subject: [openib-general] svn changes for git 2.6.18 Message-ID: <4485FA2C.3010900@ichips.intel.com> Roland, The following svn revision change sets should be safe for 2.6.18: 7748 - remove duplicated pkey from SIDR REQ API 7751 - init ah_attr from wc 7752 - init ah_attr from path 7754 - convert CM to use previous two calls - Sean From rdreier at cisco.com Tue Jun 6 14:57:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 14:57:20 -0700 Subject: [openib-general] Re: SRP: [PATCH] Misc cleanups in ib_srp In-Reply-To: <20060604094322.GA9091@mellanox.co.il> (Ishai Rabinovitz's message of "Sun, 4 Jun 2006 12:43:22 +0300") References: <20060604094322.GA9091@mellanox.co.il> Message-ID: Thanks, looks good to me -- applied and queued for 2.6.18. ...for some reason I thought list_move_tail was only in new kernels, so I put it in my git tree but held it back from svn. But of course it's always been there, so I don't know what I was thinking of. - R. From sweitzen at cisco.com Tue Jun 6 15:00:30 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 6 Jun 2006 15:00:30 -0700 Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: Tziporet is the gatekeeper (does that make me the keymaster? :-). 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of James Lentini > Sent: Tuesday, June 06, 2006 2:51 PM > To: Arlin Davis > Cc: 'openib-general' > Subject: [openib-general] Re: [PATCH] uDAPL openib-cma > provider - add support for IB_CM_REQ_OPTIONS > > > > On Mon, 5 Jun 2006, Arlin Davis wrote: > > > Here is a patch to the openib-cma provider that uses the new > > set_option feature of the uCMA to adjust connect request > timeout and > > retry values. The defaults are a little quick for some consumers. > > They are now bumped up from 3 retries to 15 and are tunable with > > uDAPL environment variables. Also, included a fix to disallow any > > event after a disconnect event. > > Committed in revision 7755. > > > I would like to get this in OFED RC6 if possible. > > Who is the gatekeeper for OFED? One of us should bring this to their > attention, but I'm not sure who to contact. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Tue Jun 6 15:11:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 15:11:46 -0700 Subject: [openib-general] Re: SRP [PATCH 0/4] Kernel support for removal and restoration of target In-Reply-To: <20060605153213.GA7472@mellanox.co.il> (Ishai Rabinovitz's message of "Mon, 5 Jun 2006 18:32:13 +0300") References: <20060605153213.GA7472@mellanox.co.il> Message-ID: I haven't read too deeply yet, but something that would help me understand the overall plan here would be an explanation of how one would use the restore_target function. Why would I want to disconnect from a target but keep the kernel's SCSI device hanging around? - R. 
From rdreier at cisco.com Tue Jun 6 15:10:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 15:10:34 -0700 Subject: [openib-general] Re: [PATCH] SRPTOOLS : print out the target io_class in ibsrpdm In-Reply-To: (Ramachandra Kuchimanchi's message of "Fri, 26 May 2006 13:31:44 -0400") References: Message-ID: Thanks, I applied this. From rdreier at cisco.com Tue Jun 6 15:27:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 15:27:52 -0700 Subject: [openib-general] Re: svn changes for git 2.6.18 In-Reply-To: <4485FA2C.3010900@ichips.intel.com> (Sean Hefty's message of "Tue, 06 Jun 2006 14:57:00 -0700") References: <4485FA2C.3010900@ichips.intel.com> Message-ID: OK, I dropped them in for-2.6.18. From ardavis at ichips.intel.com Tue Jun 6 15:47:42 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 06 Jun 2006 15:47:42 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: Message-ID: <4486060E.8000500@ichips.intel.com> Scott Weitzenkamp (sweitzen) wrote: >Arlin, > >I'm having trouble running Intel MPI 2.0.1 and OFED 1.0 rc5 with Intel >MPI Benchmark 2.3 on a 32-node PCI-X RHEL4 U3 i686 cluster. This thread >caught my eye, can you look at my output and tell me if this is the same >issue? If not, are there other things I can tune, or should I file a >bug somewhere? > > > this looks like a configuration issue and not the timeout. The CR timeouts occurred with the rdma device and not the rdssm. Is IPoIB running on the ib0 interfaces across the fabric?
>$ .../intelmpi-2.0.1-`uname -m`/bin/mpiexec -genv I_MPI_DEBUG 3 -genv
>I_MPI_DEVICE rdssm -genv LD_LIBRARY_PATH .../intelmpi-2.0.1-`uname
>-m`/lib -n 32 .../IMB_2.3/src/IMB-MPI1 PingPong
>I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so
>I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma
>[lines repeated for the remaining ranks]
>aborting job:
>Fatal error in MPI_Init: Other MPI error, error stack:
>MPIR_Init_thread(531): Initialization failed
>MPID_Init(146): channel initialization failed
>MPIDI_CH3_Init(937):
>MPIDI_CH3_Progress(328): MPIDI_CH3I_RDMA_wait_connect failed in
>VC_post_connect
>(unknown)(): (null)
>[the error block above repeated for each aborting rank]
>aborting job:
>Fatal error in MPI_Init:
Other MPI error, error stack: >MPIR_Init_thread(531): Initialization failed >MPID_Init(146): channel initialization failed >MPIDI_CH3_Init(937): >MPIDI_CH3_Progress(328): MPIDI_CH3I_RDMA_wait_connect failed in >VC_post_connect >(unknown)(): (null) >aborting job: > > > From sweitzen at cisco.com Tue Jun 6 17:07:53 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 6 Jun 2006 17:07:53 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: > this looks like a configuration issue and not the timeout. The CR > timeouts occurred with > the rdma device and not the rdssm. Is IPoIB running on the ib0 > interfaces across the > fabric? Yes, IPoIB is running. Scott From sean.hefty at intel.com Tue Jun 6 19:36:23 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:36:23 -0700 Subject: [openib-general] [PATCH 0/4] Add support for UD QPs Message-ID: The following patch series adds support for UD QPs to userspace through the RDMA CM. UD QPs are referenced by an IP address and UDP port number. The RDMA CM abstracts SIDR for InfiniBand clients. Signed-off-by: Sean Hefty --- A subsequent patch series will add multicast handling to the UD QPs. From sean.hefty at intel.com Tue Jun 6 19:43:13 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:43:13 -0700 Subject: [openib-general] [PATCH 1/4] IB CM: Save and report remote UD QP attributes after SIDR In-Reply-To: Message-ID: Record remote QP information returned from SIDR. Expose attributes through a new API. This functionality is similar to the ib_cm_init_qp_attr() routine that exists for RC QPs.
Signed-off-by: Sean Hefty --- Index: core/cm.c =================================================================== --- core/cm.c (revision 7758) +++ core/cm.c (working copy) @@ -138,6 +138,7 @@ struct cm_id_private { __be64 tid; __be32 local_qpn; __be32 remote_qpn; + __be32 remote_qkey; enum ib_qp_type qp_type; __be32 sq_psn; __be32 rq_psn; @@ -2836,6 +2837,9 @@ static int cm_sidr_rep_handler(struct cm } cm_id_priv->id.state = IB_CM_IDLE; ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); + + cm_id_priv->remote_qpn = cm_sidr_rep_get_qpn(sidr_rep_msg); + cm_id_priv->remote_qkey = sidr_rep_msg->qkey; spin_unlock_irqrestore(&cm_id_priv->lock, flags); cm_format_sidr_rep_event(work); @@ -3230,6 +3234,29 @@ int ib_cm_init_qp_attr(struct ib_cm_id * } EXPORT_SYMBOL(ib_cm_init_qp_attr); +int ib_cm_get_dst_attr(struct ib_cm_id *cm_id, struct ib_ah_attr *ah_attr, + u32 *remote_qpn, u32 *remote_qkey) +{ + struct cm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct cm_id_private, id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->id.state != IB_CM_IDLE) { + ret = -EINVAL; + goto out; + } + + *ah_attr = cm_id_priv->av.ah_attr; + *remote_qpn = be32_to_cpu(cm_id_priv->remote_qpn); + *remote_qkey = be32_to_cpu(cm_id_priv->remote_qkey); +out: + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} +EXPORT_SYMBOL(ib_cm_get_dst_attr); + static void cm_add_one(struct ib_device *device) { struct cm_device *cm_dev; Index: include/rdma/ib_cm.h =================================================================== --- include/rdma/ib_cm.h (revision 7758) +++ include/rdma/ib_cm.h (working copy) @@ -521,6 +521,18 @@ int ib_cm_init_qp_attr(struct ib_cm_id * int *qp_attr_mask); /** + * ib_cm_get_dst_attr - Initializes the attributes for use in sending + * to a specified UD QP. + * @cm_id: Communication identifier that was used for the SIDR REQ. 
+ * @ah_attr: Address handle attributes that should be used to send to the + * destination QP. + * @remote_qpn: Remote QPN of the destination QP. + * @remote_qkey: Remote QKey of the destination QP. + */ +int ib_cm_get_dst_attr(struct ib_cm_id *cm_id, struct ib_ah_attr *ah_attr, + u32 *remote_qpn, u32 *remote_qkey); + +/** * ib_send_cm_apr - Sends an alternate path response message in response to * a load alternate path request. * @cm_id: Connection identifier associated with the alternate path response. From sean.hefty at intel.com Tue Jun 6 19:49:08 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:49:08 -0700 Subject: [openib-general] [PATCH 2/4] Add support for UD QPs in RDMA CM In-Reply-To: Message-ID: Add support for UD QPs in the RDMA CM. UD QPs are identified by an IP address and UDP port number. The RDMA CM provides resolution of an IP address/port number to a remote QPN / QKey using existing address and route resolution and SIDR. This patch extends the RDMA CM protocol from IB CM REQ messages to IB CM SIDR REQ messages. 
Signed-off-by: Sean Hefty --- Index: core/cma.c =================================================================== --- core/cma.c (revision 7758) +++ core/cma.c (working copy) @@ -66,6 +66,7 @@ static DEFINE_MUTEX(lock); static struct workqueue_struct *cma_wq; static DEFINE_IDR(sdp_ps); static DEFINE_IDR(tcp_ps); +static DEFINE_IDR(udp_ps); struct cma_device { struct list_head list; @@ -473,6 +474,29 @@ int rdma_init_qp_attr(struct rdma_cm_id } EXPORT_SYMBOL(rdma_init_qp_attr); +int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, + struct ib_ah_attr *ah_attr, u32 *remote_qpn, + u32 *remote_qkey) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: + if (!memcmp(&id->route.addr.dst_addr, addr, ip_addr_size(addr))) + ret = ib_cm_get_dst_attr(id_priv->cm_id.ib, ah_attr, + remote_qpn, remote_qkey); + break; + default: + ret = -ENOSYS; + break; + } + + return ret; +} +EXPORT_SYMBOL(rdma_get_dst_attr); + static inline int cma_zero_addr(struct sockaddr *addr) { struct in6_addr *ip6; @@ -496,9 +520,17 @@ static inline int cma_any_addr(struct so return cma_zero_addr(addr) || cma_loopback_addr(addr); } +static inline __be16 cma_port(struct sockaddr *addr) +{ + if (addr->sa_family == AF_INET) + return ((struct sockaddr_in *) addr)->sin_port; + else + return ((struct sockaddr_in6 *) addr)->sin6_port; +} + static inline int cma_any_port(struct sockaddr *addr) { - return !((struct sockaddr_in *) addr)->sin_port; + return !cma_port(addr); } static int cma_get_net_info(void *hdr, enum rdma_port_space ps, @@ -841,8 +873,8 @@ out: return ret; } -static struct rdma_id_private* cma_new_id(struct rdma_cm_id *listen_id, - struct ib_cm_event *ib_event) +static struct rdma_id_private* cma_new_conn_id(struct rdma_cm_id *listen_id, + struct ib_cm_event *ib_event) { struct rdma_id_private *id_priv; struct rdma_cm_id *id; @@ 
-885,6 +917,42 @@ err: return NULL; } +static struct rdma_id_private* cma_new_udp_id(struct rdma_cm_id *listen_id, + struct ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv; + struct rdma_cm_id *id; + union cma_ip_addr *src, *dst; + __u16 port; + u8 ip_ver; + int ret; + + id = rdma_create_id(listen_id->event_handler, listen_id->context, + listen_id->ps); + if (IS_ERR(id)) + return NULL; + + + if (cma_get_net_info(ib_event->private_data, listen_id->ps, + &ip_ver, &port, &src, &dst)) + goto err; + + cma_save_net_info(&id->route.addr, &listen_id->route.addr, + ip_ver, port, src, dst); + + ret = rdma_translate_ip(&id->route.addr.src_addr, + &id->route.addr.dev_addr); + if (ret) + goto err; + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->state = CMA_CONNECT; + return id_priv; +err: + rdma_destroy_id(id); + return NULL; +} + static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) { struct rdma_id_private *listen_id, *conn_id; @@ -897,7 +965,10 @@ static int cma_req_handler(struct ib_cm_ goto out; } - conn_id = cma_new_id(&listen_id->id, ib_event); + if (listen_id->id.ps == RDMA_PS_UDP) + conn_id = cma_new_udp_id(&listen_id->id, ib_event); + else + conn_id = cma_new_conn_id(&listen_id->id, ib_event); if (!conn_id) { ret = -ENOMEM; goto out; @@ -934,8 +1005,7 @@ out: static __be64 cma_get_service_id(enum rdma_port_space ps, struct sockaddr *addr) { - return cpu_to_be64(((u64)ps << 16) + - be16_to_cpu(((struct sockaddr_in *) addr)->sin_port)); + return cpu_to_be64(((u64)ps << 16) + be16_to_cpu(cma_port(addr))); } static void cma_set_compare_data(enum rdma_port_space ps, struct sockaddr *addr, @@ -1586,6 +1656,9 @@ static int cma_get_port(struct rdma_id_p case RDMA_PS_TCP: ps = &tcp_ps; break; + case RDMA_PS_UDP: + ps = &udp_ps; + break; default: return -EPROTONOSUPPORT; } @@ -1664,6 +1737,93 @@ static int cma_format_hdr(void *hdr, enu return 0; } +static int cma_sidr_rep_handler(struct ib_cm_id *cm_id, + struct 
ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv = cm_id->context; + enum rdma_cm_event_type event; + struct ib_cm_sidr_rep_event_param *rep = &ib_event->param.sidr_rep_rcvd; + struct rdma_route *route; + int ret = 0, status; + + atomic_inc(&id_priv->dev_remove); + if (!cma_comp(id_priv, CMA_CONNECT)) + goto out; + + switch (ib_event->event) { + case IB_CM_SIDR_REQ_ERROR: + event = RDMA_CM_EVENT_UNREACHABLE; + status = -ETIMEDOUT; + break; + case IB_CM_SIDR_REP_RECEIVED: + if (rep->status != IB_SIDR_SUCCESS) { + event = RDMA_CM_EVENT_UNREACHABLE; + status = ib_event->param.sidr_rep_rcvd.status; + break; + } + route = &id_priv->id.route; + if (rep->qkey != ntohs(cma_port(&route->addr.dst_addr))) { + event = RDMA_CM_EVENT_UNREACHABLE; + status = -EINVAL; + break; + } + event = RDMA_CM_EVENT_ESTABLISHED; + status = 0; + break; + default: + printk(KERN_ERR "RDMA CMA: unexpected IB CM event: %d", + ib_event->event); + goto out; + } + + ret = cma_notify_user(id_priv, event, status, NULL, 0); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + id_priv->cm_id.ib = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } +out: + cma_release_remove(id_priv); + return ret; +} + +static int cma_resolve_ib_udp(struct rdma_id_private *id_priv) +{ + struct ib_cm_sidr_req_param req; + struct rdma_route *route; + struct cma_hdr hdr; + int ret; + + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, + cma_sidr_rep_handler, id_priv); + if (IS_ERR(id_priv->cm_id.ib)) + return PTR_ERR(id_priv->cm_id.ib); + + route = &id_priv->id.route; + ret = cma_format_hdr(&hdr, id_priv->id.ps, route); + if (ret) + goto out; + + req.path = route->path_rec; + req.service_id = cma_get_service_id(id_priv->id.ps, + &route->addr.dst_addr); + req.timeout_ms = 1 << max(cma_get_ib_remote_timeout(id_priv) - 8, 0); + req.private_data = &hdr; + req.private_data_len = sizeof hdr; + req.max_cm_retries = cma_get_ib_cm_retries(id_priv); + + ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req); +out: + if (ret) { + ib_destroy_cm_id(id_priv->cm_id.ib); + id_priv->cm_id.ib = NULL; + } + return ret; +} + static int cma_connect_ib(struct rdma_id_private *id_priv, struct rdma_conn_param *conn_param) { @@ -1738,7 +1898,10 @@ int rdma_connect(struct rdma_cm_id *id, switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: - ret = cma_connect_ib(id_priv, conn_param); + if (id->ps == RDMA_PS_UDP) + ret = cma_resolve_ib_udp(id_priv); + else + ret = cma_connect_ib(id_priv, conn_param); break; default: ret = -ENOSYS; @@ -1780,6 +1943,21 @@ static int cma_accept_ib(struct rdma_id_ return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } +static int cma_send_sidr_rep(struct rdma_id_private *id_priv, + enum ib_cm_sidr_status status) +{ + struct ib_cm_sidr_rep_param rep; + + memset(&rep, 0, sizeof rep); + rep.status = status; + if (status == IB_SIDR_SUCCESS) { + rep.qp_num = id_priv->qp_num; + rep.qkey = ntohs(cma_port(&id_priv->id.route.addr.src_addr)); + } + + return 
ib_send_cm_sidr_rep(id_priv->cm_id.ib, &rep); +} + int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1797,7 +1975,9 @@ int rdma_accept(struct rdma_cm_id *id, s switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: - if (conn_param) + if (id->ps == RDMA_PS_UDP) + ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS); + else if (conn_param) ret = cma_accept_ib(id_priv, conn_param); else ret = cma_rep_recv(id_priv); @@ -1830,9 +2010,12 @@ int rdma_reject(struct rdma_cm_id *id, c switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: - ret = ib_send_cm_rej(id_priv->cm_id.ib, - IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, - private_data, private_data_len); + if (id->ps == RDMA_PS_UDP) + ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT); + else + ret = ib_send_cm_rej(id_priv->cm_id.ib, + IB_CM_REJ_CONSUMER_DEFINED, NULL, + 0, private_data, private_data_len); break; default: ret = -ENOSYS; @@ -1995,6 +2178,7 @@ static void cma_cleanup(void) destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); idr_destroy(&tcp_ps); + idr_destroy(&udp_ps); } module_init(cma_init); Index: include/rdma/rdma_cm.h =================================================================== --- include/rdma/rdma_cm.h (revision 7758) +++ include/rdma/rdma_cm.h (working copy) @@ -212,9 +212,15 @@ struct rdma_conn_param { /** * rdma_connect - Initiate an active connection request. + * @id: Connection identifier to connect. + * @conn_param: Connection information used for connected QPs. * * Users must have resolved a route for the rdma_cm_id to connect with * by having called rdma_resolve_route before calling this routine. + * + * This call will either connect to a remote QP or obtain remote QP + * information for unconnected rdma_cm_id's. The actual operation is + * based on the rdma_cm_id's port space. 
*/ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param); Index: include/rdma/rdma_cm_ib.h =================================================================== --- include/rdma/rdma_cm_ib.h (revision 7758) +++ include/rdma/rdma_cm_ib.h (working copy) @@ -44,6 +44,22 @@ int rdma_set_ib_paths(struct rdma_cm_id *id, struct ib_sa_path_rec *path_rec, int num_paths); +/** + * rdma_get_dst_attr - Retrieve information about a UDP destination. + * @id: Connection identifier associated with the request. + * @addr: Address of remote destination to retrieve information about. + * @ah_attr: Address handle attributes. A caller uses these attributes to + * create an address handle when communicating with the destination. + * @remote_qpn: The remote QP number associated with the UDP address. + * @remote_qkey: The QKey of the remote QP. + * + * Users must have called rdma_connect() to resolve the destination for a + * UD QP, or rdma_join_multicast() for multicast destinations. + */ +int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, + struct ib_ah_attr *ah_attr, u32 *remote_qpn, + u32 *remote_qkey); + struct ib_cm_req_opt { u8 remote_cm_response_timeout; u8 local_cm_response_timeout; From sean.hefty at intel.com Tue Jun 6 19:52:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:52:47 -0700 Subject: [openib-general] [PATCH 3/4] uverbs: export ib_copy_ah_attr_to_user In-Reply-To: Message-ID: Export the ib_copy_ah_attr_to_user() routine to allow copying ib_ah_attr to userspace to support UD QPs.
Signed-off-by: Sean Hefty --- Index: core/uverbs_marshall.c =================================================================== --- core/uverbs_marshall.c (revision 7758) +++ core/uverbs_marshall.c (working copy) @@ -32,8 +32,8 @@ #include -static void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst, - struct ib_ah_attr *src) +void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst, + struct ib_ah_attr *src) { memcpy(dst->grh.dgid, src->grh.dgid.raw, sizeof src->grh.dgid); dst->grh.flow_label = src->grh.flow_label; @@ -47,6 +47,7 @@ static void ib_copy_ah_attr_to_user(stru dst->is_global = src->ah_flags & IB_AH_GRH ? 1 : 0; dst->port_num = src->port_num; } +EXPORT_SYMBOL(ib_copy_ah_attr_to_user); void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, struct ib_qp_attr *src) Index: include/rdma/ib_marshall.h =================================================================== --- include/rdma/ib_marshall.h (revision 7758) +++ include/rdma/ib_marshall.h (working copy) @@ -41,6 +41,9 @@ void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, struct ib_qp_attr *src); +void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst, + struct ib_ah_attr *src); + void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst, struct ib_sa_path_rec *src); From sean.hefty at intel.com Tue Jun 6 19:57:24 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:57:24 -0700 Subject: [openib-general] [PATCH 4/4] uCMA: export UD QP support to userspace In-Reply-To: Message-ID: Export the RDMA CM's support of UD QPs to the userspace library. Signed-off-by: Sean Hefty --- My intent is to bump the ABI version only once. The multicast patches will not increment the ABI. 
Index: core/ucma.c =================================================================== --- core/ucma.c (revision 7758) +++ core/ucma.c (working copy) @@ -41,6 +41,7 @@ #include #include #include +#include #include "ucma_ib.h" @@ -291,7 +292,7 @@ static ssize_t ucma_create_id(struct ucm return -ENOMEM; ctx->uid = cmd.uid; - ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, RDMA_PS_TCP); + ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, cmd.ps); if (IS_ERR(ctx->cm_id)) { ret = PTR_ERR(ctx->cm_id); goto err1; @@ -736,6 +737,40 @@ static ssize_t ucma_set_option(struct uc return ret; } +static ssize_t ucma_get_dst_attr(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_get_dst_attr cmd; + struct rdma_ucm_dst_attr_resp resp; + struct ib_ah_attr ah_attr; + struct ucma_context *ctx; + int ret; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_get_dst_attr(ctx->cm_id, (struct sockaddr *) &cmd.addr, + &ah_attr, &resp.remote_qpn, &resp.remote_qkey); + if (ret) + goto out; + + ib_copy_ah_attr_to_user(&resp.ah_attr, &ah_attr); + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; +out: + ucma_put_ctx(ctx); + return ret; +} + static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) = { @@ -753,7 +788,8 @@ static ssize_t (*ucma_cmd_table[])(struc [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, [RDMA_USER_CM_CMD_GET_OPTION] = ucma_get_option, - [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option + [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option, + [RDMA_USER_CM_CMD_GET_DST_ATTR] = ucma_get_dst_attr }; static ssize_t ucma_write(struct file *filp, const char __user *buf, Index: include/rdma/rdma_user_cm.h 
=================================================================== --- include/rdma/rdma_user_cm.h (revision 7758) +++ include/rdma/rdma_user_cm.h (working copy) @@ -38,7 +38,7 @@ #include #include -#define RDMA_USER_CM_ABI_VERSION 1 +#define RDMA_USER_CM_ABI_VERSION 2 #define RDMA_MAX_PRIVATE_DATA 256 @@ -58,6 +58,7 @@ enum { RDMA_USER_CM_CMD_GET_EVENT, RDMA_USER_CM_CMD_GET_OPTION, RDMA_USER_CM_CMD_SET_OPTION, + RDMA_USER_CM_CMD_GET_DST_ATTR }; /* @@ -72,6 +73,8 @@ struct rdma_ucm_cmd_hdr { struct rdma_ucm_create_id { __u64 uid; __u64 response; + __u16 ps; + __u8 reserved[6]; }; struct rdma_ucm_create_id_resp { @@ -171,6 +174,18 @@ struct rdma_ucm_init_qp_attr { __u32 qp_state; }; +struct rdma_ucm_dst_attr_resp { + __u32 remote_qpn; + __u32 remote_qkey; + struct ib_uverbs_ah_attr ah_attr; +}; + +struct rdma_ucm_get_dst_attr { + __u64 response; + struct sockaddr_in6 addr; + __u32 id; +}; + struct rdma_ucm_get_event { __u64 response; }; From sean.hefty at intel.com Tue Jun 6 20:08:57 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 20:08:57 -0700 Subject: [openib-general] [PATCH 1/2] libibverbs: add helper functions for UD QP support Message-ID: Adds some helper functions to simplify using UD QPs. Add new routines: ibv_init_ah_from_wc() and ibv_create_ah_from_wc() to simplify UD QP communication. Expose ibv_copy_ah_attr_from_kern to retrieve ibv_ah_attr from kernel for a UD QP. 
Signed-off-by: Sean Hefty --- Index: include/infiniband/verbs.h =================================================================== --- include/infiniband/verbs.h (revision 7636) +++ include/infiniband/verbs.h (working copy) @@ -298,6 +298,15 @@ struct ibv_global_route { uint8_t traffic_class; }; +struct ibv_grh { + uint32_t version_tclass_flow; + uint16_t paylen; + uint8_t next_hdr; + uint8_t hop_limit; + union ibv_gid sgid; + union ibv_gid dgid; +}; + enum ibv_rate { IBV_RATE_MAX = 0, IBV_RATE_2_5_GBPS = 2, @@ -952,6 +961,36 @@ static inline int ibv_post_recv(struct i struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); /** + * ibv_init_ah_from_wc - Initializes address handle attributes from a + * work completion. + * @context: Device context on which the received message arrived. + * @port_num: Port on which the received message arrived. + * @wc: Work completion associated with the received message. + * @grh: References the received global route header. This parameter is + * ignored unless the work completion indicates that the GRH is valid. + * @ah_attr: Returned attributes that can be used when creating an address + * handle for replying to the message. + */ +int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, + struct ibv_wc *wc, struct ibv_grh *grh, + struct ibv_ah_attr *ah_attr); + +/** + * ibv_create_ah_from_wc - Creates an address handle associated with the + * sender of the specified work completion. + * @pd: The protection domain associated with the address handle. + * @wc: Work completion information associated with a received message. + * @grh: References the received global route header. This parameter is + * ignored unless the work completion indicates that the GRH is valid. + * @port_num: The outbound port number to associate with the address. + * + * The address handle is used to reference a local or global destination + * in all UD QP post sends. 
+ */ +struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, + struct ibv_grh *grh, uint8_t port_num); + +/** * ibv_destroy_ah - Destroy an address handle. */ int ibv_destroy_ah(struct ibv_ah *ah); Index: include/infiniband/marshall.h =================================================================== --- include/infiniband/marshall.h (revision 7636) +++ include/infiniband/marshall.h (working copy) @@ -51,6 +51,9 @@ BEGIN_C_DECLS void ibv_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, struct ibv_kern_qp_attr *src); +void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, + struct ibv_kern_ah_attr *src); + void ibv_copy_path_rec_from_kern(struct ibv_sa_path_rec *dst, struct ibv_kern_path_rec *src); Index: src/libibverbs.map =================================================================== --- src/libibverbs.map (revision 7636) +++ src/libibverbs.map (working copy) @@ -32,6 +32,8 @@ IBVERBS_1.0 { ibv_modify_qp; ibv_destroy_qp; ibv_create_ah; + ibv_init_ah_from_wc; + ibv_create_ah_from_wc; ibv_destroy_ah; ibv_attach_mcast; ibv_detach_mcast; @@ -65,6 +67,7 @@ IBVERBS_1.0 { ibv_cmd_attach_mcast; ibv_cmd_detach_mcast; ibv_copy_qp_attr_from_kern; + ibv_copy_ah_attr_from_kern; ibv_copy_path_rec_from_kern; ibv_copy_path_rec_to_kern; ibv_rate_to_mult; Index: src/verbs.c =================================================================== --- src/verbs.c (revision 7636) +++ src/verbs.c (working copy) @@ -42,6 +42,7 @@ #include #include #include +#include #include "ibverbs.h" @@ -392,6 +393,62 @@ struct ibv_ah *ibv_create_ah(struct ibv_ return ah; } +static int ibv_find_gid_index(struct ibv_context *context, uint8_t port_num, + union ibv_gid *gid) +{ + union ibv_gid sgid; + int i = 0, ret; + + do { + ret = ibv_query_gid(context, port_num, i++, &sgid); + } while (!ret && memcmp(&sgid, gid, sizeof *gid)); + + return ret ? 
ret : i - 1; +} + +int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, + struct ibv_wc *wc, struct ibv_grh *grh, + struct ibv_ah_attr *ah_attr) +{ + uint32_t flow_class; + int ret; + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = wc->slid; + ah_attr->sl = wc->sl; + ah_attr->src_path_bits = wc->dlid_path_bits; + ah_attr->port_num = port_num; + + if (wc->wc_flags & IBV_WC_GRH) { + ah_attr->is_global = 1; + ah_attr->grh.dgid = grh->sgid; + + ret = ibv_find_gid_index(context, port_num, &grh->dgid); + if (ret < 0) + return ret; + + ah_attr->grh.sgid_index = (uint8_t) ret; + flow_class = ntohl(grh->version_tclass_flow); + ah_attr->grh.flow_label = flow_class & 0xFFFFF; + ah_attr->grh.hop_limit = grh->hop_limit; + ah_attr->grh.traffic_class = (flow_class >> 20) & 0xFF; + } + return 0; +} + +struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, + struct ibv_grh *grh, uint8_t port_num) +{ + struct ibv_ah_attr ah_attr; + int ret; + + ret = ibv_init_ah_from_wc(pd->context, port_num, wc, grh, &ah_attr); + if (ret) + return NULL; + + return ibv_create_ah(pd, &ah_attr); +} + int ibv_destroy_ah(struct ibv_ah *ah) { return ah->context->ops.destroy_ah(ah); Index: src/marshall.c =================================================================== --- src/marshall.c (revision 7636) +++ src/marshall.c (working copy) @@ -38,8 +38,8 @@ #include -static void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, - struct ibv_kern_ah_attr *src) +void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, + struct ibv_kern_ah_attr *src) { memcpy(dst->grh.dgid.raw, src->grh.dgid, sizeof dst->grh.dgid); dst->grh.flow_label = src->grh.flow_label; Index: ChangeLog =================================================================== --- ChangeLog (revision 7636) +++ ChangeLog (working copy) @@ -1,3 +1,13 @@ +2006-06-07 Sean Hefty + + * src/verbs.c include/infiniband/verbs.h: Add new routines: + ibv_init_ah_from_wc() and ibv_create_ah_from_wc() to 
simplify UD QP + communication. + + * src/marshall.c include/infiniband/marshall.h: Expose + ibv_copy_ah_attr_from_kern to retrieve ibv_ah_attr from kernel for + a UD QP. + 2006-06-01 Roland Dreier * src/device.c (ibv_get_device_list): Actually return a From sean.hefty at intel.com Tue Jun 6 20:15:43 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 20:15:43 -0700 Subject: [openib-general] [PATCH 2/2] librdmacm: add UD QP support for userspace clients In-Reply-To: Message-ID: Add support for UD QPs to the RDMA CM library, along with a goofy test program. Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_cma_ib.h =================================================================== --- include/rdma/rdma_cma_ib.h (revision 7743) +++ include/rdma/rdma_cma_ib.h (working copy) @@ -44,4 +44,19 @@ struct ib_cm_req_opt { uint8_t max_cm_retries; }; +/** + * rdma_get_dst_attr - Retrieve information about a UDP destination. + * @id: Connection identifier associated with the request. + * @addr: Address of remote destination to retrieve information about. + * @ah_attr: Address handle attributes. A caller uses these attributes to + * create an address handle when communicating with the destination. + * @qpn: The remote QP number associated with the UDP address. + * @qkey: The QKey of the remote QP. + * + * Users must have called rdma_connect() to resolve the destination information. 
+ */ +int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, + struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, + uint32_t *remote_qkey); + #endif /* RDMA_CMA_IB_H */ Index: include/rdma/rdma_cma_abi.h =================================================================== --- include/rdma/rdma_cma_abi.h (revision 7636) +++ include/rdma/rdma_cma_abi.h (working copy) @@ -40,7 +40,7 @@ */ #define RDMA_USER_CM_MIN_ABI_VERSION 1 -#define RDMA_USER_CM_MAX_ABI_VERSION 1 +#define RDMA_USER_CM_MAX_ABI_VERSION 2 #define RDMA_MAX_PRIVATE_DATA 256 @@ -60,6 +60,7 @@ enum { UCMA_CMD_GET_EVENT, UCMA_CMD_GET_OPTION, UCMA_CMD_SET_OPTION, + UCMA_CMD_GET_DST_ATTR }; struct ucma_abi_cmd_hdr { @@ -68,9 +69,16 @@ struct ucma_abi_cmd_hdr { __u16 out; }; +struct ucma_abi_create_id_v1 { + __u64 uid; + __u64 response; +}; + struct ucma_abi_create_id { __u64 uid; __u64 response; + __u16 ps; + __u8 reserved[6]; }; struct ucma_abi_create_id_resp { @@ -170,6 +178,18 @@ struct ucma_abi_init_qp_attr { __u32 qp_state; }; +struct ucma_abi_dst_attr_resp { + __u32 remote_qpn; + __u32 remote_qkey; + struct ibv_kern_ah_attr ah_attr; +}; + +struct ucma_abi_get_dst_attr { + __u64 response; + struct sockaddr_in6 addr; + __u32 id; +}; + struct ucma_abi_get_event { __u64 response; }; Index: include/rdma/rdma_cma.h =================================================================== --- include/rdma/rdma_cma.h (revision 7743) +++ include/rdma/rdma_cma.h (working copy) @@ -54,6 +54,11 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DEVICE_REMOVAL, }; +enum rdma_port_space { + RDMA_PS_TCP = 0x0106, + RDMA_PS_UDP = 0x0111, +}; + /* Protocol levels for get/set options. */ enum { RDMA_PROTO_IP = 0, @@ -90,6 +95,7 @@ struct rdma_cm_id { void *context; struct ibv_qp *qp; struct rdma_route route; + enum rdma_port_space ps; uint8_t port_num; }; @@ -121,9 +127,11 @@ void rdma_destroy_event_channel(struct r * @id: A reference where the allocated communication identifier will be * returned. 
* @context: User specified context associated with the rdma_cm_id. + * @ps: RDMA port space. */ int rdma_create_id(struct rdma_event_channel *channel, - struct rdma_cm_id **id, void *context); + struct rdma_cm_id **id, void *context, + enum rdma_port_space ps); /** * rdma_destroy_id - Release a communication identifier. @@ -194,6 +202,10 @@ struct rdma_conn_param { uint8_t flow_control; uint8_t retry_count; /* ignored when accepting */ uint8_t rnr_retry_count; + /* Fields below ignored if a QP is created on the rdma_cm_id. */ + uint8_t srq; + uint32_t qp_num; + enum ibv_qp_type qp_type; }; /** @@ -227,7 +239,8 @@ int rdma_reject(struct rdma_cm_id *id, c uint8_t private_data_len); /** - * rdma_disconnect - This function disconnects the associated QP. + * rdma_disconnect - This function disconnects the associated QP and + * transitions it into the error state. */ int rdma_disconnect(struct rdma_cm_id *id); @@ -278,4 +291,18 @@ int rdma_get_option(struct rdma_cm_id *i int rdma_set_option(struct rdma_cm_id *id, int level, int optname, void *optval, size_t optlen); +static inline uint16_t rdma_get_src_port(struct rdma_cm_id *id) +{ + return id->route.addr.src_addr.sin6_family == PF_INET6 ? + id->route.addr.src_addr.sin6_port : + ((struct sockaddr_in *) &id->route.addr.src_addr)->sin_port; +} + +static inline uint16_t rdma_get_dst_port(struct rdma_cm_id *id) +{ + return id->route.addr.dst_addr.sin6_family == PF_INET6 ? 
+ id->route.addr.dst_addr.sin6_port : + ((struct sockaddr_in *) &id->route.addr.dst_addr)->sin_port; +} + #endif /* RDMA_CMA_H */ Index: src/cma.c =================================================================== --- src/cma.c (revision 7636) +++ src/cma.c (working copy) @@ -54,6 +54,7 @@ #include #include #include +#include #define PFX "librdmacm: " @@ -203,7 +204,7 @@ static int ucma_init(void) dev_list = ibv_get_device_list(NULL); if (!dev_list) { - printf("CMA: unable to get RDMA device liste\n"); + printf("CMA: unable to get RDMA device list\n"); ret = -ENODEV; goto err; } @@ -301,7 +302,8 @@ static void ucma_free_id(struct cma_id_p } static struct cma_id_private *ucma_alloc_id(struct rdma_event_channel *channel, - void *context) + void *context, + enum rdma_port_space ps) { struct cma_id_private *id_priv; @@ -311,6 +313,7 @@ static struct cma_id_private *ucma_alloc memset(id_priv, 0, sizeof *id_priv); id_priv->id.context = context; + id_priv->id.ps = ps; id_priv->id.channel = channel; pthread_mutex_init(&id_priv->mut, NULL); if (pthread_cond_init(&id_priv->cond, NULL)) @@ -322,8 +325,44 @@ err: ucma_free_id(id_priv); return NULL; } +static int ucma_create_id_v1(struct rdma_event_channel *channel, + struct rdma_cm_id **id, void *context, + enum rdma_port_space ps) +{ + struct ucma_abi_create_id_resp *resp; + struct ucma_abi_create_id_v1 *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + if (ps != RDMA_PS_TCP) { + fprintf(stderr, "librdmacm: Kernel ABI does not support " + "requested port space.\n"); + return -EPROTONOSUPPORT; + } + + id_priv = ucma_alloc_id(channel, context, ps); + if (!id_priv) + return -ENOMEM; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_CREATE_ID, size); + cmd->uid = (uintptr_t) id_priv; + + ret = write(channel->fd, msg, size); + if (ret != size) + goto err; + + id_priv->handle = resp->id; + *id = &id_priv->id; + return 0; + +err: ucma_free_id(id_priv); + return ret; +} + int rdma_create_id(struct 
rdma_event_channel *channel, - struct rdma_cm_id **id, void *context) + struct rdma_cm_id **id, void *context, + enum rdma_port_space ps) { struct ucma_abi_create_id_resp *resp; struct ucma_abi_create_id *cmd; @@ -335,12 +374,16 @@ int rdma_create_id(struct rdma_event_cha if (ret) return ret; - id_priv = ucma_alloc_id(channel, context); + if (abi_ver == 1) + return ucma_create_id_v1(channel, id, context, ps); + + id_priv = ucma_alloc_id(channel, context, ps); if (!id_priv) return -ENOMEM; CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_CREATE_ID, size); cmd->uid = (uintptr_t) id_priv; + cmd->ps = ps; ret = write(channel->fd, msg, size); if (ret != size) @@ -637,6 +680,36 @@ static int ucma_init_ib_qp(struct cma_id IBV_QP_PKEY_INDEX | IBV_QP_PORT); } +static int ucma_init_ud_qp(struct cma_id_private *id_priv, struct ibv_qp *qp) +{ + struct ibv_qp_attr qp_attr; + struct ib_addr *ibaddr; + int ret; + + ibaddr = &id_priv->id.route.addr.addr.ibaddr; + ret = ucma_find_pkey(id_priv->cma_dev, id_priv->id.port_num, + ibaddr->pkey, &qp_attr.pkey_index); + if (ret) + return ret; + + qp_attr.port_num = id_priv->id.port_num; + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.qkey = ntohs(rdma_get_src_port(&id_priv->id)); + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | + IBV_QP_PORT | IBV_QP_QKEY); + if (ret) + return ret; + + qp_attr.qp_state = IBV_QPS_RTR; + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE); + if (ret) + return ret; + + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.sq_psn = 0; + return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_SQ_PSN); +} + int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr) { @@ -652,7 +725,10 @@ int rdma_create_qp(struct rdma_cm_id *id if (!qp) return -ENOMEM; - ret = ucma_init_ib_qp(id_priv, qp); + if (id->ps == RDMA_PS_UDP) + ret = ucma_init_ud_qp(id_priv, qp); + else + ret = ucma_init_ib_qp(id_priv, qp); if (ret) goto err; @@ -670,11 +746,12 @@ void rdma_destroy_qp(struct 
rdma_cm_id * static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, struct rdma_conn_param *src, - struct ibv_qp *qp) + uint32_t qp_num, + enum ibv_qp_type qp_type, uint8_t srq) { - dst->qp_num = qp->qp_num; - dst->qp_type = qp->qp_type; - dst->srq = (qp->srq != NULL); + dst->qp_num = qp_num; + dst->qp_type = qp_type; + dst->srq = srq; dst->responder_resources = src->responder_resources; dst->initiator_depth = src->initiator_depth; dst->flow_control = src->flow_control; @@ -700,7 +777,15 @@ int rdma_connect(struct rdma_cm_id *id, CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_CONNECT, size); id_priv = container_of(id, struct cma_id_private, id); cmd->id = id_priv->handle; - ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, id->qp); + if (id->qp) + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, + id->qp->qp_num, id->qp->qp_type, + (id->qp->srq != NULL)); + else + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, + conn_param->qp_num, + conn_param->qp_type, + conn_param->srq); ret = write(id->channel->fd, msg, size); if (ret != size) @@ -735,15 +820,25 @@ int rdma_accept(struct rdma_cm_id *id, s void *msg; int ret, size; - ret = ucma_modify_qp_rtr(id); - if (ret) - return ret; + if (id->ps != RDMA_PS_UDP) { + ret = ucma_modify_qp_rtr(id); + if (ret) + return ret; + } CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_ACCEPT, size); id_priv = container_of(id, struct cma_id_private, id); cmd->id = id_priv->handle; cmd->uid = (uintptr_t) id_priv; - ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, id->qp); + if (id->qp) + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, + id->qp->qp_num, id->qp->qp_type, + (id->qp->srq != NULL)); + else + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, + conn_param->qp_num, + conn_param->qp_type, + conn_param->srq); ret = write(id->channel->fd, msg, size); if (ret != size) { @@ -845,7 +940,8 @@ static int ucma_process_conn_req(struct int ret; listen_id_priv = 
container_of(event->id, struct cma_id_private, id); - id_priv = ucma_alloc_id(event->id->channel, event->id->context); + id_priv = ucma_alloc_id(event->id->channel, event->id->context, + event->id->ps); if (!id_priv) { ucma_destroy_kern_id(event->id->channel->fd, handle); ret = -ENOMEM; @@ -967,6 +1063,9 @@ retry: } break; case RDMA_CM_EVENT_ESTABLISHED: + if (id_priv->id.ps == RDMA_PS_UDP) + break; + evt->status = ucma_process_establish(&id_priv->id); if (evt->status) { evt->event = RDMA_CM_EVENT_CONNECT_ERROR; @@ -1041,3 +1140,32 @@ int rdma_set_option(struct rdma_cm_id *i return 0; } + +int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, + struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, + uint32_t *remote_qkey) +{ + struct ucma_abi_dst_attr_resp *resp; + struct ucma_abi_get_dst_attr *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, addrlen; + + addrlen = ucma_addrlen(addr); + if (!addrlen) + return -EINVAL; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_GET_DST_ATTR, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + memcpy(&cmd->addr, addr, addrlen); + + ret = write(id->channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + ibv_copy_ah_attr_from_kern(ah_attr, &resp->ah_attr); + *remote_qpn = resp->remote_qpn; + *remote_qkey = resp->remote_qkey; + return 0; +} Index: src/librdmacm.map =================================================================== --- src/librdmacm.map (revision 7636) +++ src/librdmacm.map (working copy) @@ -18,5 +18,6 @@ RDMACM_1.0 { rdma_ack_cm_event; rdma_get_option; rdma_set_option; + rdma_get_dst_attr; local: *; }; Index: librdmacm.spec.in =================================================================== --- librdmacm.spec.in (revision 7636) +++ librdmacm.spec.in (working copy) @@ -66,3 +66,4 @@ rm -rf $RPM_BUILD_ROOT %defattr(-,root,root) %{_bindir}/rping %{_bindir}/ucmatose +%{_bindir}/udaddy Index: Makefile.am =================================================================== --- Makefile.am (revision 7743) +++ Makefile.am (working copy) @@ -18,11 +18,13 @@ endif src_librdmacm_la_SOURCES = src/cma.c src_librdmacm_la_LDFLAGS = -avoid-version $(rdmacm_version_script) -bin_PROGRAMS = examples/ucmatose examples/rping +bin_PROGRAMS = examples/ucmatose examples/rping examples/udaddy examples_ucmatose_SOURCES = examples/cmatose.c examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la examples_rping_SOURCES = examples/rping.c examples_rping_LDADD = $(top_builddir)/src/librdmacm.la +examples_udaddy_SOURCES = examples/udaddy.c +examples_udaddy_LDADD = $(top_builddir)/src/librdmacm.la librdmacmincludedir = $(includedir)/rdma Index: examples/rping.c =================================================================== --- examples/rping.c (revision 7636) +++ examples/rping.c (working copy) @@ -1028,7 +1028,7 @@ int main(int argc, char *argv[]) goto out; } - ret = rdma_create_id(cb->cm_channel, &cb->cm_id, cb); + ret = rdma_create_id(cb->cm_channel, &cb->cm_id, cb, RDMA_PS_TCP); if (ret) { ret = errno; fprintf(stderr, "rdma_create_id error %d\n", ret); Index: examples/udaddy.c =================================================================== 
--- examples/udaddy.c (revision 0) +++ examples/udaddy.c (revision 0) @@ -0,0 +1,636 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +/* + * To execute: + * Server: rdma_cmatose + * Client: rdma_cmatose "dst_ip=ip" + */ + +struct cmatest_node { + int id; + struct rdma_cm_id *cma_id; + int connected; + struct ibv_pd *pd; + struct ibv_cq *cq; + struct ibv_mr *mr; + struct ibv_ah *ah; + uint32_t remote_qpn; + uint32_t remote_qkey; + void *mem; +}; + +struct cmatest { + struct rdma_event_channel *channel; + struct cmatest_node *nodes; + int conn_index; + int connects_left; + + struct sockaddr_in dst_in; + struct sockaddr *dst_addr; + struct sockaddr_in src_in; + struct sockaddr *src_addr; +}; + +static struct cmatest test; +static int connections = 1; +static int message_size = 100; +static int message_count = 10; +static int is_server; + +static int create_message(struct cmatest_node *node) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + node->mem = malloc(message_size + sizeof(struct ibv_grh)); + if (!node->mem) { + printf("failed message allocation\n"); + return -1; + } + node->mr = ibv_reg_mr(node->pd, node->mem, + message_size + sizeof(struct ibv_grh), + IBV_ACCESS_LOCAL_WRITE); + if (!node->mr) { + printf("failed to reg MR\n"); + goto err; + } + return 0; +err: + free(node->mem); + return -1; +} + +static int init_node(struct cmatest_node *node) +{ + struct ibv_qp_init_attr init_qp_attr; + int cqe, ret; + + node->pd = ibv_alloc_pd(node->cma_id->verbs); + if (!node->pd) { + ret = -ENOMEM; + printf("cmatose: unable to allocate PD\n"); + goto out; + } + + cqe = message_count ? message_count * 2 : 2; + node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + if (!node->cq) { + ret = -ENOMEM; + printf("cmatose: unable to create CQ\n"); + goto out; + } + + memset(&init_qp_attr, 0, sizeof init_qp_attr); + init_qp_attr.cap.max_send_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_recv_wr = message_count ? 
message_count : 1; + init_qp_attr.cap.max_send_sge = 1; + init_qp_attr.cap.max_recv_sge = 1; + init_qp_attr.qp_context = node; + init_qp_attr.sq_sig_all = 0; + init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.send_cq = node->cq; + init_qp_attr.recv_cq = node->cq; + ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); + if (ret) { + printf("cmatose: unable to create QP: %d\n", ret); + goto out; + } + + ret = create_message(node); + if (ret) { + printf("cmatose: failed to create messages: %d\n", ret); + goto out; + } +out: + return ret; +} + +static int post_recvs(struct cmatest_node *node) +{ + struct ibv_recv_wr recv_wr, *recv_failure; + struct ibv_sge sge; + int i, ret = 0; + + if (!message_count) + return 0; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + recv_wr.wr_id = (uintptr_t) node; + + sge.length = message_size + sizeof(struct ibv_grh); + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && !ret; i++ ) { + ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + if (ret) { + printf("failed to post receives: %d\n", ret); + break; + } + } + return ret; +} + +static int post_sends(struct cmatest_node *node, int signal_flag) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + struct ibv_sge sge; + int i, ret = 0; + + if (!node->connected || !message_count) + return 0; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND_WITH_IMM; + send_wr.send_flags = IBV_SEND_INLINE | signal_flag; + send_wr.wr_id = (unsigned long)node; + send_wr.imm_data = htonl(node->cma_id->qp->qp_num); + + send_wr.wr.ud.ah = node->ah; + send_wr.wr.ud.remote_qpn = node->remote_qpn; + send_wr.wr.ud.remote_qkey = node->remote_qkey; + + sge.length = message_size - sizeof(struct ibv_grh); + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && !ret; i++) { + ret = ibv_post_send(node->cma_id->qp, &send_wr, 
&bad_send_wr); + if (ret) + printf("failed to post sends: %d\n", ret); + } + return ret; +} + +static void connect_error(void) +{ + test.connects_left--; +} + +static int addr_handler(struct cmatest_node *node) +{ + int ret; + + ret = rdma_resolve_route(node->cma_id, 2000); + if (ret) { + printf("cmatose: resolve route failed: %d\n", ret); + connect_error(); + } + return ret; +} + +static int route_handler(struct cmatest_node *node) +{ + struct rdma_conn_param conn_param; + int ret; + + ret = init_node(node); + if (ret) + goto err; + + ret = post_recvs(node); + if (ret) + goto err; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.qp_num = node->cma_id->qp->qp_num; + conn_param.qp_type = node->cma_id->qp->qp_type; + conn_param.retry_count = 5; + ret = rdma_connect(node->cma_id, &conn_param); + if (ret) { + printf("cmatose: failure connecting: %d\n", ret); + goto err; + } + return 0; +err: + connect_error(); + return ret; +} + +static int connect_handler(struct rdma_cm_id *cma_id) +{ + struct cmatest_node *node; + struct rdma_conn_param conn_param; + int ret; + + if (test.conn_index == connections) { + ret = -ENOMEM; + goto err1; + } + node = &test.nodes[test.conn_index++]; + + node->cma_id = cma_id; + cma_id->context = node; + + ret = init_node(node); + if (ret) + goto err2; + + ret = post_recvs(node); + if (ret) + goto err2; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.qp_num = node->cma_id->qp->qp_num; + conn_param.qp_type = node->cma_id->qp->qp_type; + ret = rdma_accept(node->cma_id, &conn_param); + if (ret) { + printf("cmatose: failure accepting: %d\n", ret); + goto err2; + } + node->connected = 1; + test.connects_left--; + return 0; + +err2: + node->cma_id = NULL; + connect_error(); +err1: + printf("cmatose: failing connection request\n"); + rdma_reject(cma_id, NULL, 0); + return ret; +} + +static int resolved_handler(struct cmatest_node *node) +{ + struct ibv_ah_attr ah_attr; + int ret; + + ret = rdma_get_dst_attr(node->cma_id, 
test.dst_addr, &ah_attr, + &node->remote_qpn, &node->remote_qkey); + if (ret) { + printf("udaddy: failure getting destination attributes\n"); + goto err; + } + + node->ah = ibv_create_ah(node->pd, &ah_attr); + if (!node->ah) { + printf("udaddy: failure creating address handle\n"); + goto err; + } + + node->connected = 1; + test.connects_left--; + return 0; +err: + connect_error(); + return ret; +} + +static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) +{ + int ret = 0; + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + ret = addr_handler(cma_id->context); + break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: + ret = route_handler(cma_id->context); + break; + case RDMA_CM_EVENT_CONNECT_REQUEST: + ret = connect_handler(cma_id); + break; + case RDMA_CM_EVENT_ESTABLISHED: + ret = resolved_handler(cma_id->context); + break; + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + printf("cmatose: event: %d, error: %d\n", event->event, + event->status); + connect_error(); + ret = event->status; + break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: + /* Cleanup will occur after test completes. 
*/ + break; + default: + break; + } + return ret; +} + +static void destroy_node(struct cmatest_node *node) +{ + if (!node->cma_id) + return; + + if (node->ah) + ibv_destroy_ah(node->ah); + + if (node->cma_id->qp) + rdma_destroy_qp(node->cma_id); + + if (node->cq) + ibv_destroy_cq(node->cq); + + if (node->mem) { + ibv_dereg_mr(node->mr); + free(node->mem); + } + + if (node->pd) + ibv_dealloc_pd(node->pd); + + /* Destroy the RDMA ID after all device resources */ + rdma_destroy_id(node->cma_id); +} + +static int alloc_nodes(void) +{ + int ret, i; + + test.nodes = malloc(sizeof *test.nodes * connections); + if (!test.nodes) { + printf("cmatose: unable to allocate memory for test nodes\n"); + return -ENOMEM; + } + memset(test.nodes, 0, sizeof *test.nodes * connections); + + for (i = 0; i < connections; i++) { + test.nodes[i].id = i; + if (!is_server) { + ret = rdma_create_id(test.channel, + &test.nodes[i].cma_id, + &test.nodes[i], RDMA_PS_UDP); + if (ret) + goto err; + } + } + return 0; +err: + while (--i >= 0) + rdma_destroy_id(test.nodes[i].cma_id); + free(test.nodes); + return ret; +} + +static void destroy_nodes(void) +{ + int i; + + for (i = 0; i < connections; i++) + destroy_node(&test.nodes[i]); + free(test.nodes); +} + +static void create_reply_ah(struct cmatest_node *node, struct ibv_wc *wc) +{ + node->ah = ibv_create_ah_from_wc(node->pd, wc, node->mem, + node->cma_id->port_num); + node->remote_qpn = ntohl(wc->imm_data); + node->remote_qkey = ntohs(rdma_get_dst_port(node->cma_id)); +} + +static int poll_cqs(void) +{ + struct ibv_wc wc[8]; + int done, i, ret; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + if (ret < 0) { + printf("cmatose: failed polling CQ: %d\n", ret); + return ret; + } + + if (ret && !test.nodes[i].ah) + create_reply_ah(&test.nodes[i], wc); + } + } + return 0; +} + +static int connect_events(void) +{ + 
struct rdma_cm_event *event; + int ret = 0; + + while (test.connects_left && !ret) { + ret = rdma_get_cm_event(test.channel, &event); + if (!ret) { + ret = cma_handler(event->id, event); + rdma_ack_cm_event(event); + } + } + return ret; +} + +static int run_server(void) +{ + struct rdma_cm_id *listen_id; + int i, ret; + + printf("cmatose: starting server\n"); + ret = rdma_create_id(test.channel, &listen_id, &test, RDMA_PS_UDP); + if (ret) { + printf("cmatose: listen request failed\n"); + return ret; + } + + test.src_in.sin_family = PF_INET; + test.src_in.sin_port = 7174; + ret = rdma_bind_addr(listen_id, test.src_addr); + if (ret) { + printf("cmatose: bind address failed: %d\n", ret); + return ret; + } + + ret = rdma_listen(listen_id, 0); + if (ret) { + printf("cmatose: failure trying to listen: %d\n", ret); + goto out; + } + + connect_events(); + + if (message_count) { + printf("receiving data transfers\n"); + ret = poll_cqs(); + if (ret) + goto out; + + printf("sending replies\n"); + for (i = 0; i < connections; i++) { + ret = post_sends(&test.nodes[i], IBV_SEND_SIGNALED); + if (ret) + goto out; + } + + ret = poll_cqs(); + if (ret) + goto out; + printf("data transfers complete\n"); + } +out: + rdma_destroy_id(listen_id); + return ret; +} + +static int get_addr(char *dst, struct sockaddr_in *addr) +{ + struct addrinfo *res; + int ret; + + ret = getaddrinfo(dst, NULL, NULL, &res); + if (ret) { + printf("getaddrinfo failed - invalid hostname or IP address\n"); + return ret; + } + + if (res->ai_family != PF_INET) { + ret = -1; + goto out; + } + + *addr = *(struct sockaddr_in *) res->ai_addr; +out: + freeaddrinfo(res); + return ret; +} + +static int run_client(char *dst, char *src) +{ + int i, ret; + + printf("cmatose: starting client\n"); + if (src) { + ret = get_addr(src, &test.src_in); + if (ret) + return ret; + } + + ret = get_addr(dst, &test.dst_in); + if (ret) + return ret; + + test.dst_in.sin_port = 7174; + + printf("cmatose: connecting\n"); + for (i = 0; i < 
connections; i++) { + ret = rdma_resolve_addr(test.nodes[i].cma_id, + src ? test.src_addr : NULL, + test.dst_addr, 2000); + if (ret) { + printf("cmatose: failure getting addr: %d\n", ret); + connect_error(); + return ret; + } + } + + ret = connect_events(); + if (ret) + goto out; + + if (message_count) { + printf("initiating data transfers\n"); + for (i = 0; i < connections; i++) { + ret = post_sends(&test.nodes[i], 0); + if (ret) + goto out; + } + printf("receiving data transfers\n"); + ret = poll_cqs(); + if (ret) + goto out; + + printf("data transfers complete\n"); + } +out: + return ret; +} + +int main(int argc, char **argv) +{ + int ret; + + if (argc > 3) { + printf("usage: %s [server_addr [src_addr]]\n", argv[0]); + exit(1); + } + is_server = (argc == 1); + + test.dst_addr = (struct sockaddr *) &test.dst_in; + test.src_addr = (struct sockaddr *) &test.src_in; + test.connects_left = connections; + + test.channel = rdma_create_event_channel(); + if (!test.channel) { + printf("failed to create event channel\n"); + exit(1); + } + + if (alloc_nodes()) + exit(1); + + if (is_server) + ret = run_server(); + else + ret = run_client(argv[1], (argc == 3) ? 
argv[2] : NULL); + + printf("test complete\n"); + destroy_nodes(); + rdma_destroy_event_channel(test.channel); + + printf("return status %d\n", ret); + return ret; +} Index: examples/cmatose.c =================================================================== --- examples/cmatose.c (revision 7636) +++ examples/cmatose.c (working copy) @@ -380,7 +380,7 @@ static int alloc_nodes(void) if (!is_server) { ret = rdma_create_id(test.channel, &test.nodes[i].cma_id, - &test.nodes[i]); + &test.nodes[i], RDMA_PS_TCP); if (ret) goto err; } @@ -466,7 +466,7 @@ static int run_server(void) int i, ret; printf("cmatose: starting server\n"); - ret = rdma_create_id(test.channel, &listen_id, &test); + ret = rdma_create_id(test.channel, &listen_id, &test, RDMA_PS_TCP); if (ret) { printf("cmatose: listen request failed\n"); return ret; From yipeeyipeeyipeeyipee at yahoo.com Tue Jun 6 23:14:33 2006 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Wed, 7 Jun 2006 06:14:33 +0000 (UTC) Subject: [openib-general] Mellanox raw QP Message-ID: Hi, Can I create raw QP (MLX type) using the openIB API? Is this possible from user space ? I searched for the right API for this but couldn't find any such way. Would I have to do this from kernel? Thanks, x From dotanb at mellanox.co.il Tue Jun 6 23:37:37 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 7 Jun 2006 09:37:37 +0300 Subject: [openib-general] Mellanox raw QP Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30243FCFE@mtlexch01.mtl.com> > Can I create raw QP (MLX type) using the openIB API? > Is this possible from user space ? I searched for the right > API for this but > couldn't find any such way. > Would I have to do this from kernel? There isn't any API to create raw QP from user level, so I believe that the answer is yes ... 
Dotan

From k_mahesh85 at yahoo.co.in Tue Jun 6 23:44:23 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Wed, 7 Jun 2006 07:44:23 +0100 (BST)
Subject: [openib-general] repost - problem with memory registration - RDMA kernel utility
Message-ID: <20060607064423.80428.qmail@web8316.mail.in.yahoo.com>

Can anybody suggest the correct way to register a buffer for doing RDMA operations? I have already posted my code in the previous thread, but it is not working. It is a kernel utility and I obtained the buffer with kmalloc; how can I now register it in order to perform RDMA operations on it?

-Mahesh
-------------- next part --------------
An HTML attachment was scrubbed...
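On the registration question above: in the 2.6.16-era kernel verbs API, the usual approach was not to register each kmalloc'd buffer individually, but to take a DMA MR covering kernel memory and DMA-map the buffer for the HCA. The following is a C-style pseudocode sketch only — the names follow ib_verbs.h of that period and must be verified against the tree actually in use:

```
/* Pseudocode sketch, not a drop-in answer: names are from the
 * 2.6.16-era kernel verbs API (ib_verbs.h); check them against
 * the running tree. */

pd = ib_alloc_pd(device);

/* One DMA MR covers all of kernel physical memory, so arbitrary
 * kmalloc'd buffers need no per-buffer registration. */
mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE |
                       IB_ACCESS_REMOTE_READ |
                       IB_ACCESS_REMOTE_WRITE);

buf = kmalloc(size, GFP_KERNEL);
dma_addr = dma_map_single(device->dma_device, buf, size,
                          DMA_BIDIRECTIONAL);

/* Use the mapped bus address and the DMA MR keys in the SGE. */
sge.addr = dma_addr;
sge.length = size;
sge.lkey = mr->lkey;      /* advertise mr->rkey for remote RDMA */
```

A common bug with kmalloc'd buffers was posting the kernel virtual address in the SGE instead of the address returned by dma_map_single().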
URL: From ogerlitz at voltaire.com Wed Jun 7 02:52:22 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 7 Jun 2006 12:52:22 +0300 (IDT) Subject: [openib-general] crash in ib_sa_mcmember_rec_callback while probing out ib_sa Message-ID: By mistake i was trying to bringup ib1 where port 1 and not 2 was the active port, and then got this crash on the rmmod script which is doing: ifconfig ib0 down ifconfig ib1 down modprobe -r ib_ipoib modprobe -r ib_mthca this is the dmesg crash - it happened over x86 with svn 7772 Or. ADDRCONF(NETDEV_UP): ib1: link is not ready Unable to handle kernel paging request at virtual address f8dd6758 printing eip: f8dd6758 *pde = 37c99067 *pte = 00000000 Oops: 0000 [#1] SMP Modules linked in: parport_pc lp parport autofs4 nfs lockd sunrpc button battery ac ipv6 ohci_hcd i2c_amd8111 i2c_core hw_random shpchp ib_mthca ib_sa ib_mad ib_core e100 mii tg3 floppy dm_snapshot dm_zero dm_mirror dm_mod sata_sil libata sd_mod scsi_mod CPU: 1 EIP: 0060:[] Not tainted VLI EFLAGS: 00210246 (2.6.16 #1) EIP is at 0xf8dd6758 eax: 00000000 ebx: ef2a2594 ecx: ef2a25a0 edx: f599beec esi: f38a5bec edi: f38a5bf4 ebp: fffffffc esp: f599be60 ds: 007b es: 007b ss: 0068 Process modprobe (pid: 20746, threadinfo=f599a000 task=f6411aa0) Stack: <0>f8dd1721 fffffffc 00000000 ecd95798 f66f9000 00000000 00000000 f599beb8 00000000 00000022 00000001 0000000f 00200286 f7878ec8 c03217dc c0150cfa f7fff200 6b00002c f4c9d668 f7fff200 ef2a2594 f38a5bf4 f599bef4 f599beec Call Trace: [] ib_sa_mcmember_rec_callback+0x43/0x4e [ib_sa] [] _spin_unlock_irqrestore+0x9/0xe [] poison_obj+0x21/0x41 [] send_handler+0x39/0x88 [ib_sa] [] cancel_mads+0x111/0x12f [ib_mad] [] unregister_mad_agent+0xe/0xae [ib_mad] [] ib_unregister_mad_agent+0x13/0x1f [ib_mad] [] ib_sa_remove_one+0x3c/0x6e [ib_sa] [] ib_unregister_client+0x34/0xb0 [ib_core] [] ib_sa_cleanup+0xa/0x17 [ib_sa] [] sys_delete_module+0x129/0x162 [] do_munmap+0xe7/0xf3 [] sys_munmap+0x4d/0x69 [] sysenter_past_esp+0x54/0x75 Code: 
Bad EIP value. BUG: modprobe/20746, lock held at task exit time! [f8db4280] {device_mutex} .. held by: modprobe:20746 [f6411aa0, 118] ... acquired at: ib_unregister_client+0x12/0xb0 [ib_core] From bpradip at in.ibm.com Wed Jun 7 05:54:39 2006 From: bpradip at in.ibm.com (Pradipta Kr. Banerjee) Date: Wed, 07 Jun 2006 18:24:39 +0530 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: Message-ID: <4486CC8F.3050601@in.ibm.com> Sundeep Narravula wrote: >> By the way, I assume you configured, rebuilt and reinstalled libibverbs, >> librdmacm, and libamso? > > Yes. I have done these. > >> I do not see this on my systems using a 2.6.16.5 kernel on a SUSE 9.2 >> distro. What distro/kernel verions? > > The kernel used is 2.6.16 on a RH-AS4. > > --Sundeep. I don't see this problem at all. I am using kernel 2.6.16.16, SLES 9 glibc version 2.3.3-98, gcc version 3.3.3 and AMSO1100 RNIC. Will running it under gdb be of some help ? Thanks Pradipta Kumar. > >> Thanx, >> >> >> Steve. >> >> >> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: >>> Hi Steve, >>> We are trying the new iwarp branch on ammasso adapters. The installation >>> has gone fine. However, on running rping there is a error during >>> disconnect phase. >>> >>> $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 >>> libibverbs: Warning: no userspace device-specific driver found for uverbs1 >>> driver search path: /usr/local/lib/infiniband >>> libibverbs: Warning: no userspace device-specific driver found for uverbs0 >>> driver search path: /usr/local/lib/infiniband >>> ping data: rdm >>> ping data: rdm >>> ping data: rdm >>> ping data: rdm >>> cq completion failed status 5 >>> DISCONNECT EVENT... >>> *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** >>> Aborted >>> >>> There are no apparent errors showing up in dmesg. Is this error >>> currently expected? >>> >>> Thanks, >>> --Sundeep. 
>>>

From jackm at mellanox.co.il Wed Jun 7 06:39:03 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Wed, 7 Jun 2006 16:39:03 +0300
Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS
In-Reply-To:
References:
Message-ID: <200606071639.03787.jackm@mellanox.co.il>

On Wednesday 07 June 2006 00:51, James Lentini wrote:
> On Mon, 5 Jun 2006, Arlin Davis wrote:
> > Here is a patch to the openib-cma provider that uses the new
> > set_option feature of the uCMA to adjust connect request timeout and
> > retry values.

After examining the patch (svn 7755), I noticed that it depends on changes to the kernel CMA module (svn 7742) which were checked in only last night (June 6), and which we did not see until this morning. These CMA changes were not included in today's OFED RC6 release. Therefore, this new feature (set_option to adjust timeout and retry values) will not be supported in the current OFED final release (next week). Possibly, it can be included in the next OFED release.

> > Also, included a fix to disallow any
> > event after a disconnect event.

This bug fix can still be included in next week's release if you think it is important (I have extracted it from the changes checked in at svn 7755).

- Jack

From ishai at mellanox.co.il Wed Jun 7 07:31:10 2006
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Wed, 7 Jun 2006 17:31:10 +0300
Subject: [openib-general] Re: SRP [PATCH 0/4] Kernel support for removal and restoration of target
In-Reply-To:
References: <20060605153213.GA7472@mellanox.co.il>
Message-ID: <20060607143110.GA7442@mellanox.co.il>

The idea is that the daemon will notice targets that leave the fabric (for a short time) and will activate remove_target. When the target returns to the fabric, the daemon will activate restore_target.
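The daemon flow described here can be sketched as follows; the hook names and the grace-period policy are placeholders standing in for whatever interface the patch set finally exposes, not the actual attributes:

```
/* Pseudocode sketch of the proposed SRP daemon loop. */
forever:
    for each known target t:
        if t not visible in fabric and t.attached:
            remove_target(t)          /* detach, keep the scsi_host */
            t.attached = false
            t.gone_since = now
        else if t visible in fabric and not t.attached:
            restore_target(t)         /* reattach to the same scsi_host */
            t.attached = true
        else if not t.attached and now - t.gone_since > GRACE_PERIOD:
            /* final removal policy: the open question in this thread */
            delete_scsi_host(t)
    sleep(poll_interval)
```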
This ensures that the scsi_host won't go offline (a state from which there is no return). I'm waiting for suggestions about the mechanism that will be responsible for removing the scsi_host when the target does not return to the fabric after a while. (See my previous mail for details.)

On Tue, Jun 06, 2006 at 03:11:46PM -0700, Roland Dreier wrote:
> I haven't read too deeply yet, but something that would help me
> understand the overall plan here would be an explanation of how one
> would use the restore_target function. Why would I want to disconnect
> from a target but keep the kernel's SCSI device hanging around?
>
> - R.

--
Ishai Rabinovitz

From tziporet at mellanox.co.il Wed Jun 7 07:40:05 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 07 Jun 2006 17:40:05 +0300
Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS
In-Reply-To:
References:
Message-ID: <4486E545.2030900@mellanox.co.il>

Scott Weitzenkamp (sweitzen) wrote:
> Tziporet is the gatekeeper (does that make me the keymaster? :-).
>
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>

Since we are in RC6 today, heading toward a release next week, we cannot take new CMA features that were implemented only this week. We plan another release pretty soon due to SDP problems, so we can include these changes in the next OFED release.

Jack already replied regarding the fix.

Tziporet

From tziporet at mellanox.co.il Wed Jun 7 07:58:32 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 7 Jun 2006 17:58:32 +0300
Subject: [openib-general] OFED-1.0-rc6 is available
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA715E@mtlexch01.mtl.com>

Hi All,

We have prepared OFED 1.0 RC6.

Release location:
https://openib.org/svn/gen2/branches/1.0/ofed/releases
File: OFED-1.0-rc6.tgz

Note: This release is the code-freeze release for OFED 1.0. Only showstopper bugs will be fixed.
BUILD_ID: OFED-1.0-rc6 openib-1.0 (REV=7772) # User space https://openib.org/svn/gen2/branches/1.0/src/userspace # Kernel space https://openib.org/svn/gen2/branches/1.0/ofed/tags/rc6/linux-kernel Git: ref: refs/heads/for-2.6.17 commit d9ec5ad24ce80b7ef69a0717363db661d13aada5 # MPI mpi_osu-0.9.7-mlx2.1.0.tgz openmpi-1.1b1-1.src.rpm mpitests-1.0-0.src.rpm OSes: * RH EL4 up2: 2.6.9-22.ELsmp * RH EL4 up3: 2.6.9-34.ELsmp * Fedora C4: 2.6.11-1.1369_FC4 * SLES10 RC2: 2.6.16.16-1.6-smp * SUSE 10 Pro: 2.6.13-15-smp * kernel.org: 2.6.16.x Systems: * x86_64 * x86 * ia64 * ppc64 Main changes from RC5: 1. SDP - libsdp implementation of RFC proposed by Eitan Zahavi; bug fixes in kernel module. See details below. 2. SRP - bug fixes 3. Open MPI - new package based on 1.1b1-1 4. OSU-MPI - See details below. 5. iSER: Enhanced to support SLES 10 RC1. 6. IPoIB default configuration changed: a. IPoIB configuration at install time is now optional. b. The default configuration of IPoIB interfaces (if performed at install time) is DHCP; it can be changed during interactive installation. c. For unattended installation one can give a new configuration file. See the example below. 7. Bug Fixes. Package limitations: 1. The ipath driver does not compile/load on most systems. To be fixed in final release. Meanwhile, one must work with custom build and not choose ipath driver, or change in the conf file: ib_ipath=n. I attached a reference ofed-no_ipath.conf file. Once Qlogic fixes the backport patches I will publish them on the release page so any one interested can use them with this release. 2. 
iSER is working on SuSE SLES 10 RC1 only

IPoIB configuration file example:

If you are going to install OFED on a 32-node cluster and want to use a static IPoIB configuration based on the Ethernet device configuration, follow the instructions below. Assume that the Ethernet IP addresses (eth0 interfaces) of the cluster are 10.0.0.1 - 10.0.0.32, and that you want to assign to ib0 IP addresses in the range 192.168.0.1 - 192.168.0.32 and to ib1 IP addresses in the range 172.16.0.1 - 172.16.0.32. Then create the file ofed_net.conf with the following lines:

LAN_INTERFACE_ib0=eth0
IPADDR_ib0=192.168.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=192.168.0.0
BROADCAST_ib0=192.168.255.255
ONBOOT_ib0=1
LAN_INTERFACE_ib1=eth0
IPADDR_ib1=172.16.'*'.'*'
NETMASK_ib1=255.255.0.0
NETWORK_ib1=172.16.0.0
BROADCAST_ib1=172.16.255.255
ONBOOT_ib1=1

Note: '*' will be replaced by the corresponding octet from the eth0 IP address.

Assuming that you already have an OFED configuration file (ofed.conf) with the selected packages (created by running OFED-1.0/install.sh), run:

./install.sh -c ofed.conf -net ofed_net.conf

OSU MPI:
* Added mpi_alltoall fine tuning parameters
* Added default configuration/documentation file $MPIHOME/etc/mvapich.conf
* Added shell configuration files $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh
* Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For InfiniHost cards the recommended value is: VIADEV_DEFAULT_MTU=MTU1024

SDP Details:

libsdp enhancements according to the RFC:
1. New config syntax (please see libsdp.conf)
2. With no config or empty config use SIMPLE_LIBSDP mode
3. Support listening on both tcp and sdp
4. Support trying both connections (first SDP then TCP)
5. Support IPv4 embedded in IPv6 (also convert back address)
6. Comprehensive verbosity logging
7. BNF based config parser

Current SDP limitations:
* SDP currently does not support sending/receiving out-of-band data (MSG_OOB).
* Generally, SDP supports only SOL_SOCKET socket options.
* The following options can be set but actual support is missing:
  o SO_KEEPALIVE - no keepalives are sent
  o SO_OOBINLINE - out-of-band data is not supported
* SDP currently supports setting the following SOL_TCP socket options:
  o TCP_NODELAY, TCP_CORK - but actual support for these options is still missing
* SDP currently does not handle Zcopy mode messages correctly and does not set MaxAdverts properly in HH/HAH messages.

OFED components tested by Mellanox:
* Verbs over mthca
* IPoIB
* OpenSM
* OSU-MPI
* SRP
* SDP
* IB administration utils (ibutils)

Please send us any issues you encounter and/or test results.

Thanks,
Tziporet & Vlad

Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-no_ipath.conf
Type: application/octet-stream
Size: 646 bytes
Desc: ofed-no_ipath.conf
URL:

From jlentini at netapp.com Wed Jun 7 08:26:35 2006
From: jlentini at netapp.com (James Lentini)
Date: Wed, 7 Jun 2006 11:26:35 -0400 (EDT)
Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS
In-Reply-To: <200606071639.03787.jackm@mellanox.co.il>
References: <200606071639.03787.jackm@mellanox.co.il>
Message-ID:

On Wed, 7 Jun 2006, Jack Morgenstein wrote:
> > > Also, included a fix to disallow any
> > > event after a disconnect event.
>
> This (bug fix) can still be included in next-week's release, if you
> think it is important (I have extracted it from the changes checked
> in at svn 7755)

If you are going to make another release anyway, then I would include it.

From swise at opengridcomputing.com Wed Jun 7 08:56:44 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 07 Jun 2006 10:56:44 -0500
Subject: [openib-general] Re: [PATCH 2/7] AMSO1100 Low Level Driver.
In-Reply-To: <20060531115906.30f4bbda@localhost.localdomain> References: <20060531182733.3652.54755.stgit@stevo-desktop> <20060531182737.3652.24752.stgit@stevo-desktop> <20060531115906.30f4bbda@localhost.localdomain> Message-ID: <1149695804.27684.42.camel@stevo-desktop> On Wed, 2006-05-31 at 11:59 -0700, Stephen Hemminger wrote: > The following should be replaced with BUG_ON() or WARN_ON(). > and pr_debug() > > +#ifdef C2_DEBUG > +#define assert(expr) \ > + if(!(expr)) { \ > + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n",\ > + #expr, __FILE__, __FUNCTION__, __LINE__); \ > + } > +#define dprintk(fmt, args...) do {printk(KERN_INFO PFX fmt, ##args);} while (0) > +#else > +#define assert(expr) do {} while (0) > +#define dprintk(fmt, args...) do {} while (0) > +#endif /* C2_DEBUG */ > > -------------------- > Also, you tend to use assert() as a bogus NULL pointer check. > If you get passed a NULL, it is a bug, and the deref will fail > and cause a pretty stack dump... > done. > > +static void c2_set_rxbufsize(struct c2_port *c2_port) > +{ > + struct net_device *netdev = c2_port->netdev; > + > + assert(netdev != NULL); > > Bogus, you will just fail on the deref below > done. 
> + > + if (netdev->mtu > RX_BUF_SIZE) > + c2_port->rx_buf_size = > + netdev->mtu + ETH_HLEN + sizeof(struct c2_rxp_hdr) + > + NET_IP_ALIGN; > + else > + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; > +} > > > +static void c2_rx_interrupt(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_ring *rx_ring = &c2_port->rx_ring; > + struct c2_element *elem; > + struct c2_rx_desc *rx_desc; > + struct c2_rxp_hdr *rxp_hdr; > + struct sk_buff *skb; > + dma_addr_t mapaddr; > + u32 maplen, buflen; > + unsigned long flags; > + > + spin_lock_irqsave(&c2dev->lock, flags); > + > + /* Begin where we left off */ > + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; > + > + for (elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; > + elem = elem->next) { > + rx_desc = elem->ht_desc; > + mapaddr = elem->mapaddr; > + maplen = elem->maplen; > + skb = elem->skb; > + rxp_hdr = (struct c2_rxp_hdr *) skb->data; > + > + if (rxp_hdr->flags != RXP_HRXD_DONE) > + break; > + buflen = rxp_hdr->len; > + > + /* Sanity check the RXP header */ > + if (rxp_hdr->status != RXP_HRXD_OK || > + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { > + c2_rx_error(c2_port, elem); > + continue; > + } > + > + /* > + * Allocate and map a new skb for replenishing the host > + * RX desc > + */ > + if (c2_rx_alloc(c2_port, elem)) { > + c2_rx_error(c2_port, elem); > + continue; > + } > + > + /* Unmap the old skb */ > + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, > + PCI_DMA_FROMDEVICE); > + > > prefetch(skb->data) here will help performance. > > good. ok. > + /* > + * Skip past the leading 8 bytes comprising of the > + * "struct c2_rxp_hdr", prepended by the adapter > + * to the usual Ethernet header ("struct ethhdr"), > + * to the start of the raw Ethernet packet. > + * > + * Fix up the various fields in the sk_buff before > + * passing it up to netif_rx(). 
The transfer size > + * (in bytes) specified by the adapter len field of > + * the "struct rxp_hdr_t" does NOT include the > + * "sizeof(struct c2_rxp_hdr)". > + */ > + skb->data += sizeof(*rxp_hdr); > + skb->tail = skb->data + buflen; > + skb->len = buflen; > + skb->dev = netdev; > + skb->protocol = eth_type_trans(skb, netdev); > + > + /* Drop arp requests to the pseudo nic ip addr */ > + if (unlikely(ntohs(skb->protocol) == ETH_P_ARP)) { > + u8 *tpa; > + > + /* pull out the tgt ip addr */ > + tpa = skb->data /* beginning of the arp packet */ > + + 8 /* arp addr fmts, lens, and opcode */ > + + 6 /* arp src hw addr */ > + + 4 /* arp src proto addr */ > + + 6; /* arp tgt hw addr */ > + if (is_rnic_addr(c2dev->pseudo_netdev, *((u32 *)tpa))) { > + dprintk("Dropping arp req for" > + " %03d.%03d.%03d.%03d\n", > + tpa[0], tpa[1], tpa[2], tpa[3]); > + kfree_skb(skb); > + continue; > + } > + } > > This is looks like a mess, please do it at a higher level or > code it with proper structure headers > This code can be removed entirely. It can be avoided having the c2 driver set in_dev->cnf.arp_ignore to 1 when loaded. > + > + netif_rx(skb); > + > + netdev->last_rx = jiffies; > + c2_port->netstats.rx_packets++; > + c2_port->netstats.rx_bytes += buflen; > + } > + > + /* Save where we left off */ > + rx_ring->to_clean = elem; > + c2dev->cur_rx = elem - rx_ring->start; > + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); > + > + spin_unlock_irqrestore(&c2dev->lock, flags); > +} > + > +/* > + * Handle netisr0 TX & RX interrupts. > + */ > +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) > +{ > + unsigned int netisr0, dmaisr; > + int handled = 0; > + struct c2_dev *c2dev = (struct c2_dev *) dev_id; > + > + assert(c2dev != NULL); > + > + /* Process CCILNET interrupts */ > + netisr0 = readl(c2dev->regs + C2_NISR0); > + if (netisr0) { > + > + /* > + * There is an issue with the firmware that always > + * provides the status of RX for both TX & RX > + * interrupts. 
So process both queues here. > + */ > + c2_rx_interrupt(c2dev->netdev); > + c2_tx_interrupt(c2dev->netdev); > + > + /* Clear the interrupt */ > + writel(netisr0, c2dev->regs + C2_NISR0); > + handled++; > + } > + > + /* Process RNIC interrupts */ > + dmaisr = readl(c2dev->regs + C2_DISR); > + if (dmaisr) { > + writel(dmaisr, c2dev->regs + C2_DISR); > + c2_rnic_interrupt(c2dev); > + handled++; > + } > + > + if (handled) { > + return IRQ_HANDLED; > + } else { > + return IRQ_NONE; > + } > > return IRQ_RETVAL(handled); > +} > + > +static int c2_up(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_element *elem; > + struct c2_rxp_hdr *rxp_hdr; > + size_t rx_size, tx_size; > + int ret, i; > + unsigned int netimr0; > + > + assert(c2dev != NULL); > > More bogus asserts > removed. > +static struct net_device_stats *c2_get_stats(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + > + return &c2_port->netstats; > +} > + > +static int c2_set_mac_address(struct net_device *netdev, void *p) > +{ > + return -1; > +} > > If you don't handle changing mac_address, just leaveing > dev->set_mac_address will do the right thing. > Also, if you need to return an error, use -ESOMEERROR, not -1. > I'll remove c2_set_mac_address() entirely. > This seems like log spam, or developer debug thing. > You need to learn to watch netlink event's from user space. > Yes, the entire block below will be removed. It's not needed. 
> > + > +#ifdef NETEVENT_NOTIFIER > +static int netevent_notifier(struct notifier_block *self, unsigned long event, > + void *data) > +{ > + int i; > + u8 *ha; > + struct neighbour *neigh = data; > + struct netevent_redirect *redir = data; > + struct netevent_route_change *rev = data; > + > + switch (event) { > + case NETEVENT_ROUTE_UPDATE: > + printk(KERN_ERR "NETEVENT_ROUTE_UPDATE:\n"); > + printk(KERN_ERR "fib_flags : %d\n", > + rev->fib_info->fib_flags); > + printk(KERN_ERR "fib_protocol : %d\n", > + rev->fib_info->fib_protocol); > + printk(KERN_ERR "fib_prefsrc : %08x\n", > + rev->fib_info->fib_prefsrc); > + printk(KERN_ERR "fib_priority : %d\n", > + rev->fib_info->fib_priority); > + break; > + > + case NETEVENT_NEIGH_UPDATE: > + printk(KERN_ERR "NETEVENT_NEIGH_UPDATE:\n"); > + printk(KERN_ERR "nud_state : %d\n", neigh->nud_state); > + printk(KERN_ERR "refcnt : %d\n", neigh->refcnt); > + printk(KERN_ERR "used : %d\n", neigh->used); > + printk(KERN_ERR "confirmed : %d\n", neigh->confirmed); > + printk(KERN_ERR " ha: "); > + for (i = 0; i < neigh->dev->addr_len; i += 4) { > + ha = &neigh->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], > + ha[3]); > + } > + printk("\n"); > + > + printk(KERN_ERR "%8s: ", neigh->dev->name); > + for (i = 0; i < neigh->dev->addr_len; i += 4) { > + ha = &neigh->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], > + ha[3]); > + } > + printk("\n"); > + break; > + > + case NETEVENT_REDIRECT: > + printk(KERN_ERR "NETEVENT_REDIRECT:\n"); > + printk(KERN_ERR "old: "); > + for (i = 0; i < redir->old->neighbour->dev->addr_len; i += 4) { > + ha = &redir->old->neighbour->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], > + ha[3]); > + } > + printk("\n"); > + > + printk(KERN_ERR "new: "); > + for (i = 0; i < redir->new->neighbour->dev->addr_len; i += 4) { > + ha = &redir->new->neighbour->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], > + ha[3]); > + } > + printk("\n"); > + break; > + > + 
default: > + printk(KERN_ERR "NETEVENT_WTFO:\n"); > + } > + > + return NOTIFY_DONE; > +} > + > +static struct notifier_block nb = { > + .notifier_call = netevent_notifier, > +}; > +#endif > +/* Thanks, Steve. From rdreier at cisco.com Wed Jun 7 09:00:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jun 2006 09:00:37 -0700 Subject: [openib-general] crash in ib_sa_mcmember_rec_callback while probing out ib_sa In-Reply-To: (Or Gerlitz's message of "Wed, 7 Jun 2006 12:52:22 +0300 (IDT)") References: Message-ID: Looks like the same crash mst saw related to the multicast module being unloaded and then having sa call back into it. One small clue: > esi: f38a5bec edi: f38a5bf4 ebp: fffffffc esp: f599be60 ebp is -4, which is -EINTR. So this may be a callback from sa_query's send_handler() caused by a IB_WC_WR_FLUSH_ERR status. - R. From mshefty at ichips.intel.com Wed Jun 7 09:52:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 09:52:45 -0700 Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: <200606071639.03787.jackm@mellanox.co.il> References: <200606071639.03787.jackm@mellanox.co.il> Message-ID: <4487045D.9000405@ichips.intel.com> Jack Morgenstein wrote: > After examining the patch (svn 7755), I noticed that it depends on changes to > the kernel CMA module (SVN 7742) which were checked only last night (June 6) > (and which we did not see until this morning). > These CMA changes were not included in today's OFED RC6 release. Therefore, > this new feature (set_option to adjust timeout and retry values) will not be > supported in the current OFED final release (next week). Possibly, it can be > included in the next OFED release. The changes were added as a solution to a scale up issue seen specifically by Intel MPI. 
- Sean From rkuchimanchi at silverstorm.com Wed Jun 7 10:28:05 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Wed, 07 Jun 2006 22:58:05 +0530 Subject: [openib-general] OFED-1.0-rc6 is available In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA715E@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA715E@mtlexch01.mtl.com> Message-ID: <44870CA5.3080406@silverstorm.com> Tziporet Koren wrote: > Hi All, > > We have prepared OFED 1.0 RC6. > From the openib source tar ball in OFED RC6, it looks like the SRP kernel changes (ulp/srp/ib_srp.c) in the trunk for supporting Rev 10 targets have been included in RC6, but the corresponding changes to the userspace srptool--ibsrpdm (userspace/srptools/src/srp-dm.c) for displaying the IO class of the target have not been made part of RC6. The changes to ibsrpdm were committed to the SVN repository trunk in revision number 7758. Will the latest version of ibsrpdm make it to the next OFED release ? Regards, Ram From mshefty at ichips.intel.com Wed Jun 7 10:41:59 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 10:41:59 -0700 Subject: [openib-general] Re: crash in ib_sa_mcmember_rec_callback while probing out ib_sa In-Reply-To: References: Message-ID: <44870FE7.5090808@ichips.intel.com> I will look into this. - Sean From xma at us.ibm.com Wed Jun 7 10:48:11 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 7 Jun 2006 10:48:11 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: Message-ID: Roland, We have seen several skb panic under heavy stress 48 hour test. I wonder whether there are duplicated or corrupted cookies received from device driver to reuse skb buff, since skb buff ring is indexed by wr_id. Is that possible? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mshefty at ichips.intel.com Wed Jun 7 11:25:08 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 11:25:08 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <447DD2E4.3030709@ichips.intel.com> References: <1149024804.4510.1056.camel@hal.voltaire.com> <20060531090817.GQ21266@mellanox.co.il> <447DC8F8.60409@ichips.intel.com> <1149095100.4510.29902.camel@hal.voltaire.com> <447DD2E4.3030709@ichips.intel.com> Message-ID: <44871A04.9010705@ichips.intel.com> Sean Hefty wrote: > The multicast module should work in this specific case, since the only > client is ipoib, and ipoib first leaves the group before re-joining. I think that there's a race here. If ipoib leaves, then re-joins quickly enough, the join request will be processed before the leave. The result is that the join will be fulfilled locally, without an additional MAD sent. (Trying to process the leave immediately doesn't fix the problem in the generic case, where there may be multiple users of a group.) A temporary fix would be to always send a MAD, even if the join can be fulfilled locally. But I'm looking at having the multicast module re-join on an event. This raises the possibility that the new join request may fail, which would require the multicast module to report that a membership is no longer active. Another problem is if some nodes are joined as NonMembers or SendOnlyNonMembers, then the SM will not create the multicast group when they try to re-join. This leads to a race where NonMembers and SendOnlyNonMembers will fail to re-join until one of the FullMembers joins. 
- Sean From rdreier at cisco.com Wed Jun 7 11:30:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jun 2006 11:30:40 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: (Shirley Ma's message of "Wed, 7 Jun 2006 10:48:11 -0700") References: Message-ID: Shirley> Roland, We have seen several skb panic under heavy stress Shirley> 48 hour test. I wonder whether there are duplicated or Shirley> corrupted cookies received from device driver to reuse Shirley> skb buff, since skb buff ring is indexed by wr_id. Is Shirley> that possible? It's possible, but I would say it's quite unlikely with mthca. With ehca I have no sense of how bug-free the driver is. Can you post a recipe to reproduce the crash? - R. From xma at us.ibm.com Wed Jun 7 11:48:34 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 7 Jun 2006 11:48:34 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: Message-ID: Roland, >Can you post a recipe to reproduce the crash? It happened on 32 nodes cluster (each node has 8 dual core cpus) running IBM applications over IPoIB. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Wed Jun 7 11:50:16 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Jun 2006 14:50:16 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <44871A04.9010705@ichips.intel.com> References: <1149024804.4510.1056.camel@hal.voltaire.com> <20060531090817.GQ21266@mellanox.co.il> <447DC8F8.60409@ichips.intel.com> <1149095100.4510.29902.camel@hal.voltaire.com> <447DD2E4.3030709@ichips.intel.com> <44871A04.9010705@ichips.intel.com> Message-ID: <1149706206.4510.292005.camel@hal.voltaire.com> On Wed, 2006-06-07 at 14:25, Sean Hefty wrote: > Sean Hefty wrote: > > The multicast module should work in this specific case, since the only > > client is ipoib, and ipoib first leaves the group before re-joining. > > I think that there's a race here. If ipoib leaves, then re-joins quickly > enough, the join request will be processed before the leave. The order of joins and leaves is important in terms of the SA. > The result is that > the join will be fulfilled locally, without an additional MAD sent. (Trying to > process the leave immediately doesn't fix the problem in the generic case, where > there may be multiple users of a group.) > > A temporary fix would be to always send a MAD, even if the join can be fulfilled > locally. But I'm looking at having the multicast module re-join on an event. > This raises the possibility that the new join request may fail, which would > require the multicast module to report that a membership is no longer active. A similar (as yet unresolved) problem exists with the SA if the topology changes and the previous group/members can no longer be satisfied. > Another problem is if some nodes are joined as NonMembers or SendOnlyNonMembers, > then the SM will not create the multicast group when they try to re-join. The same is true for FullMembers when there is insufficient components to create the group. 
In these cases, the group must either be precreated or the creator of the group must talk to the SA "first". > This > leads to a race where NonMembers and SendOnlyNonMembers will fail to re-join > until one of the FullMembers joins. Might also be true with joins (not creates) from FullMembers too. I would presume in such cases, the join would be retried. SendOnlyMembers (at least for IPoIB) do this if not joined every time a packet is sent. -- Hal > - Sean From narravul at cse.ohio-state.edu Wed Jun 7 11:49:51 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Wed, 7 Jun 2006 14:49:51 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <200606051538.35084.faulkner@opengridcomputing.com> Message-ID: > You will also get this warning on the latest CM if you have not updated the > library to use ibv_driver_init vs. openib_driver_init. This drop for libamso > happened last Friday, Jun 2. Check and see if you have that. This is the svn version I used for the test. (Looks like I have the changes from Jun 2.) $ svn info URL: https://openib.org/svn/gen2/branches/iwarp Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd Revision: 7668 Node Kind: directory Schedule: normal Last Changed Author: swise Last Changed Rev: 7638 Last Changed Date: 2006-06-02 17:13:02 -0400 (Fri, 02 Jun 2006) --Sundeep. > > > > > I'm guessing the glibc error is finding some rping bug. Maybe you have > > a later version of libc than my suse 9.2 distro? > > > > > > Stevo. > > -- > Boyd R. Faulkner > Open Grid Computing, Inc. > Phone: 512-343-9196 x109 > Fax: 512-343-5450 > From narravul at cse.ohio-state.edu Wed Jun 7 11:55:00 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Wed, 7 Jun 2006 14:55:00 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <4486CCF5.4050902@in.ibm.com> Message-ID: Hi, > I don't see this problem at all. 
I am using kernel 2.6.16.16, SLES 9 glibc > version 2.3.3-98, gcc version 3.3.3 and AMSO1100 RNIC. The versions I used are glibc 2.3.4, kernel 2.6.16 and gcc 3.4.3 and AMSO1100 RNIC. > Will running it under gdb be of some help ? I am able to reproduce this error with/without gdb. The glibc error disappears with higher number of iterations. (gdb) r -c -vV -C10 -S10 -a 150.111.111.100 -p 9999 Starting program: /usr/local/bin/rping -c -vV -C10 -S10 -a 150.111.111.100 -p 9999 Reading symbols from shared object read from target memory...done. Loaded system supplied DSO at 0xffffe000 [Thread debugging using libthread_db enabled] [New Thread -1208465728 (LWP 23960)] libibverbs: Warning: no userspace device-specific driver found for uverbs1 driver search path: /usr/local/lib/infiniband libibverbs: Warning: no userspace device-specific driver found for uverbs0 driver search path: /usr/local/lib/infiniband [New Thread -1208468560 (LWP 23963)] [New Thread -1216861264 (LWP 23964)] ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping cq completion failed status 5 DISCONNECT EVENT... *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** Program received signal SIGABRT, Aborted. [Switching to Thread -1208465728 (LWP 23960)] 0xffffe410 in __kernel_vsyscall () (gdb) --Sundeep. > > Thanks > Pradipta Kumar. > > > >> Thanx, > >> > >> > >> Steve. > >> > >> > >> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > >>> Hi Steve, > >>> We are trying the new iwarp branch on ammasso adapters. The installation > >>> has gone fine. However, on running rping there is a error during > >>> disconnect phase. 
> >>> > >>> $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > >>> libibverbs: Warning: no userspace device-specific driver found for uverbs1 > >>> driver search path: /usr/local/lib/infiniband > >>> libibverbs: Warning: no userspace device-specific driver found for uverbs0 > >>> driver search path: /usr/local/lib/infiniband > >>> ping data: rdm > >>> ping data: rdm > >>> ping data: rdm > >>> ping data: rdm > >>> cq completion failed status 5 > >>> DISCONNECT EVENT... > >>> *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > >>> Aborted > >>> > >>> There are no apparent errors showing up in dmesg. Is this error > >>> currently expected? > >>> > >>> Thanks, > >>> --Sundeep. > >>> > From arlin.r.davis at intel.com Wed Jun 7 12:11:19 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 7 Jun 2006 12:11:19 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: Scott, Can you take a look and see if rdma_cm and rdma_ucm modules are being loaded? I noticed on my latest OFED RC5 install that I had to start them manually. -arlin >-----Original Message----- >From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] >Sent: Tuesday, June 06, 2006 5:08 PM >To: Arlin Davis; Scott Weitzenkamp (sweitzen) >Cc: Davis, Arlin R; Lentini, James; openib-general >Subject: RE: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS > > >> this looks like a configuration issue and not the timeout. The CR >> timeouts occured with >> the rdma device and not the rdssm. Is IPoIB running on the ib0 >> interfaces across the >> fabric? > >Yes, IPoIB is running. > >Scott From sweitzen at cisco.com Wed Jun 7 12:57:13 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 7 Jun 2006 12:57:13 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: Yes, the modules were loaded. 
Each of the 32 hosts had 3 IB ports up. Does Intel MPI or uDAPL use multiple ports and/or multiple HCAs? I shut down all but one port on each host, and now Pallas is running better on the 32 nodes using Intel MPI 2.0.1. HP MPI 2.2 started working too with Pallas too over uDAPL, so maybe this is a uDAPL issue? I need to repeat the tests to make sure this isn't a fluke. Thanks for your help so far. Scott > -----Original Message----- > From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] > Sent: Wednesday, June 07, 2006 12:11 PM > To: Scott Weitzenkamp (sweitzen); Arlin Davis > Cc: Lentini, James; openib-general > Subject: RE: [openib-general] [PATCH] uDAPL openib-cma > provider - add support for IB_CM_REQ_OPTIONS > > Scott, > > Can you take a look and see if rdma_cm and rdma_ucm modules are being > loaded? > > I noticed on my latest OFED RC5 install that I had to start them > manually. > > -arlin > > >-----Original Message----- > >From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > >Sent: Tuesday, June 06, 2006 5:08 PM > >To: Arlin Davis; Scott Weitzenkamp (sweitzen) > >Cc: Davis, Arlin R; Lentini, James; openib-general > >Subject: RE: [openib-general] [PATCH] uDAPL openib-cma provider - add > support for IB_CM_REQ_OPTIONS > > > > > >> this looks like a configuration issue and not the timeout. The CR > >> timeouts occured with > >> the rdma device and not the rdssm. Is IPoIB running on the ib0 > >> interfaces across the > >> fabric? > > > >Yes, IPoIB is running. 
> > > >Scott > From rdreier at cisco.com Wed Jun 7 13:03:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jun 2006 13:03:06 -0700 Subject: [openib-general] OFED-1.0-rc6 is available In-Reply-To: <44870CA5.3080406@silverstorm.com> (Ramachandra K.'s message of "Wed, 07 Jun 2006 22:58:05 +0530") References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA715E@mtlexch01.mtl.com> <44870CA5.3080406@silverstorm.com> Message-ID: We also just found a bug in how ibsrpdm discovers Cisco/Topspin FC gateways. The patch is below, and is also checked in to the trunk as svn rev 7803. Please include this in OFED 1.0 final. Thanks, Roland --- srptools/ChangeLog (revision 7796) +++ srptools/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2006-06-07 Roland Dreier + * src/srp-dm.c (do_port): Use correct endianness when comparing + GUID against Topspin OUI. + + * src/srp-dm.c (set_class_port_info): Trivial whitespace fixes. + 2006-05-29 Ishai Rabinovitz * src/srp-dm.c (main): The agent ID array is declared with 0 --- srptools/src/srp-dm.c (revision 7796) +++ srptools/src/srp-dm.c (working copy) @@ -52,8 +52,6 @@ #include "ib_user_mad.h" #include "srp-dm.h" -static const uint8_t topspin_oui[3] = { 0x00, 0x05, 0xad }; - static char *umad_dev = "/dev/infiniband/umad0"; static char *port_sysfs_path; static int timeout_ms = 25000; @@ -249,7 +247,7 @@ static int set_class_port_info(int fd, u init_srp_dm_mad(&out_mad, agent[1], dlid, SRP_DM_ATTR_CLASS_PORT_INFO, 0); - out_dm_mad = (void *) out_mad.data; + out_dm_mad = (void *) out_mad.data; out_dm_mad->method = SRP_DM_METHOD_SET; cpi = (void *) out_dm_mad->data; @@ -266,9 +264,8 @@ static int set_class_port_info(int fd, u return -1; } - for (i = 0; i < 8; ++i) { + for (i = 0; i < 8; ++i) ((uint16_t *) cpi->trap_gid)[i] = htons(strtol(val + i * 5, NULL, 16)); - } if (send_and_get(fd, &out_mad, &in_mad, 0) < 0) return -1; @@ -371,7 +368,10 @@ static int do_port(int fd, uint32_t agen struct srp_dm_svc_entries svc_entries; int i, j, k; - if 
(!memcmp(&guid, topspin_oui, 3) && + static const uint64_t topspin_oui = 0x0005ad0000000000ull; + static const uint64_t oui_mask = 0xffffff0000000000ull; + + if ((guid & oui_mask) == topspin_oui && set_class_port_info(fd, agent, dlid)) fprintf(stderr, "Warning: set of ClassPortInfo failed\n"); From swise at opengridcomputing.com Wed Jun 7 13:06:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:00 -0500 Subject: [openib-general] [PATCH v2 0/2][RFC] iWARP Core Support Message-ID: <20060607200600.9003.56328.stgit@stevo-desktop> This patchset defines the modifications to the Linux infiniband subsystem to support iWARP devices. We're submitting it for review now with the goal of inclusion in the 2.6.19 kernel. This code has gone through several reviews in the openib-general list. Now we are submitting it for external review by the Linux community. This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.18 branch. The patchset consists of 2 patches: 1 - New iWARP CM implementation. 2 - Core changes to support iWARP. I believe I've addressed all the round 1 review comments. Details of the changes are tracked in each patch comment. Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Wed Jun 7 13:06:05 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:05 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <20060607200600.9003.56328.stgit@stevo-desktop> References: <20060607200600.9003.56328.stgit@stevo-desktop> Message-ID: <20060607200605.9003.25830.stgit@stevo-desktop> This patch provides the new files implementing the iWARP Connection Manager. Review Changes: - sizeof -> sizeof() - removed printks - removed TT debug code - cleaned up lock/unlock around switch statements. - waitqueue -> completion for destroy path. 
--- drivers/infiniband/core/iwcm.c | 877 ++++++++++++++++++++++++++++++++++++++++ include/rdma/iw_cm.h | 254 ++++++++++++ include/rdma/iw_cm_private.h | 62 +++ 3 files changed, 1193 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c new file mode 100644 index 0000000..994bc79 --- /dev/null +++ b/drivers/infiniband/core/iwcm.c @@ -0,0 +1,877 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("iWARP CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static struct workqueue_struct *iwcm_wq; +struct iwcm_work { + struct work_struct work; + struct iwcm_id_private *cm_id; + struct list_head list; + struct iw_cm_event event; +}; + +/* + * Release a reference on cm_id. If the last reference is being removed + * and iw_destroy_cm_id is waiting, wake up the waiting thread. + */ +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + int ret = 0; + + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (atomic_dec_and_test(&cm_id_priv->refcount)) { + BUG_ON(!list_empty(&cm_id_priv->work_list)); + if (waitqueue_active(&cm_id_priv->destroy_comp.wait)) { + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, + &cm_id_priv->flags)); + ret = 1; + } + complete(&cm_id_priv->destroy_comp); + } + + return ret; +} + +static void add_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + atomic_inc(&cm_id_priv->refcount); +} + +static void rem_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + iwcm_deref_id(cm_id_priv); +} + +static void cm_event_handler(struct iw_cm_id *cm_id, struct iw_cm_event *event); + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = kzalloc(sizeof(*cm_id_priv), GFP_KERNEL); + if 
(!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->state = IW_CM_STATE_IDLE; + cm_id_priv->id.device = device; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + cm_id_priv->id.event_handler = cm_event_handler; + cm_id_priv->id.add_ref = add_ref; + cm_id_priv->id.rem_ref = rem_ref; + spin_lock_init(&cm_id_priv->lock); + atomic_set(&cm_id_priv->refcount, 1); + init_waitqueue_head(&cm_id_priv->connect_wait); + init_completion(&cm_id_priv->destroy_comp); + INIT_LIST_HEAD(&cm_id_priv->work_list); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(iw_create_cm_id); + + +static int iwcm_modify_qp_err(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + if (!qp) + return -EINVAL; + + qp_attr.qp_state = IB_QPS_ERR; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * This is really the RDMAC CLOSING state. It is most similar to the + * IB SQD QP state. + */ +static int iwcm_modify_qp_sqd(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + BUG_ON(qp == NULL); + qp_attr.qp_state = IB_QPS_SQD; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * CM_ID <-- CLOSING + * + * Block if a passive or active connection is currently being processed. 
Then + * process the event as follows: + * - If we are ESTABLISHED, move to CLOSING and modify the QP state + * based on the abrupt flag + * - If the connection is already in the CLOSING or IDLE state, the peer is + * disconnecting concurrently with us and we've already seen the + * DISCONNECT event -- ignore the request and return 0 + * - Disconnect on a listening endpoint returns -EINVAL + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + struct ib_qp *qp = NULL; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_CLOSING; + + /* QP could be for user-mode client */ + if (cm_id_priv->qp) + qp = cm_id_priv->qp; + else + ret = -EINVAL; + break; + case IW_CM_STATE_LISTEN: + ret = -EINVAL; + break; + case IW_CM_STATE_CLOSING: + /* remote peer closed first */ + case IW_CM_STATE_IDLE: + /* accept or connect returned !0 */ + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called disconnect before/without calling accept after + * connect_request event delivered. + */ + break; + case IW_CM_STATE_CONN_SENT: + /* Can only get here if wait above fails */ + default: + BUG_ON(1); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + if (qp) { + if (abrupt) + ret = iwcm_modify_qp_err(qp); + else + ret = iwcm_modify_qp_sqd(qp); + + /* + * If both sides are disconnecting the QP could + * already be in ERR or SQD states + */ + ret = 0; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); + +/* + * CM_ID <-- DESTROYING + * + * Clean up all resources associated with the connection and release + * the initial reference taken by iw_create_cm_id. 
+ */ +static void destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall. A + * listening endpoint should never block here. */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_LISTEN: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* destroy the listening endpoint */ + ret = cm_id->device->iwcm->destroy_listen(cm_id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* Abrupt close of the connection */ + (void)iwcm_modify_qp_err(cm_id_priv->qp); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called destroy before/without calling accept after + * receiving connection request event notification. + */ + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_DESTROYING: + default: + BUG_ON(1); + break; + } + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + (void)iwcm_deref_id(cm_id_priv); +} + +/* + * This function is only called by the application thread and cannot + * be called by the event thread. The function will wait for all + * references to be released on the cm_id and then kfree the cm_id + * object. 
+ */ +void iw_destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); + + destroy_cm_id(cm_id); + + wait_for_completion(&cm_id_priv->destroy_comp); + + kfree(cm_id_priv); +} +EXPORT_SYMBOL(iw_destroy_cm_id); + +/* + * CM_ID <-- LISTEN + * + * Start listening for connect requests. Generates one CONNECT_REQUEST + * event for each inbound connect request. + */ +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + cm_id_priv->state = IW_CM_STATE_LISTEN; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); + if (ret) + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + default: + ret = -EINVAL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} +EXPORT_SYMBOL(iw_cm_listen); + +/* + * CM_ID <-- IDLE + * + * Rejects an inbound connection request. No events are generated. 
+ */ +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->reject(cm_id, private_data, + private_data_len); + + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} +EXPORT_SYMBOL(iw_cm_reject); + +/* + * CM_ID <-- ESTABLISHED + * + * Accepts an inbound connection request and generates an ESTABLISHED + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block + * until the ESTABLISHED event is received from the provider. 
+ */ +int iw_cm_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + struct ib_qp *qp; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->accept(cm_id, iw_param); + if (ret) { + /* An error on accept precludes provider events */ + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + printk("Accept failed, ret=%d\n", ret); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_accept); + +/* + * Active Side: CM_ID <-- CONN_SENT + * + * If successful, results in the generation of a CONNECT_REPLY + * event. iw_cm_disconnect and iw_cm_destroy will block until the + * CONNECT_REPLY event is received from the provider. 
+ */ +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + int ret = 0; + unsigned long flags; + struct ib_qp *qp; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + cm_id_priv->state = IW_CM_STATE_CONN_SENT; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->connect(cm_id, iw_param); + if (ret) { + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + cm_id_priv->state = IW_CM_STATE_IDLE; + printk("Connect failed, ret=%d\n", ret); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_connect); + +/* + * Passive Side: new CM_ID <-- CONN_RECV + * + * Handles an inbound connect request. The function creates a new + * iw_cm_id to represent the new connection and inherits the client + * callback function and other attributes from the listening parent. + * + * The work item contains a pointer to the listen_cm_id and the event. The + * listen_cm_id contains the client cm_handler, context and + * device. These are copied when the device is cloned. The event + * contains the new four tuple. 
+ * + * An error on the child should not affect the parent, so this + * function does not return a value. + */ +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + struct iw_cm_id *cm_id; + struct iwcm_id_private *cm_id_priv; + int ret; + + /* The provider should never generate a connection request + * event with a bad status. + */ + BUG_ON(iw_event->status); + + /* We could be destroying the listening id. If so, ignore this + * upcall. */ + spin_lock_irqsave(&listen_id_priv->lock, flags); + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + return; + } + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + + cm_id = iw_create_cm_id(listen_id_priv->id.device, + listen_id_priv->id.cm_handler, + listen_id_priv->id.context); + /* If the cm_id could not be created, ignore the request */ + if (IS_ERR(cm_id)) + return; + + cm_id->provider_data = iw_event->provider_data; + cm_id->local_addr = iw_event->local_addr; + cm_id->remote_addr = iw_event->remote_addr; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + cm_id_priv->state = IW_CM_STATE_CONN_RECV; + + /* Call the client CM handler */ + ret = cm_id->cm_handler(cm_id, iw_event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(cm_id); + if (atomic_read(&cm_id_priv->refcount)==0) + kfree(cm_id); + } +} + +/* + * Passive Side: CM_ID <-- ESTABLISHED + * + * The provider generated an ESTABLISHED event which means that + * the MPA negotiation has completed successfully and we are now in MPA + * FPDU mode. + * + * This event can only be received in the CONN_RECV state. If the + * remote peer closed, the ESTABLISHED event would be received followed + * by the CLOSE event. If the app closes, it will block until we wake + * it up after processing this event. 
+ */ +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + + /* We clear the CONNECT_WAIT bit here to allow the callback + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id + * from a callback handler is not allowed */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * Active Side: CM_ID <-- ESTABLISHED + * + * The app has called connect and is waiting for the established event to + * post its requests to the server. This event will wake up anyone + * blocked in iw_cm_disconnect or iw_destroy_id. + */ +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + /* Clear the connect wait bit so a callback function calling + * iw_cm_disconnect will not wait and deadlock this thread */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { + cm_id_priv->id.local_addr = iw_event->local_addr; + cm_id_priv->id.remote_addr = iw_event->remote_addr; + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + } else { + /* REJECTED or RESET */ + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + cm_id_priv->state = IW_CM_STATE_IDLE; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + + /* Wake up waiters on connect complete */ + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * CM_ID <-- CLOSING + * + * 
If in the ESTABLISHED state, move to CLOSING. + */ +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) + cm_id_priv->state = IW_CM_STATE_CLOSING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * CM_ID <-- IDLE + * + * If in the ESTABLISHED or CLOSING states, the QP will have been + * moved by the provider to the ERR state. Disassociate the CM_ID from + * the QP, move to IDLE, and remove the 'connected' reference. + * + * If in some other state, the cm_id was destroyed asynchronously. + * This is the last reference that will result in waking up + * the app thread blocked in iw_destroy_cm_id. + */ +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_DESTROYING: + break; + default: + BUG_ON(1); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} + +static int process_event(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + int ret = 0; + + switch (iw_event->event) { + case IW_CM_EVENT_CONNECT_REQUEST: + cm_conn_req_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CONNECT_REPLY: + ret = cm_conn_rep_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_ESTABLISHED: + ret = cm_conn_est_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_DISCONNECT: + 
cm_disconnect_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CLOSE: + ret = cm_close_handler(cm_id_priv, iw_event); + break; + default: + BUG_ON(1); + } + + return ret; +} + +/* + * Process events on the work_list for the cm_id. If the callback + * function requests that the cm_id be deleted, a flag is set in the + * cm_id flags to indicate that when the last reference is + * removed, the cm_id is to be destroyed. This is necessary to + * distinguish between an object that will be destroyed by the app + * thread asleep on the destroy_comp list vs. an object destroyed + * here synchronously when the last reference is removed. + */ +static void cm_work_handler(void *arg) +{ + struct iwcm_work *work = (struct iwcm_work*)arg; + struct iwcm_id_private *cm_id_priv = work->cm_id; + unsigned long flags; + int empty; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + empty = list_empty(&cm_id_priv->work_list); + while (!empty) { + work = list_entry(cm_id_priv->work_list.next, + struct iwcm_work, list); + list_del_init(&work->list); + empty = list_empty(&cm_id_priv->work_list); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = process_event(cm_id_priv, &work->event); + kfree(work); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(&cm_id_priv->id); + } + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (iwcm_deref_id(cm_id_priv)) + return; + + if (atomic_read(&cm_id_priv->refcount)==0 && + test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)) { + kfree(cm_id_priv); + return; + } + spin_lock_irqsave(&cm_id_priv->lock, flags); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * This function is called on interrupt context. Schedule events on + * the iwcm_wq thread to allow callback functions to downcall into + * the CM and/or block. Events are queued to a per-CM_ID + * work_list. If this is the first event on the work_list, the work + * element is also queued on the iwcm_wq thread. 
+ * + * Each event holds a reference on the cm_id. Until the last posted + * event has been delivered and processed, the cm_id cannot be + * deleted. + */ +static void cm_event_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct iwcm_work *work; + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + + work = kmalloc(sizeof(*work), GFP_ATOMIC); + if (!work) + return; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + atomic_inc(&cm_id_priv->refcount); + + INIT_WORK(&work->work, cm_work_handler, work); + work->cm_id = cm_id_priv; + work->event = *iw_event; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (list_empty(&cm_id_priv->work_list)) { + list_add_tail(&work->list, &cm_id_priv->work_list); + queue_work(iwcm_wq, &work->work); + } else + list_add_tail(&work->list, &cm_id_priv->work_list); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +static int iwcm_init_qp_init_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS; + qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE| + IB_ACCESS_REMOTE_READ; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_rts_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = 0; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + 
spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct iwcm_id_private *cm_id_priv; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + switch (qp_attr->qp_state) { + case IB_QPS_INIT: + case IB_QPS_RTR: + ret = iwcm_init_qp_init_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + case IB_QPS_RTS: + ret = iwcm_init_qp_rts_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + default: + ret = -EINVAL; + break; + } + return ret; +} +EXPORT_SYMBOL(iw_cm_init_qp_attr); + +static int __init iw_cm_init(void) +{ + iwcm_wq = create_singlethread_workqueue("iw_cm_wq"); + if (!iwcm_wq) + return -ENOMEM; + + return 0; +} + +static void __exit iw_cm_cleanup(void) +{ + destroy_workqueue(iwcm_wq); +} + +module_init(iw_cm_init); +module_exit(iw_cm_cleanup); diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h new file mode 100644 index 0000000..0752a94 --- /dev/null +++ b/include/rdma/iw_cm.h @@ -0,0 +1,254 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if !defined(IW_CM_H) +#define IW_CM_H + +#include +#include + +struct iw_cm_id; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, /* passive side accept successful */ + IW_CM_EVENT_DISCONNECT, /* orderly shutdown */ + IW_CM_EVENT_CLOSE /* close complete */ +}; +enum iw_cm_event_status { + IW_CM_EVENT_STATUS_OK = 0, /* request successful */ + IW_CM_EVENT_STATUS_ACCEPTED = 0, /* connect request accepted */ + IW_CM_EVENT_STATUS_REJECTED, /* connect request rejected */ + IW_CM_EVENT_STATUS_TIMEOUT, /* the operation timed out */ + IW_CM_EVENT_STATUS_RESET, /* reset from remote peer */ + IW_CM_EVENT_STATUS_EINVAL, /* asynchronous failure for bad parm */ +}; +struct iw_cm_event { + enum iw_cm_event_type event; + enum iw_cm_event_status status; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; + void* provider_data; +}; + +/** + * iw_cm_handler - Function to be called by the IW CM when delivering events + * to the client. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. 
+ */ +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +/** + * iw_event_handler - Function called by the provider when delivering provider + * events to the IW CM. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. + */ +typedef void (*iw_event_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* client cb context */ + struct ib_device *device; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *provider_data; /* provider private data */ + iw_event_handler event_handler; /* cb for provider + events */ + /* Used by provider to add and remove refs on IW cm_id */ + void (*add_ref)(struct iw_cm_id *); + void (*rem_ref)(struct iw_cm_id *); +}; + +struct iw_cm_conn_param { + const void *private_data; + u16 private_data_len; + u32 ord; + u32 ird; + u32 qpn; +}; + +struct iw_cm_verbs { + void (*add_ref)(struct ib_qp *qp); + + void (*rem_ref)(struct ib_qp *qp); + + struct ib_qp * (*get_qp)(struct ib_device *device, + int qpn); + + int (*connect)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*accept)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*reject)(struct iw_cm_id *cm_id, + const void *pdata, u8 pdata_len); + + int (*create_listen)(struct iw_cm_id *cm_id, + int backlog); + + int (*destroy_listen)(struct iw_cm_id *cm_id); +}; + +/** + * iw_create_cm_id - Create an IW CM identifier. + * + * @device: The IB device on which to create the IW CM identifier. + * @event_handler: User callback invoked to report events associated with the + * returned IW CM identifier. + * @context: User specified context associated with the id. + */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, void *context); + +/** + * iw_destroy_cm_id - Destroy an IW CM identifier. 
+ *
+ * @cm_id: The previously created IW CM identifier to destroy.
+ *
+ * The client can assume that no events will be delivered for the CM ID after
+ * this function returns.
+ */
+void iw_destroy_cm_id(struct iw_cm_id *cm_id);
+
+/**
+ * iw_cm_unbind_qp - Unbind the specified IW CM identifier and QP
+ *
+ * @cm_id: The IW CM identifier to unbind from the QP.
+ * @qp: The QP
+ *
+ * This is called by the provider when destroying the QP to ensure
+ * that any references held by the IWCM are released. It may also
+ * be called by the IWCM when destroying a CM_ID so that any
+ * references held by the provider are released.
+ */
+void iw_cm_unbind_qp(struct iw_cm_id *cm_id, struct ib_qp *qp);
+
+/**
+ * iw_cm_get_qp - Return the ib_qp associated with a QPN
+ *
+ * @device: The IB device
+ * @qpn: The queue pair number
+ */
+struct ib_qp *iw_cm_get_qp(struct ib_device *device, int qpn);
+
+/**
+ * iw_cm_listen - Listen for incoming connection requests on the
+ * specified IW CM id.
+ *
+ * @cm_id: The IW CM identifier.
+ * @backlog: The maximum number of outstanding un-accepted inbound listen
+ *   requests to queue.
+ *
+ * The source address and port number are specified in the IW CM identifier
+ * structure.
+ */
+int iw_cm_listen(struct iw_cm_id *cm_id, int backlog);
+
+/**
+ * iw_cm_accept - Called to accept an incoming connect request.
+ *
+ * @cm_id: The IW CM identifier associated with the connection request.
+ * @iw_param: Pointer to a structure containing connection establishment
+ *   parameters.
+ *
+ * The specified cm_id will have been provided in the event data for a
+ * CONNECT_REQUEST event. Subsequent events related to this connection will be
+ * delivered to the specified IW CM identifier and may occur prior to the
+ * return of this function. If this function returns a non-zero value, the
+ * client can assume that no events will be delivered to the specified IW CM
+ * identifier.
+ */
+int iw_cm_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param);
+
+/**
+ * iw_cm_reject - Reject an incoming connection request.
+ *
+ * @cm_id: Connection identifier associated with the request.
+ * @private_data: Pointer to data to deliver to the remote peer as part of the
+ *   reject message.
+ * @private_data_len: The number of bytes in the private_data parameter.
+ *
+ * The client can assume that no events will be delivered to the specified IW
+ * CM identifier following the return of this function. The private_data
+ * buffer is available for reuse when this function returns.
+ */
+int iw_cm_reject(struct iw_cm_id *cm_id, const void *private_data,
+		 u8 private_data_len);
+
+/**
+ * iw_cm_connect - Called to request a connection to a remote peer.
+ *
+ * @cm_id: The IW CM identifier for the connection.
+ * @iw_param: Pointer to a structure containing connection establishment
+ *   parameters.
+ *
+ * Events may be delivered to the specified IW CM identifier prior to the
+ * return of this function. If this function returns a non-zero value, the
+ * client can assume that no events will be delivered to the specified IW CM
+ * identifier.
+ */
+int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param);
+
+/**
+ * iw_cm_disconnect - Close the specified connection.
+ *
+ * @cm_id: The IW CM identifier to close.
+ * @abrupt: If 0, the connection will be closed gracefully, otherwise, the
+ *   connection will be reset.
+ *
+ * The IW CM identifier is still active until the IW_CM_EVENT_CLOSE event is
+ * delivered.
+ */
+int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt);
+
+/**
+ * iw_cm_init_qp_attr - Called to initialize the attributes of the QP
+ * associated with an IW CM identifier.
+ *
+ * @cm_id: The IW CM identifier associated with the QP
+ * @qp_attr: Pointer to the QP attributes structure.
+ * @qp_attr_mask: Pointer to a bit vector specifying which QP attributes are
+ *   valid.
+ */
+int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, struct ib_qp_attr *qp_attr,
+		       int *qp_attr_mask);
+
+#endif /* IW_CM_H */
diff --git a/include/rdma/iw_cm_private.h b/include/rdma/iw_cm_private.h
new file mode 100644
index 0000000..aba8cb2
--- /dev/null
+++ b/include/rdma/iw_cm_private.h
@@ -0,0 +1,62 @@
+/*
+ * Copyright (c) 2005 Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#if !defined(IW_CM_PRIVATE_H)
+#define IW_CM_PRIVATE_H
+
+#include
+
+enum iw_cm_state {
+	IW_CM_STATE_IDLE,		/* unbound, inactive */
+	IW_CM_STATE_LISTEN,		/* listen waiting for connect */
+	IW_CM_STATE_CONN_RECV,		/* inbound waiting for user accept */
+	IW_CM_STATE_CONN_SENT,		/* outbound waiting for peer accept */
+	IW_CM_STATE_ESTABLISHED,	/* established */
+	IW_CM_STATE_CLOSING,		/* disconnect */
+	IW_CM_STATE_DESTROYING		/* object being deleted */
+};
+
+struct iwcm_id_private {
+	struct iw_cm_id	id;
+	enum iw_cm_state state;
+	unsigned long flags;
+	struct ib_qp *qp;
+	struct completion destroy_comp;
+	wait_queue_head_t connect_wait;
+	struct list_head work_list;
+	spinlock_t lock;
+	atomic_t refcount;
+};
+#define IWCM_F_CALLBACK_DESTROY	1
+#define IWCM_F_CONNECT_WAIT	2
+
+#endif /* IW_CM_PRIVATE_H */

From swise at opengridcomputing.com  Wed Jun  7 13:06:10 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 07 Jun 2006 15:06:10 -0500
Subject: [openib-general] [PATCH v2 2/2] iWARP Core Changes.
In-Reply-To: <20060607200600.9003.56328.stgit@stevo-desktop>
References: <20060607200600.9003.56328.stgit@stevo-desktop>
Message-ID: <20060607200610.9003.54068.stgit@stevo-desktop>

This patch contains modifications to the existing rdma header files,
core files, drivers, and ulp files to support iWARP.

Review updates:

- copy_addr() -> rdma_copy_addr()
- dst_dev_addr param in rdma_copy_addr to const.
- various spacing nits with recasting
- include linux/inetdevice.h to get ip_dev_find() prototype.
---

 drivers/infiniband/core/Makefile             |    4
 drivers/infiniband/core/addr.c               |   19 +
 drivers/infiniband/core/cache.c              |    8 -
 drivers/infiniband/core/cm.c                 |    3
 drivers/infiniband/core/cma.c                |  353 +++++++++++++++++++++++---
 drivers/infiniband/core/device.c             |    6
 drivers/infiniband/core/mad.c                |   11 +
 drivers/infiniband/core/sa_query.c           |    5
 drivers/infiniband/core/smi.c                |   18 +
 drivers/infiniband/core/sysfs.c              |   18 +
 drivers/infiniband/core/ucm.c                |    5
 drivers/infiniband/core/user_mad.c           |    9 -
 drivers/infiniband/hw/ipath/ipath_verbs.c    |    2
 drivers/infiniband/hw/mthca/mthca_provider.c |    2
 drivers/infiniband/ulp/ipoib/ipoib_main.c    |    8 +
 drivers/infiniband/ulp/srp/ib_srp.c          |    2
 include/rdma/ib_addr.h                       |   15 +
 include/rdma/ib_verbs.h                      |   39 +++
 18 files changed, 435 insertions(+), 92 deletions(-)

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 68e73ec..163d991 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -1,7 +1,7 @@
 infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o

 obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \
-				ib_cm.o $(infiniband-y)
+				ib_cm.o iw_cm.o $(infiniband-y)
 obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o
@@ -14,6 +14,8 @@ ib_sa-y := sa_query.o

 ib_cm-y := cm.o

+iw_cm-y := iwcm.o
+
 rdma_cm-y := cma.o

 ib_addr-y := addr.o
diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index d294bbc..83f84ef 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -32,6 +32,7 @@
 #include
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -60,12 +61,15 @@ static LIST_HEAD(req_list);
 static DECLARE_WORK(work, process_req, NULL);
 static struct workqueue_struct *addr_wq;

-static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
-		     unsigned char *dst_dev_addr)
+int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+		   const unsigned char *dst_dev_addr)
 {
 	switch (dev->type) {
 	case ARPHRD_INFINIBAND:
-		dev_addr->dev_type = IB_NODE_CA;
+		dev_addr->dev_type = RDMA_NODE_IB_CA;
+		break;
+	case ARPHRD_ETHER:
+		dev_addr->dev_type = RDMA_NODE_RNIC;
 		break;
 	default:
 		return -EADDRNOTAVAIL;
@@ -77,6 +81,7 @@ static int copy_addr(struct rdma_dev_add
 	memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
 	return 0;
 }
+EXPORT_SYMBOL(rdma_copy_addr);

 int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
 {
@@ -88,7 +93,7 @@ int rdma_translate_ip(struct sockaddr *a
 	if (!dev)
 		return -EADDRNOTAVAIL;

-	ret = copy_addr(dev_addr, dev, NULL);
+	ret = rdma_copy_addr(dev_addr, dev, NULL);
 	dev_put(dev);
 	return ret;
 }
@@ -160,7 +165,7 @@ static int addr_resolve_remote(struct so
 	/* If the device does ARP internally, return 'done' */
 	if (rt->idev->dev->flags & IFF_NOARP) {
-		copy_addr(addr, rt->idev->dev, NULL);
+		rdma_copy_addr(addr, rt->idev->dev, NULL);
 		goto put;
 	}
@@ -180,7 +185,7 @@ static int addr_resolve_remote(struct so
 		src_in->sin_addr.s_addr = rt->rt_src;
 	}

-	ret = copy_addr(addr, neigh->dev, neigh->ha);
+	ret = rdma_copy_addr(addr, neigh->dev, neigh->ha);
 release:
 	neigh_release(neigh);
 put:
@@ -244,7 +249,7 @@ static int addr_resolve_local(struct soc
 	if (ZERONET(src_ip)) {
 		src_in->sin_family = dst_in->sin_family;
 		src_in->sin_addr.s_addr = dst_ip;
-		ret = copy_addr(addr, dev, dev->dev_addr);
+		ret = rdma_copy_addr(addr, dev, dev->dev_addr);
 	} else if (LOOPBACK(src_ip)) {
 		ret = rdma_translate_ip((struct sockaddr *)dst_in, addr);
 		if (!ret)
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index e05ca2c..061858c 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -32,13 +32,12 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $
+ * $Id: cache.c 6885 2006-05-03 18:22:02Z sean.hefty $
  */

 #include
 #include
 #include
-#include	/* INIT_WORK, schedule_work(), flush_scheduled_work() */

 #include
@@ -62,12 +61,13 @@ struct ib_update_work {
 static inline int start_port(struct ib_device *device)
 {
-	return device->node_type == IB_NODE_SWITCH ? 0 : 1;
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
 }

 static inline int end_port(struct ib_device *device)
 {
-	return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt;
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
 }

 int ib_get_cached_gid(struct ib_device *device,
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 1c7463b..cf43ccb 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3253,6 +3253,9 @@ static void cm_add_one(struct ib_device
 	int ret;
 	u8 i;

+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
 	cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) *
 			 device->phys_port_cnt, GFP_KERNEL);
 	if (!cm_dev)
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 94555d2..414600c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
 #include
+#include

 #include
@@ -43,6 +44,7 @@
 #include
 #include
 #include
 #include
+#include

 MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
@@ -124,6 +126,7 @@ struct rdma_id_private {
 	int query_id;
 	union {
 		struct ib_cm_id	*ib;
+		struct iw_cm_id	*iw;
 	} cm_id;

 	u32 seq_num;
@@ -259,13 +262,23 @@ static void cma_detach_from_dev(struct r
 	id_priv->cma_dev = NULL;
 }

-static int cma_acquire_ib_dev(struct rdma_id_private *id_priv)
+static int cma_acquire_dev(struct rdma_id_private *id_priv)
 {
+	enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type;
 	struct cma_device *cma_dev;
 	union ib_gid *gid;
 	int ret = -ENODEV;

-	gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr);
+	switch (rdma_node_get_transport(dev_type)) {
+	case RDMA_TRANSPORT_IB:
+		gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr);
+		break;
+	case RDMA_TRANSPORT_IWARP:
+		gid = iw_addr_get_sgid(&id_priv->id.route.addr.dev_addr);
+		break;
+	default:
+		return -ENODEV;
+	}

 	mutex_lock(&lock);
 	list_for_each_entry(cma_dev, &dev_list, list) {
@@ -280,16 +293,6 @@ static int cma_acquire_ib_dev(struct rdm
 	return ret;
 }

-static int cma_acquire_dev(struct rdma_id_private *id_priv)
-{
-	switch (id_priv->id.route.addr.dev_addr.dev_type) {
-	case IB_NODE_CA:
-		return cma_acquire_ib_dev(id_priv);
-	default:
-		return -ENODEV;
-	}
-}
-
 static void cma_deref_id(struct rdma_id_private *id_priv)
 {
 	if (atomic_dec_and_test(&id_priv->refcount))
@@ -347,6 +350,16 @@ static int cma_init_ib_qp(struct rdma_id
 					  IB_QP_PKEY_INDEX | IB_QP_PORT);
 }

+static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp)
+{
+	struct ib_qp_attr qp_attr;
+
+	qp_attr.qp_state = IB_QPS_INIT;
+	qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE;
+
+	return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS);
+}
+
 int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd,
 		   struct ib_qp_init_attr *qp_init_attr)
 {
@@ -362,10 +375,13 @@ int rdma_create_qp(struct rdma_cm_id *id
 	if (IS_ERR(qp))
 		return PTR_ERR(qp);

-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = cma_init_ib_qp(id_priv, qp);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = cma_init_iw_qp(id_priv, qp);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -451,13 +467,17 @@ int rdma_init_qp_attr(struct rdma_cm_id
 	int ret;

 	id_priv = container_of(id, struct rdma_id_private, id);
-	switch (id_priv->id.device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr,
 					 qp_attr_mask);
 		if (qp_attr->qp_state == IB_QPS_RTR)
 			qp_attr->rq_psn = id_priv->seq_num;
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,
+					 qp_attr_mask);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -590,8 +610,8 @@ static int cma_notify_user(struct rdma_i
 static void cma_cancel_route(struct rdma_id_private *id_priv)
 {
-	switch (id_priv->id.device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		if (id_priv->query)
 			ib_sa_cancel_query(id_priv->query_id, id_priv->query);
 		break;
@@ -611,11 +631,15 @@ static void cma_destroy_listen(struct rd
 	cma_exch(id_priv, CMA_DESTROYING);

 	if (id_priv->cma_dev) {
-		switch (id_priv->id.device->node_type) {
-		case IB_NODE_CA:
+		switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+		case RDMA_TRANSPORT_IB:
 			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
 				ib_destroy_cm_id(id_priv->cm_id.ib);
 			break;
+		case RDMA_TRANSPORT_IWARP:
+			if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw))
+				iw_destroy_cm_id(id_priv->cm_id.iw);
+			break;
 		default:
 			break;
 		}
@@ -690,11 +714,15 @@ void rdma_destroy_id(struct rdma_cm_id *
 	cma_cancel_operation(id_priv, state);

 	if (id_priv->cma_dev) {
-		switch (id->device->node_type) {
-		case IB_NODE_CA:
+		switch (rdma_node_get_transport(id->device->node_type)) {
+		case RDMA_TRANSPORT_IB:
 			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
 				ib_destroy_cm_id(id_priv->cm_id.ib);
 			break;
+		case RDMA_TRANSPORT_IWARP:
+			if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw))
+				iw_destroy_cm_id(id_priv->cm_id.iw);
+			break;
 		default:
 			break;
 		}
@@ -868,7 +896,7 @@ static struct rdma_id_private *cma_new_i
 	ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid);
 	ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid);
 	ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey));
-	rt->addr.dev_addr.dev_type = IB_NODE_CA;
+	rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA;

 	id_priv = container_of(id, struct rdma_id_private, id);
 	id_priv->state = CMA_CONNECT;
@@ -897,7 +925,7 @@ static int cma_req_handler(struct ib_cm_
 	}

 	atomic_inc(&conn_id->dev_remove);
-	ret = cma_acquire_ib_dev(conn_id);
+	ret = cma_acquire_dev(conn_id);
 	if (ret) {
 		ret = -ENODEV;
 		cma_release_remove(conn_id);
@@ -981,6 +1009,123 @@ static void cma_set_compare_data(enum rd
 	}
 }

+static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
+{
+	struct rdma_id_private *id_priv = iw_id->context;
+	enum rdma_cm_event_type event = 0;
+	struct sockaddr_in *sin;
+	int ret = 0;
+
+	atomic_inc(&id_priv->dev_remove);
+
+	switch (iw_event->event) {
+	case IW_CM_EVENT_CLOSE:
+		event = RDMA_CM_EVENT_DISCONNECTED;
+		break;
+	case IW_CM_EVENT_CONNECT_REPLY:
+		sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+		*sin = iw_event->local_addr;
+		sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr;
+		*sin = iw_event->remote_addr;
+		if (iw_event->status)
+			event = RDMA_CM_EVENT_REJECTED;
+		else
+			event = RDMA_CM_EVENT_ESTABLISHED;
+		break;
+	case IW_CM_EVENT_ESTABLISHED:
+		event = RDMA_CM_EVENT_ESTABLISHED;
+		break;
+	default:
+		BUG_ON(1);
+	}
+
+	ret = cma_notify_user(id_priv, event, iw_event->status,
+			      iw_event->private_data,
+			      iw_event->private_data_len);
+	if (ret) {
+		/* Destroy the CM ID by returning a non-zero value. */
+		id_priv->cm_id.iw = NULL;
+		cma_exch(id_priv, CMA_DESTROYING);
+		cma_release_remove(id_priv);
+		rdma_destroy_id(&id_priv->id);
+		return ret;
+	}
+
+	cma_release_remove(id_priv);
+	return ret;
+}
+
+static int iw_conn_req_handler(struct iw_cm_id *cm_id,
+			       struct iw_cm_event *iw_event)
+{
+	struct rdma_cm_id *new_cm_id;
+	struct rdma_id_private *listen_id, *conn_id;
+	struct sockaddr_in *sin;
+	struct net_device *dev;
+	int ret;
+
+	listen_id = cm_id->context;
+	atomic_inc(&listen_id->dev_remove);
+	if (!cma_comp(listen_id, CMA_LISTEN)) {
+		ret = -ECONNABORTED;
+		goto out;
+	}
+
+	/* Create a new RDMA id for the new IW CM ID */
+	new_cm_id = rdma_create_id(listen_id->id.event_handler,
+				   listen_id->id.context,
+				   RDMA_PS_TCP);
+	if (!new_cm_id) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	conn_id = container_of(new_cm_id, struct rdma_id_private, id);
+	atomic_inc(&conn_id->dev_remove);
+	conn_id->state = CMA_CONNECT;
+
+	dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr);
+	if (!dev) {
+		ret = -EADDRNOTAVAIL;
+		rdma_destroy_id(new_cm_id);
+		goto out;
+	}
+	ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL);
+	if (ret) {
+		rdma_destroy_id(new_cm_id);
+		goto out;
+	}
+
+	ret = cma_acquire_dev(conn_id);
+	if (ret) {
+		rdma_destroy_id(new_cm_id);
+		goto out;
+	}
+
+	conn_id->cm_id.iw = cm_id;
+	cm_id->context = conn_id;
+	cm_id->cm_handler = cma_iw_handler;
+
+	sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr;
+	*sin = iw_event->local_addr;
+	sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr;
+	*sin = iw_event->remote_addr;
+
+	ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0,
+			      iw_event->private_data,
+			      iw_event->private_data_len);
+	if (ret) {
+		/* User wants to destroy the CM ID */
+		conn_id->cm_id.iw = NULL;
+		cma_exch(conn_id, CMA_DESTROYING);
+		cma_release_remove(conn_id);
+		rdma_destroy_id(&conn_id->id);
+	}
+
+out:
+	cma_release_remove(listen_id);
+	return ret;
+}
+
 static int cma_ib_listen(struct rdma_id_private *id_priv)
 {
 	struct ib_cm_compare_data compare_data;
@@ -1010,6 +1155,30 @@ static int cma_ib_listen(struct rdma_id_
 	return ret;
 }

+static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog)
+{
+	int ret;
+	struct sockaddr_in *sin;
+
+	id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device,
+					    iw_conn_req_handler,
+					    id_priv);
+	if (IS_ERR(id_priv->cm_id.iw))
+		return PTR_ERR(id_priv->cm_id.iw);
+
+	sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+	id_priv->cm_id.iw->local_addr = *sin;
+
+	ret = iw_cm_listen(id_priv->cm_id.iw, backlog);
+
+	if (ret) {
+		iw_destroy_cm_id(id_priv->cm_id.iw);
+		id_priv->cm_id.iw = NULL;
+	}
+
+	return ret;
+}
+
 static int cma_listen_handler(struct rdma_cm_id *id,
 			      struct rdma_cm_event *event)
 {
@@ -1085,12 +1254,17 @@ int rdma_listen(struct rdma_cm_id *id, i
 		return -EINVAL;

 	if (id->device) {
-		switch (id->device->node_type) {
-		case IB_NODE_CA:
+		switch (rdma_node_get_transport(id->device->node_type)) {
+		case RDMA_TRANSPORT_IB:
 			ret = cma_ib_listen(id_priv);
 			if (ret)
 				goto err;
 			break;
+		case RDMA_TRANSPORT_IWARP:
+			ret = cma_iw_listen(id_priv, backlog);
+			if (ret)
+				goto err;
+			break;
 		default:
 			ret = -ENOSYS;
 			goto err;
@@ -1229,6 +1403,23 @@ err:
 }
 EXPORT_SYMBOL(rdma_set_ib_paths);

+static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms)
+{
+	struct cma_work *work;
+
+	work = kzalloc(sizeof *work, GFP_KERNEL);
+	if (!work)
+		return -ENOMEM;
+
+	work->id = id_priv;
+	INIT_WORK(&work->work, cma_work_handler, work);
+	work->old_state = CMA_ROUTE_QUERY;
+	work->new_state = CMA_ROUTE_RESOLVED;
+	work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED;
+	queue_work(cma_wq, &work->work);
+	return 0;
+}
+
 int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 {
 	struct rdma_id_private *id_priv;
@@ -1239,10 +1430,13 @@ int rdma_resolve_route(struct rdma_cm_id
 		return -EINVAL;

 	atomic_inc(&id_priv->refcount);
-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = cma_resolve_ib_route(id_priv, timeout_ms);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = cma_resolve_iw_route(id_priv, timeout_ms);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -1354,8 +1548,8 @@ static int cma_resolve_loopback(struct r
 		   ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr));

 	if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) {
-		src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr;
-		dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr;
+		src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+		dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr;
 		src_in->sin_family = dst_in->sin_family;
 		src_in->sin_addr.s_addr = dst_in->sin_addr.s_addr;
 	}
@@ -1646,6 +1840,47 @@ out:
 	return ret;
 }

+static int cma_connect_iw(struct rdma_id_private *id_priv,
+			  struct rdma_conn_param *conn_param)
+{
+	struct iw_cm_id *cm_id;
+	struct sockaddr_in *sin;
+	int ret;
+	struct iw_cm_conn_param iw_param;
+
+	cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv);
+	if (IS_ERR(cm_id)) {
+		ret = PTR_ERR(cm_id);
+		goto out;
+	}
+
+	id_priv->cm_id.iw = cm_id;
+
+	sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+	cm_id->local_addr = *sin;
+
+	sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr;
+	cm_id->remote_addr = *sin;
+
+	ret = cma_modify_qp_rtr(&id_priv->id);
+	if (ret) {
+		iw_destroy_cm_id(cm_id);
+		return ret;
+	}
+
+	iw_param.ord = conn_param->initiator_depth;
+	iw_param.ird = conn_param->responder_resources;
+	iw_param.private_data = conn_param->private_data;
+	iw_param.private_data_len = conn_param->private_data_len;
+	if (id_priv->id.qp)
+		iw_param.qpn = id_priv->qp_num;
+	else
+		iw_param.qpn = conn_param->qp_num;
+	ret = iw_cm_connect(cm_id, &iw_param);
+out:
+	return ret;
+}
+
 int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 {
 	struct rdma_id_private *id_priv;
@@ -1661,10 +1896,13 @@ int rdma_connect(struct rdma_cm_id *id,
 		id_priv->srq = conn_param->srq;
 	}

-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = cma_connect_ib(id_priv, conn_param);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = cma_connect_iw(id_priv, conn_param);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -1705,6 +1943,28 @@ static int cma_accept_ib(struct rdma_id_
 	return ib_send_cm_rep(id_priv->cm_id.ib, &rep);
 }

+static int cma_accept_iw(struct rdma_id_private *id_priv,
+			 struct rdma_conn_param *conn_param)
+{
+	struct iw_cm_conn_param iw_param;
+	int ret;
+
+	ret = cma_modify_qp_rtr(&id_priv->id);
+	if (ret)
+		return ret;
+
+	iw_param.ord = conn_param->initiator_depth;
+	iw_param.ird = conn_param->responder_resources;
+	iw_param.private_data = conn_param->private_data;
+	iw_param.private_data_len = conn_param->private_data_len;
+	if (id_priv->id.qp) {
+		iw_param.qpn = id_priv->qp_num;
+	} else
+		iw_param.qpn = conn_param->qp_num;
+
+	return iw_cm_accept(id_priv->cm_id.iw, &iw_param);
+}
+
 int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 {
 	struct rdma_id_private *id_priv;
@@ -1720,13 +1980,16 @@ int rdma_accept(struct rdma_cm_id *id, s
 		id_priv->srq = conn_param->srq;
 	}

-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		if (conn_param)
 			ret = cma_accept_ib(id_priv, conn_param);
 		else
 			ret = cma_rep_recv(id_priv);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = cma_accept_iw(id_priv, conn_param);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -1753,12 +2016,16 @@ int rdma_reject(struct rdma_cm_id *id, c
 	if (!cma_comp(id_priv, CMA_CONNECT))
 		return -EINVAL;

-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = ib_send_cm_rej(id_priv->cm_id.ib,
 				     IB_CM_REJ_CONSUMER_DEFINED, NULL, 0,
 				     private_data, private_data_len);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = iw_cm_reject(id_priv->cm_id.iw,
+				   private_data, private_data_len);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -1777,16 +2044,18 @@ int rdma_disconnect(struct rdma_cm_id *i
 	    !cma_comp(id_priv, CMA_DISCONNECT))
 		return -EINVAL;

-	ret = cma_modify_qp_err(id);
-	if (ret)
-		goto out;
-
-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
+		ret = cma_modify_qp_err(id);
+		if (ret)
+			goto out;
 		/* Initiate or respond to a disconnect. */
 		if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0))
 			ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = iw_cm_disconnect(id_priv->cm_id.iw, 0);
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b2f3cb9..7318fba 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -30,7 +30,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: device.c 1349 2004-12-16 21:09:43Z roland $
+ * $Id: device.c 5943 2006-03-22 00:58:04Z roland $
  */

 #include
@@ -505,7 +505,7 @@ int ib_query_port(struct ib_device *devi
 		  u8 port_num,
 		  struct ib_port_attr *port_attr)
 {
-	if (device->node_type == IB_NODE_SWITCH) {
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		if (port_num)
 			return -EINVAL;
 	} else if (port_num < 1 || port_num > device->phys_port_cnt)
@@ -580,7 +580,7 @@ int ib_modify_port(struct ib_device *dev
 		   u8 port_num, int port_modify_mask,
 		   struct ib_port_modify *port_modify)
 {
-	if (device->node_type == IB_NODE_SWITCH) {
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		if (port_num)
 			return -EINVAL;
 	} else if (port_num < 1 || port_num > device->phys_port_cnt)
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index b38e02a..a928ecf 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation. All rights reserved.
  * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved.
  *
@@ -31,7 +31,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $
+ * $Id: mad.c 7294 2006-05-17 18:12:30Z roland $
  */

 #include
 #include
@@ -2877,7 +2877,10 @@ static void ib_mad_init_device(struct ib
 {
 	int start, end, i;

-	if (device->node_type == IB_NODE_SWITCH) {
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		start = 0;
 		end = 0;
 	} else {
@@ -2924,7 +2927,7 @@ static void ib_mad_remove_device(struct
 {
 	int i, num_ports, cur_port;

-	if (device->node_type == IB_NODE_SWITCH) {
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		num_ports = 1;
 		cur_port = 0;
 	} else {
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 501cc05..4230277 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -887,7 +887,10 @@ static void ib_sa_add_one(struct ib_devi
 	struct ib_sa_device *sa_dev;
 	int s, e, i;

-	if (device->node_type == IB_NODE_SWITCH)
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH)
 		s = e = 0;
 	else {
 		s = 1;
diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c
index 35852e7..b81b2b9 100644
--- a/drivers/infiniband/core/smi.c
+++ b/drivers/infiniband/core/smi.c
@@ -34,7 +34,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $
+ * $Id: smi.c 5258 2006-02-01 20:32:40Z sean.hefty $
  */

 #include
@@ -64,7 +64,7 @@ int smi_handle_dr_smp_send(struct ib_smp
 	/* C14-9:2 */
 	if (hop_ptr && hop_ptr < hop_cnt) {
-		if (node_type != IB_NODE_SWITCH)
+		if (node_type != RDMA_NODE_IB_SWITCH)
 			return 0;

 		/* smp->return_path set when received */
@@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp
 	if (hop_ptr == hop_cnt) {
 		/* smp->return_path set when received */
 		smp->hop_ptr++;
-		return (node_type == IB_NODE_SWITCH ||
+		return (node_type == RDMA_NODE_IB_SWITCH ||
 			smp->dr_dlid == IB_LID_PERMISSIVE);
 	}
@@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp
 	/* C14-13:2 */
 	if (2 <= hop_ptr && hop_ptr <= hop_cnt) {
-		if (node_type != IB_NODE_SWITCH)
+		if (node_type != RDMA_NODE_IB_SWITCH)
 			return 0;

 		smp->hop_ptr--;
@@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp
 	if (hop_ptr == 1) {
 		smp->hop_ptr--;
 		/* C14-13:3 -- SMPs destined for SM shouldn't be here */
-		return (node_type == IB_NODE_SWITCH ||
+		return (node_type == RDMA_NODE_IB_SWITCH ||
 			smp->dr_slid == IB_LID_PERMISSIVE);
 	}
@@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp
 	/* C14-9:2 -- intermediate hop */
 	if (hop_ptr && hop_ptr < hop_cnt) {
-		if (node_type != IB_NODE_SWITCH)
+		if (node_type != RDMA_NODE_IB_SWITCH)
 			return 0;

 		smp->return_path[hop_ptr] = port_num;
@@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp
 		smp->return_path[hop_ptr] = port_num;
 		/* smp->hop_ptr updated when sending */

-		return (node_type == IB_NODE_SWITCH ||
+		return (node_type == RDMA_NODE_IB_SWITCH ||
 			smp->dr_dlid == IB_LID_PERMISSIVE);
 	}
@@ -175,7 +175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp
 	/* C14-13:2 */
 	if (2 <= hop_ptr && hop_ptr <= hop_cnt) {
-		if (node_type != IB_NODE_SWITCH)
+		if (node_type != RDMA_NODE_IB_SWITCH)
 			return 0;

 		/* smp->hop_ptr updated when sending */
@@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp
 			return 1;
 		}
 		/* smp->hop_ptr updated when sending */
-		return (node_type == IB_NODE_SWITCH);
+		return (node_type == RDMA_NODE_IB_SWITCH);
 	}

 	/* C14-13:4 -- hop_ptr = 0 -> give to SM */
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 21f9282..cfd2c06 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -31,7 +31,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $
+ * $Id: sysfs.c 6940 2006-05-04 17:04:55Z roland $
  */

 #include "core_priv.h"
@@ -589,10 +589,16 @@ static ssize_t show_node_type(struct cla
 		return -ENODEV;

 	switch (dev->node_type) {
-	case IB_NODE_CA:     return sprintf(buf, "%d: CA\n", dev->node_type);
-	case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type);
-	case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type);
-	default:             return sprintf(buf, "%d: \n", dev->node_type);
+	case RDMA_NODE_IB_CA:
+		return sprintf(buf, "%d: CA\n", dev->node_type);
+	case RDMA_NODE_RNIC:
+		return sprintf(buf, "%d: RNIC\n", dev->node_type);
+	case RDMA_NODE_IB_SWITCH:
+		return sprintf(buf, "%d: switch\n", dev->node_type);
+	case RDMA_NODE_IB_ROUTER:
+		return sprintf(buf, "%d: router\n", dev->node_type);
+	default:
+		return sprintf(buf, "%d: \n", dev->node_type);
 	}
 }
@@ -708,7 +714,7 @@ int ib_device_register_sysfs(struct ib_d
 	if (ret)
 		goto err_put;

-	if (device->node_type == IB_NODE_SWITCH) {
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		ret = add_port(device, 0);
 		if (ret)
 			goto err_put;
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 67caf36..ad2e417 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -30,7 +30,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
* - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ + * $Id: ucm.c 7119 2006-05-11 16:40:38Z sean.hefty $ */ #include @@ -1248,7 +1248,8 @@ static void ib_ucm_add_one(struct ib_dev { struct ib_ucm_device *ucm_dev; - if (!device->alloc_ucontext) + if (!device->alloc_ucontext || + rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index afe70a5..0cbd692 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: user_mad.c 6041 2006-03-27 21:06:00Z halr $ */ #include @@ -967,7 +967,10 @@ static void ib_umad_add_one(struct ib_de struct ib_umad_device *umad_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 28fdbda..e4b45d7 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -984,7 +984,7 @@ static void *ipath_register_ib_device(in (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); - dev->node_type = IB_NODE_CA; + dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; dev->dma_device = ipath_layer_get_device(dd); dev->class_dev.dev = dev->dma_device; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index a2eae8a..5c31819 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1273,7 +1273,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); - dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 1c6ea1c..262427f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1084,13 +1084,16 @@ static void ipoib_add_one(struct ib_devi struct ipoib_dev_priv *priv; int s, e, p; + if 
(rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { @@ -1114,6 +1117,9 @@ static void ipoib_remove_one(struct ib_d struct ipoib_dev_priv *priv, *tmp; struct list_head *dev_list; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index f1401e1..bba2956 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1845,7 +1845,7 @@ static void srp_add_one(struct ib_device if (IS_ERR(srp_dev->fmr_pool)) srp_dev->fmr_pool = NULL; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index fcb5ba8..d95d3eb 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -40,7 +40,7 @@ struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; - enum ib_node_type dev_type; + enum rdma_node_type dev_type; }; /** @@ -72,6 +72,9 @@ int rdma_resolve_ip(struct sockaddr *src void rdma_addr_cancel(struct rdma_dev_addr *addr); +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + const unsigned char *dst_dev_addr); + static inline int ip_addr_size(struct sockaddr *addr) { return addr->sa_family == AF_INET6 ? 
@@ -111,4 +114,14 @@ static inline void ib_addr_set_dgid(stru memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); } +static inline union ib_gid* iw_addr_get_sgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->src_dev_addr; +} + +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->dst_dev_addr; +} + #endif /* IB_ADDR_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index aeb4fcd..eac2d8f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + * $Id: ib_verbs.h 6885 2006-05-03 18:22:02Z sean.hefty $ */ #if !defined(IB_VERBS_H) @@ -56,12 +56,35 @@ union ib_gid { } global; }; -enum ib_node_type { - IB_NODE_CA = 1, - IB_NODE_SWITCH, - IB_NODE_ROUTER +enum rdma_node_type { + /* IB values map to NodeInfo:NodeType. */ + RDMA_NODE_IB_CA = 1, + RDMA_NODE_IB_SWITCH, + RDMA_NODE_IB_ROUTER, + RDMA_NODE_RNIC }; +enum rdma_transport_type { + RDMA_TRANSPORT_IB, + RDMA_TRANSPORT_IWARP +}; + +static inline enum rdma_transport_type +rdma_node_get_transport(enum rdma_node_type node_type) +{ + switch (node_type) { + case RDMA_NODE_IB_CA: + case RDMA_NODE_IB_SWITCH: + case RDMA_NODE_IB_ROUTER: + return RDMA_TRANSPORT_IB; + case RDMA_NODE_RNIC: + return RDMA_TRANSPORT_IWARP; + default: + BUG(); + return 0; + } +} + enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), @@ -78,6 +101,9 @@ enum ib_device_cap_flags { IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_ZERO_STAG = (1<<15), + IB_DEVICE_SEND_W_INV = (1<<16), + IB_DEVICE_MEM_WINDOW = (1<<17) }; enum ib_atomic_cap { @@ -830,6 +856,7 @@ struct ib_cache { u8 *lmc_cache; }; +struct iw_cm_verbs; struct ib_device { struct device *dma_device; @@ -846,6 +873,8 @@ struct ib_device { u32 
flags; + struct iw_cm_verbs *iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device,

From swise at opengridcomputing.com Wed Jun 7 13:06:46 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 07 Jun 2006 15:06:46 -0500
Subject: [openib-general] [PATCH v2 0/7][RFC] Ammasso 1100 iWARP Driver
Message-ID: <20060607200646.9259.24588.stgit@stevo-desktop>

This patchset implements the iWARP provider driver for the Ammasso 1100 RNIC. It depends on the "iWARP Core Support" patch set. We're submitting it for review with the goal of inclusion in the 2.6.19 kernel.

This code has gone through several reviews on the openib-general list. Now we are submitting it for external review by the Linux community.

This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.18 branch. The patchset consists of 7 patches:

1 - Low-level device interface and native stack support
2 - Work request definitions
3 - Provider interface
4 - Memory management
5 - User mode message queue implementation
6 - Verbs queue implementation
7 - Kconfig and Makefile

I believe I've addressed all the round 1 review comments. Details of the changes are tracked in each patch comment.

Signed-off-by: Tom Tucker
Signed-off-by: Steve Wise

From swise at opengridcomputing.com Wed Jun 7 13:06:55 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 07 Jun 2006 15:06:55 -0500
Subject: [openib-general] [PATCH v2 4/7] AMSO1100 Memory Management.
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop>
References: <20060607200646.9259.24588.stgit@stevo-desktop>
Message-ID: <20060607200655.9259.90768.stgit@stevo-desktop>

Review Changes:
- sizeof -> sizeof()

--- drivers/infiniband/hw/amso1100/c2_alloc.c | 256 ++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_mm.c | 378 +++++++++++++++++++++++++++++ 2 files changed, 634 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_alloc.c b/drivers/infiniband/hw/amso1100/c2_alloc.c new file mode 100644 index 0000000..e496eb7 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_alloc.c @@ -0,0 +1,256 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include "c2.h" + +/* Trivial bitmap-based allocator */ +u32 c2_alloc(struct c2_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) + obj = find_first_zero_bit(alloc->table, alloc->max); + if (obj < alloc->max) { + set_bit(obj, alloc->table); /* mark the slot allocated */ + alloc->last = obj + 1; + if (alloc->last >= alloc->max) + alloc->last = 0; + } + spin_unlock(&alloc->lock); + + return obj; +} + +void c2_free(struct c2_alloc *alloc, u32 obj) +{ + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + spin_unlock(&alloc->lock); +} + +int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 reserved) +{ + int i; + + alloc->last = 0; + alloc->max = num; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof(long), GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void c2_alloc_cleanup(struct c2_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *c2_array_get(struct c2_array *array, int index) +{ + int p = (index * sizeof(void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof(void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int c2_array_set(struct c2_array *array, int index, void *value) +{ + int p = (index * sizeof(void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held.
*/ + if (!array->page_list[p].page) + array->page_list[p].page = + (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof(void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void c2_array_clear(struct c2_array *array, int index) +{ + int p = (index * sizeof(void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int c2_array_init(struct c2_array *array, int nent) +{ + int npage = (nent * sizeof(void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = + kmalloc(npage * sizeof(*array->page_list), GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void c2_array_cleanup(struct c2_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof(void *) + PAGE_SIZE - 1) / PAGE_SIZE; + ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +static int c2_alloc_mqsp_chunk(gfp_t gfp_mask, struct sp_chunk **head) +{ + int i; + struct sp_chunk *new_head; + + new_head = (struct sp_chunk *) __get_free_page(gfp_mask | GFP_DMA); + if (new_head == NULL) + return -ENOMEM; + + new_head->next = NULL; + new_head->head = 0; + new_head->gfp_mask = gfp_mask; + + /* build list where each index is the next free slot */ + for (i = 0; + i < (PAGE_SIZE - sizeof(struct sp_chunk) - + sizeof(u16)) / sizeof(u16) - 1; + i++) { + new_head->shared_ptr[i] = i + 1; + } + /* terminate list */ + new_head->shared_ptr[i] = 0xFFFF; + + *head = new_head; + return 0; +} + +int c2_init_mqsp_pool(gfp_t gfp_mask, struct sp_chunk **root) +{ + return 
c2_alloc_mqsp_chunk(gfp_mask, root); +} + +void c2_free_mqsp_pool(struct sp_chunk *root) +{ + struct sp_chunk *next; + + while (root) { + next = root->next; + free_page((unsigned long) root); /* chunk was allocated with __get_free_page() */ + root = next; + } +} + +u16 *c2_alloc_mqsp(struct sp_chunk *head) +{ + u16 mqsp; + + while (head) { + mqsp = head->head; + if (mqsp != 0xFFFF) { + head->head = head->shared_ptr[mqsp]; + break; + } else if (head->next == NULL) { + if (c2_alloc_mqsp_chunk(head->gfp_mask, &head->next) == 0) { + head = head->next; + mqsp = head->head; + head->head = head->shared_ptr[mqsp]; + break; + } else + return NULL; + } else + head = head->next; + } + if (head) + return &(head->shared_ptr[mqsp]); + return NULL; +} + +void c2_free_mqsp(u16 *mqsp) +{ + struct sp_chunk *head; + u16 idx; + + /* The chunk containing this ptr begins at the page boundary */ + head = (struct sp_chunk *) ((unsigned long) mqsp & PAGE_MASK); + + /* Link head to new mqsp */ + *mqsp = head->head; + + /* Compute the shared_ptr index */ + idx = ((unsigned long) mqsp & ~PAGE_MASK) >> 1; + idx -= (unsigned long) &(((struct sp_chunk *) 0)->shared_ptr[0]) >> 1; + + /* Point this index at the head */ + head->shared_ptr[idx] = head->head; + + /* Point head at this index */ + head->head = idx; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mm.c b/drivers/infiniband/hw/amso1100/c2_mm.c new file mode 100644 index 0000000..13c8122 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mm.c @@ -0,0 +1,378 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_vq.h" + +#define PBL_VIRT 1 +#define PBL_PHYS 2 + +/* + * Send all the PBL messages to convey the remainder of the PBL + * Wait for the adapter's reply on the last one. + * This is indicated by setting the MEM_PBL_COMPLETE in the flags. + * + * NOTE: vq_req is _not_ freed by this function. The VQ Host + * Reply buffer _is_ freed by this function. + */ +static int +send_pbl_messages(struct c2_dev *c2dev, u32 stag_index, + unsigned long va, u32 pbl_depth, + struct c2_vq_req *vq_req, int pbl_type) +{ + u32 pbe_count; /* amt that fits in a PBL msg */ + u32 count; /* amt in this PBL MSG. 
*/ + struct c2wr_nsmr_pbl_req *wr; /* PBL WR ptr */ + struct c2wr_nsmr_pbl_rep *reply; /* reply ptr */ + int err, pbl_virt, pbl_index, i; + + switch (pbl_type) { + case PBL_VIRT: + pbl_virt = 1; + break; + case PBL_PHYS: + pbl_virt = 0; + break; + default: + return -EINVAL; + } + + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_pbl_req)) / sizeof(u64); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + return -ENOMEM; + } + c2_wr_set_id(wr, CCWR_NSMR_PBL); + + /* + * Only the last PBL message will generate a reply from the verbs, + * so we set the context to 0 indicating there is no kernel verbs + * handler blocked awaiting this reply. + */ + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->stag_index = stag_index; /* already swapped */ + wr->flags = 0; + pbl_index = 0; + while (pbl_depth) { + count = min(pbe_count, pbl_depth); + wr->addrs_length = cpu_to_be32(count); + + /* + * If this is the last message, then reference the + * vq request struct because we will wait for a reply. + * Also mark this PBL msg as the last one. + */ + if (count == pbl_depth) { + /* + * reference the request struct. dereferenced in the + * interrupt handler. + */ + vq_req_get(c2dev, vq_req); + wr->flags = cpu_to_be32(MEM_PBL_COMPLETE); + + /* + * This is the last PBL message. + * Set the context to our VQ Request Object so we can + * wait for the reply. + */ + wr->hdr.context = (unsigned long) vq_req; + } + + /* + * If pbl_virt is set then va is a virtual address + * that describes a virtually contiguous memory + * allocation. The wr needs the start of each virtual page + * to be converted to the corresponding physical address + * of the page. If pbl_virt is not set then va is an array + * of physical addresses and there is no conversion to do. + * Just fill in the wr with what is in the array.
+ */ + for (i = 0; i < count; i++) { + if (pbl_virt) { + /* XXX */ + //wr->paddrs[i] = + // cpu_to_be64(user_virt_to_phys(va)); + va += PAGE_SIZE; + } else { + wr->paddrs[i] = + cpu_to_be64(((u64 *)va)[pbl_index + i]); + } + } + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + if (count <= pbe_count) { + vq_req_put(c2dev, vq_req); + } + goto bail0; + } + pbl_depth -= count; + pbl_index += count; + } + + /* + * Now wait for the reply... + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_nsmr_pbl_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + kfree(wr); + return err; +} + +#define C2_PBL_MAX_DEPTH 131072 +int +c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 offset, u64 *va, enum c2_acf acf, + struct c2_mr *mr) +{ + struct c2_vq_req *vq_req; + struct c2wr_nsmr_register_req *wr; + struct c2wr_nsmr_register_rep *reply; + u16 flags; + int i, pbe_count, count; + int err; + + if (!va || !length || !addr_list || !pbl_depth) + return -EINTR; + + /* + * Verify PBL depth is within rnic max + */ + if (pbl_depth > C2_PBL_MAX_DEPTH) { + return -EINTR; + } + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + /* + * build the WR + */ + c2_wr_set_id(wr, CCWR_NSMR_REGISTER); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + + flags = (acf | MEM_VA_BASED | MEM_REMOTE); + + /* + * compute how many pbes can fit in the message + */ + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_register_req)) / sizeof(u64); + + if (pbl_depth <= pbe_count) { + flags |= 
MEM_PBL_COMPLETE; + } + wr->flags = cpu_to_be16(flags); + wr->stag_key = 0; //stag_key; + wr->va = cpu_to_be64(*va); + wr->pd_id = mr->pd->pd_id; + wr->pbe_size = cpu_to_be32(page_size); + wr->length = cpu_to_be32(length); + wr->pbl_depth = cpu_to_be32(pbl_depth); + wr->fbo = cpu_to_be32(offset); + count = min(pbl_depth, pbe_count); + wr->addrs_length = cpu_to_be32(count); + + /* + * fill out the PBL for this message + */ + for (i = 0; i < count; i++) { + wr->paddrs[i] = cpu_to_be64(addr_list[i]); + } + + /* + * reference the request struct + */ + vq_req_get(c2dev, vq_req); + + /* + * send the WR to the adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + /* + * wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail1; + } + + /* + * process reply + */ + reply = + (struct c2wr_nsmr_register_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + if ((err = c2_errno(reply))) { + goto bail2; + } + //*p_pb_entries = be32_to_cpu(reply->pbl_depth); + mr->ibmr.lkey = mr->ibmr.rkey = be32_to_cpu(reply->stag_index); + vq_repbuf_free(c2dev, reply); + + /* + * if there are still more PBEs we need to send them to + * the adapter and wait for a reply on the final one. + * reuse vq_req for this purpose.
+ */ + pbl_depth -= count; + if (pbl_depth) { + + vq_req->reply_msg = (unsigned long) NULL; + atomic_set(&vq_req->reply_ready, 0); + err = send_pbl_messages(c2dev, + cpu_to_be32(mr->ibmr.lkey), + (unsigned long) &addr_list[i], + pbl_depth, vq_req, PBL_PHYS); + if (err) { + goto bail1; + } + } + + vq_req_free(c2dev, vq_req); + kfree(wr); + + return err; + + bail2: + vq_repbuf_free(c2dev, reply); + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index) +{ + struct c2_vq_req *vq_req; /* verbs request object */ + struct c2wr_stag_dealloc_req wr; /* work request */ + struct c2wr_stag_dealloc_rep *reply; /* WR reply */ + int err; + + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_STAG_DEALLOC); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.stag_index = cpu_to_be32(stag_index); + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_stag_dealloc_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} From swise at opengridcomputing.com Wed Jun 7 13:06:49 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:49 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. 
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop>
References: <20060607200646.9259.24588.stgit@stevo-desktop>
Message-ID: <20060607200648.9259.69698.stgit@stevo-desktop>

This is the core of the driver and includes the hardware probe, low-level device interfaces and native Ethernet support.

Review Changes:
- sizeof -> sizeof()
- dprintk() -> pr_debug()
- removed useless asserts
- assert() -> BUG_ON()
- C2_DEBUG -> DEBUG
- removed debug netevent code
- removed arp request squelch code from intr handler, replacing it with setting arp_ignore when the c2 netdev is brought up.
- removed c2_set_mac_addr().

--- drivers/infiniband/hw/amso1100/c2.c | 1255 ++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2.h | 555 +++++++++++++ drivers/infiniband/hw/amso1100/c2_ae.c | 359 +++++++++ drivers/infiniband/hw/amso1100/c2_intr.c | 209 +++++ drivers/infiniband/hw/amso1100/c2_rnic.c | 631 +++++++++++++++ 5 files changed, 3009 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c new file mode 100644 index 0000000..4fdbd80 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.c @@ -0,0 +1,1255 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer.
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include "c2.h" +#include "c2_provider.h" + +MODULE_AUTHOR("Tom Tucker "); +MODULE_DESCRIPTION("Ammasso AMSO1100 Low-level iWARP Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; + +static int debug = -1; /* defaults above */ +module_param(debug, int, 0); +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); + +static int c2_up(struct net_device *netdev); +static int c2_down(struct net_device *netdev); +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev); +static void c2_tx_interrupt(struct net_device *netdev); +static void c2_rx_interrupt(struct net_device *netdev); +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs); +static void c2_tx_timeout(struct net_device *netdev); +static int c2_change_mtu(struct net_device *netdev, int new_mtu); +static void c2_reset(struct c2_port *c2_port); +static struct net_device_stats *c2_get_stats(struct net_device 
*netdev); + +static struct pci_device_id c2_pci_table[] = { + {0x18b8, 0xb001, PCI_ANY_ID, PCI_ANY_ID}, + {0} +}; + +MODULE_DEVICE_TABLE(pci, c2_pci_table); + +static void c2_print_macaddr(struct net_device *netdev) +{ + pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X, " + "IRQ %u\n", netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], + netdev->irq); +} + +static void c2_set_rxbufsize(struct c2_port *c2_port) +{ + struct net_device *netdev = c2_port->netdev; + + if (netdev->mtu > RX_BUF_SIZE) + c2_port->rx_buf_size = + netdev->mtu + ETH_HLEN + sizeof(struct c2_rxp_hdr) + + NET_IP_ALIGN; + else + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; +} + +/* + * Allocate TX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. + */ +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_txp_ring) +{ + struct c2_tx_desc *tx_desc; + struct c2_txp_desc __iomem *txp_desc; + struct c2_element *elem; + int i; + + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL); + if (!tx_ring->start) + return -ENOMEM; + + elem = tx_ring->start; + tx_desc = vaddr; + txp_desc = mmio_txp_ring; + for (i = 0; i < tx_ring->count; i++, elem++, tx_desc++, txp_desc++) { + tx_desc->len = 0; + tx_desc->status = 0; + + /* Set TXP_HTXD_UNINIT */ + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + (void __iomem *) txp_desc + C2_TXP_ADDR); + __raw_writew(0, (void __iomem *) txp_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + (void __iomem *) txp_desc + C2_TXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = tx_desc; + elem->hw_desc = txp_desc; + + if (i == tx_ring->count - 1) { + elem->next = tx_ring->start; + tx_desc->next_offset = base; + } else { + elem->next = elem + 1; + tx_desc->next_offset = + base + (i + 1) * sizeof(*tx_desc); + } + } + + 
tx_ring->to_use = tx_ring->to_clean = tx_ring->start; + + return 0; +} + +/* + * Allocate RX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. + */ +static int c2_rx_ring_alloc(struct c2_ring *rx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_rxp_ring) +{ + struct c2_rx_desc *rx_desc; + struct c2_rxp_desc __iomem *rxp_desc; + struct c2_element *elem; + int i; + + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, GFP_KERNEL); + if (!rx_ring->start) + return -ENOMEM; + + elem = rx_ring->start; + rx_desc = vaddr; + rxp_desc = mmio_rxp_ring; + for (i = 0; i < rx_ring->count; i++, elem++, rx_desc++, rxp_desc++) { + rx_desc->len = 0; + rx_desc->status = 0; + + /* Set RXP_HRXD_UNINIT */ + __raw_writew(cpu_to_be16(RXP_HRXD_OK), + (void __iomem *) rxp_desc + C2_RXP_STATUS); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_COUNT); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + (void __iomem *) rxp_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + (void __iomem *) rxp_desc + C2_RXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = rx_desc; + elem->hw_desc = rxp_desc; + + if (i == rx_ring->count - 1) { + elem->next = rx_ring->start; + rx_desc->next_offset = base; + } else { + elem->next = elem + 1; + rx_desc->next_offset = + base + (i + 1) * sizeof(*rx_desc); + } + } + + rx_ring->to_use = rx_ring->to_clean = rx_ring->start; + + return 0; +} + +/* Setup buffer for receiving */ +static inline int c2_rx_alloc(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; + struct c2_rxp_hdr *rxp_hdr; + + skb = dev_alloc_skb(c2_port->rx_buf_size); + if (unlikely(!skb)) { + pr_debug("%s: out of memory for receive\n", + c2_port->netdev->name); + return -ENOMEM; + } + + /* Zero out the rxp hdr 
in the sk_buff */ + memset(skb->data, 0, sizeof(*rxp_hdr)); + + skb->dev = c2_port->netdev; + + maplen = c2_port->rx_buf_size; + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, + PCI_DMA_FROMDEVICE); + + /* Set the sk_buff RXP_header to RXP_HRXD_READY */ + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + rxp_hdr->flags = RXP_HRXD_READY; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(cpu_to_be16((u16) maplen - sizeof(*rxp_hdr)), + elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + rx_desc->len = maplen; + + return 0; +} + +/* + * Allocate buffers for the Rx ring + * For receive: rx_ring.to_clean is next received frame + */ +static int c2_rx_fill(struct c2_port *c2_port) +{ + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + int ret = 0; + + elem = rx_ring->start; + do { + if (c2_rx_alloc(c2_port, elem)) { + ret = 1; + break; + } + } while ((elem = elem->next) != rx_ring->start); + + rx_ring->to_clean = rx_ring->start; + return ret; +} + +/* Free all buffers in RX ring, assumes receiver stopped */ +static void c2_rx_clean(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + + elem = rx_ring->start; + do { + rx_desc = elem->ht_desc; + rx_desc->len = 0; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + elem->hw_desc + C2_RXP_FLAGS); + + if (elem->skb) { + pci_unmap_single(c2dev->pcidev, elem->mapaddr, + elem->maplen, PCI_DMA_FROMDEVICE); + dev_kfree_skb(elem->skb); + elem->skb = NULL; 
+ } + } while ((elem = elem->next) != rx_ring->start); +} + +static inline int c2_tx_free(struct c2_dev *c2dev, struct c2_element *elem) +{ + struct c2_tx_desc *tx_desc = elem->ht_desc; + + tx_desc->len = 0; + + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, + PCI_DMA_TODEVICE); + + if (elem->skb) { + dev_kfree_skb_any(elem->skb); + elem->skb = NULL; + } + + return 0; +} + +/* Free all buffers in TX ring, assumes transmitter stopped */ +static void c2_tx_clean(struct c2_port *c2_port) +{ + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + int retry; + unsigned long flags; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + elem = tx_ring->start; + + do { + retry = 0; + do { + txp_htxd.flags = + readw(elem->hw_desc + C2_TXP_FLAGS); + + if (txp_htxd.flags == TXP_HTXD_READY) { + retry = 1; + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(0, + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_DONE), + elem->hw_desc + C2_TXP_FLAGS); + c2_port->netstats.tx_dropped++; + break; + } else { + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + elem->hw_desc + C2_TXP_FLAGS); + } + + c2_tx_free(c2_port->c2dev, elem); + + } while ((elem = elem->next) != tx_ring->start); + } while (retry); + + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->c2dev->cur_tx = tx_ring->to_use - tx_ring->start; + + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(c2_port->netdev); + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); +} + +/* + * Process transmit descriptors marked 'DONE' by the firmware, + * freeing up their unneeded sk_buffs. 
+ */ +static void c2_tx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + + spin_lock(&c2_port->tx_lock); + + for (elem = tx_ring->to_clean; elem != tx_ring->to_use; + elem = elem->next) { + txp_htxd.flags = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_FLAGS)); + + if (txp_htxd.flags != TXP_HTXD_DONE) + break; + + if (netif_msg_tx_done(c2_port)) { + /* PCI reads are expensive in fast path */ + txp_htxd.len = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_LEN)); + pr_debug("%s: tx done slot %3Zu status 0x%x len " + "%5u bytes\n", + netdev->name, elem - tx_ring->start, + txp_htxd.flags, txp_htxd.len); + } + + c2_tx_free(c2dev, elem); + ++(c2_port->tx_avail); + } + + tx_ring->to_clean = elem; + + if (netif_queue_stopped(netdev) + && c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(netdev); + + spin_unlock(&c2_port->tx_lock); +} + +static void c2_rx_error(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct c2_rxp_hdr *rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + + if (rxp_hdr->status != RXP_HRXD_OK || + rxp_hdr->len > (rx_desc->len - sizeof(*rxp_hdr))) { + pr_debug("BAD RXP_HRXD\n"); + pr_debug(" rx_desc : %p\n", rx_desc); + pr_debug(" index : %Zu\n", + elem - c2_port->rx_ring.start); + pr_debug(" len : %u\n", rx_desc->len); + pr_debug(" rxp_hdr : %p [PA %p]\n", rxp_hdr, + (void *) __pa((unsigned long) rxp_hdr)); + pr_debug(" flags : 0x%x\n", rxp_hdr->flags); + pr_debug(" status: 0x%x\n", rxp_hdr->status); + pr_debug(" len : %u\n", rxp_hdr->len); + pr_debug(" rsvd : 0x%x\n", rxp_hdr->rsvd); + } + + /* Setup the skb for reuse since we're dropping this pkt */ + elem->skb->tail = elem->skb->data = elem->skb->head; + + /* Zero out the rxp hdr in the sk_buff */ + memset(elem->skb->data, 0, sizeof(*rxp_hdr)); + + /* Write the 
descriptor to the adapter's rx ring */ + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(cpu_to_be16((u16) elem->maplen - sizeof(*rxp_hdr)), + elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(elem->mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + pr_debug("packet dropped\n"); + c2_port->netstats.rx_dropped++; +} + +static void c2_rx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + struct c2_rxp_hdr *rxp_hdr; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen, buflen; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + + /* Begin where we left off */ + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; + + for (elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; + elem = elem->next) { + rx_desc = elem->ht_desc; + mapaddr = elem->mapaddr; + maplen = elem->maplen; + skb = elem->skb; + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + + if (rxp_hdr->flags != RXP_HRXD_DONE) + break; + buflen = rxp_hdr->len; + + /* Sanity check the RXP header */ + if (rxp_hdr->status != RXP_HRXD_OK || + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { + c2_rx_error(c2_port, elem); + continue; + } + + /* + * Allocate and map a new skb for replenishing the host + * RX desc + */ + if (c2_rx_alloc(c2_port, elem)) { + c2_rx_error(c2_port, elem); + continue; + } + + /* Unmap the old skb */ + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, + PCI_DMA_FROMDEVICE); + + prefetch(skb->data); + + /* + * Skip past the leading 8 bytes comprising the + * "struct c2_rxp_hdr", prepended by the adapter + * to the usual Ethernet header ("struct ethhdr"), + * to the start of the raw Ethernet packet. 
+ * + * Fix up the various fields in the sk_buff before + * passing it up to netif_rx(). The transfer size + * (in bytes) specified by the adapter len field of + * the "struct rxp_hdr_t" does NOT include the + * "sizeof(struct c2_rxp_hdr)". + */ + skb->data += sizeof(*rxp_hdr); + skb->tail = skb->data + buflen; + skb->len = buflen; + skb->dev = netdev; + skb->protocol = eth_type_trans(skb, netdev); + + netif_rx(skb); + + netdev->last_rx = jiffies; + c2_port->netstats.rx_packets++; + c2_port->netstats.rx_bytes += buflen; + } + + /* Save where we left off */ + rx_ring->to_clean = elem; + c2dev->cur_rx = elem - rx_ring->start; + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + spin_unlock_irqrestore(&c2dev->lock, flags); +} + +/* + * Handle netisr0 TX & RX interrupts. + */ +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + unsigned int netisr0, dmaisr; + int handled = 0; + struct c2_dev *c2dev = (struct c2_dev *) dev_id; + + /* Process CCILNET interrupts */ + netisr0 = readl(c2dev->regs + C2_NISR0); + if (netisr0) { + + /* + * There is an issue with the firmware that always + * provides the status of RX for both TX & RX + * interrupts. So process both queues here. 
+ */ + c2_rx_interrupt(c2dev->netdev); + c2_tx_interrupt(c2dev->netdev); + + /* Clear the interrupt */ + writel(netisr0, c2dev->regs + C2_NISR0); + handled++; + } + + /* Process RNIC interrupts */ + dmaisr = readl(c2dev->regs + C2_DISR); + if (dmaisr) { + writel(dmaisr, c2dev->regs + C2_DISR); + c2_rnic_interrupt(c2dev); + handled++; + } + + if (handled) { + return IRQ_HANDLED; + } else { + return IRQ_NONE; + } +} + +static int c2_up(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_element *elem; + struct c2_rxp_hdr *rxp_hdr; + struct in_device *in_dev; + size_t rx_size, tx_size; + int ret, i; + unsigned int netimr0; + + if (netif_msg_ifup(c2_port)) + pr_debug("%s: enabling interface\n", netdev->name); + + /* Set the Rx buffer size based on MTU */ + c2_set_rxbufsize(c2_port); + + /* Allocate DMA'able memory for Tx/Rx host descriptor rings */ + rx_size = c2_port->rx_ring.count * sizeof(struct c2_rx_desc); + tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc); + + c2_port->mem_size = tx_size + rx_size; + c2_port->mem = pci_alloc_consistent(c2dev->pcidev, c2_port->mem_size, + &c2_port->dma); + if (c2_port->mem == NULL) { + pr_debug("Unable to allocate memory for " + "host descriptor rings\n"); + return -ENOMEM; + } + + memset(c2_port->mem, 0, c2_port->mem_size); + + /* Create the Rx host descriptor ring */ + if ((ret = + c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, c2_port->dma, + c2dev->mmio_rxp_ring))) { + pr_debug("Unable to create RX ring\n"); + goto bail0; + } + + /* Allocate Rx buffers for the host descriptor ring */ + if (c2_rx_fill(c2_port)) { + pr_debug("Unable to fill RX ring\n"); + goto bail1; + } + + /* Create the Tx host descriptor ring */ + if ((ret = c2_tx_ring_alloc(&c2_port->tx_ring, c2_port->mem + rx_size, + c2_port->dma + rx_size, + c2dev->mmio_txp_ring))) { + pr_debug("Unable to create TX ring\n"); + goto bail1; + } + + /* Set the TX pointer to where we 
left off */ + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->tx_ring.to_use = c2_port->tx_ring.to_clean = + c2_port->tx_ring.start + c2dev->cur_tx; + + /* missing: Initialize MAC */ + + BUG_ON(c2_port->tx_ring.to_use != c2_port->tx_ring.to_clean); + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* Reset the READY bit in the sk_buff RXP headers & adapter HRXDQ */ + for (i = 0, elem = c2_port->rx_ring.start; i < c2_port->rx_ring.count; + i++, elem++) { + rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + rxp_hdr->flags = 0; + __raw_writew(cpu_to_be16(RXP_HRXD_READY), + elem->hw_desc + C2_RXP_FLAGS); + } + + /* Enable network packets */ + netif_start_queue(netdev); + + /* Enable IRQ */ + writel(0, c2dev->regs + C2_IDIS); + netimr0 = readl(c2dev->regs + C2_NIMR0); + netimr0 &= ~(C2_PCI_HTX_INT | C2_PCI_HRX_INT); + writel(netimr0, c2dev->regs + C2_NIMR0); + + /* Tell the stack to ignore arp requests for ipaddrs bound to + * other interfaces. This is needed to prevent the host stack + * from responding to arp requests to the ipaddr bound on the + * rdma interface. 
+ */ + in_dev = in_dev_get(netdev); + in_dev->cnf.arp_ignore = 1; + in_dev_put(in_dev); + + return 0; + + bail1: + c2_rx_clean(c2_port); + kfree(c2_port->rx_ring.start); + + bail0: + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return ret; +} + +static int c2_down(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + + if (netif_msg_ifdown(c2_port)) + pr_debug("%s: disabling interface\n", + netdev->name); + + /* Wait for all the queued packets to get sent */ + c2_tx_interrupt(netdev); + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Disable IRQs by clearing the interrupt mask */ + writel(1, c2dev->regs + C2_IDIS); + writel(0, c2dev->regs + C2_NIMR0); + + /* missing: Stop transmitter */ + + /* missing: Stop receiver */ + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* missing: Turn off LEDs here */ + + /* Free all buffers in the host descriptor rings */ + c2_tx_clean(c2_port); + c2_rx_clean(c2_port); + + /* Free the host descriptor rings */ + kfree(c2_port->rx_ring.start); + kfree(c2_port->tx_ring.start); + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return 0; +} + +static void c2_reset(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + unsigned int cur_rx = c2dev->cur_rx; + + /* Tell the hardware to quiesce */ + C2_SET_CUR_RX(c2dev, cur_rx | C2_PCI_HRX_QUI); + + /* + * The hardware will reset the C2_PCI_HRX_QUI bit once + * the RXP is quiesced. Wait 2 seconds for this. 
+ */ + ssleep(2); + + cur_rx = C2_GET_CUR_RX(c2dev); + + if (cur_rx & C2_PCI_HRX_QUI) + pr_debug("c2_reset: failed to quiesce the hardware!\n"); + + cur_rx &= ~C2_PCI_HRX_QUI; + + c2dev->cur_rx = cur_rx; + + pr_debug("Current RX: %u\n", c2dev->cur_rx); +} + +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + dma_addr_t mapaddr; + u32 maplen; + unsigned long flags; + unsigned int i; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + if (unlikely(c2_port->tx_avail < (skb_shinfo(skb)->nr_frags + 1))) { + netif_stop_queue(netdev); + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + pr_debug("%s: Tx ring full when queue awake!\n", + netdev->name); + return NETDEV_TX_BUSY; + } + + maplen = skb_headlen(skb); + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_TODEVICE); + + elem = tx_ring->to_use; + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + + /* Loop thru additional data fragments and queue them */ + if (skb_shinfo(skb)->nr_frags) { + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + maplen = frag->size; + mapaddr = + pci_map_page(c2dev->pcidev, frag->page, + frag->page_offset, maplen, + PCI_DMA_TODEVICE); + + elem = elem->next; + elem->skb = NULL; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), + elem->hw_desc + C2_TXP_LEN); + 
__raw_writew(cpu_to_be16(TXP_HTXD_READY), + elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + } + } + + tx_ring->to_use = elem->next; + c2_port->tx_avail -= (skb_shinfo(skb)->nr_frags + 1); + + if (c2_port->tx_avail <= MAX_SKB_FRAGS + 1) { + netif_stop_queue(netdev); + if (netif_msg_tx_queued(c2_port)) + pr_debug("%s: transmit queue full\n", + netdev->name); + } + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + netdev->trans_start = jiffies; + + return NETDEV_TX_OK; +} + +static struct net_device_stats *c2_get_stats(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + return &c2_port->netstats; +} + +static void c2_tx_timeout(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + if (netif_msg_timer(c2_port)) + pr_debug("%s: tx timeout\n", netdev->name); + + c2_tx_clean(c2_port); +} + +static int c2_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + if (netif_running(netdev)) { + c2_down(netdev); + + c2_up(netdev); + } + + return ret; +} + +/* Initialize network device */ +static struct net_device *c2_devinit(struct c2_dev *c2dev, + void __iomem * mmio_addr) +{ + struct c2_port *c2_port = NULL; + struct net_device *netdev = alloc_etherdev(sizeof(*c2_port)); + + if (!netdev) { + pr_debug("c2_port etherdev alloc failed"); + return NULL; + } + + SET_MODULE_OWNER(netdev); + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); + + netdev->open = c2_up; + netdev->stop = c2_down; + netdev->hard_start_xmit = c2_xmit_frame; + netdev->get_stats = c2_get_stats; + netdev->tx_timeout = c2_tx_timeout; + netdev->change_mtu = c2_change_mtu; + netdev->watchdog_timeo = C2_TX_TIMEOUT; + netdev->irq = c2dev->pcidev->irq; + + c2_port = netdev_priv(netdev); + c2_port->netdev = netdev; + c2_port->c2dev = c2dev; + c2_port->msg_enable = 
netif_msg_init(debug, default_msg); + c2_port->tx_ring.count = C2_NUM_TX_DESC; + c2_port->rx_ring.count = C2_NUM_RX_DESC; + + spin_lock_init(&c2_port->tx_lock); + + /* Copy our 48-bit ethernet hardware address */ + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); + + /* Validate the MAC address */ + if (!is_valid_ether_addr(netdev->dev_addr)) { + pr_debug("Invalid MAC Address\n"); + c2_print_macaddr(netdev); + free_netdev(netdev); + return NULL; + } + + c2dev->netdev = netdev; + + return netdev; +} + +static int __devinit c2_probe(struct pci_dev *pcidev, + const struct pci_device_id *ent) +{ + int ret = 0, i; + unsigned long reg0_start, reg0_flags, reg0_len; + unsigned long reg2_start, reg2_flags, reg2_len; + unsigned long reg4_start, reg4_flags, reg4_len; + unsigned kva_map_size; + struct net_device *netdev = NULL; + struct c2_dev *c2dev = NULL; + void __iomem *mmio_regs = NULL; + + printk(KERN_INFO PFX "AMSO1100 Gigabit Ethernet driver v%s loaded\n", + DRV_VERSION); + + /* Enable PCI device */ + ret = pci_enable_device(pcidev); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to enable PCI device\n", + pci_name(pcidev)); + goto bail0; + } + + reg0_start = pci_resource_start(pcidev, BAR_0); + reg0_len = pci_resource_len(pcidev, BAR_0); + reg0_flags = pci_resource_flags(pcidev, BAR_0); + + reg2_start = pci_resource_start(pcidev, BAR_2); + reg2_len = pci_resource_len(pcidev, BAR_2); + reg2_flags = pci_resource_flags(pcidev, BAR_2); + + reg4_start = pci_resource_start(pcidev, BAR_4); + reg4_len = pci_resource_len(pcidev, BAR_4); + reg4_flags = pci_resource_flags(pcidev, BAR_4); + + pr_debug("BAR0 size = 0x%lX bytes\n", reg0_len); + pr_debug("BAR2 size = 0x%lX bytes\n", reg2_len); + pr_debug("BAR4 size = 0x%lX bytes\n", reg4_len); + + /* Make sure PCI base addr are MMIO */ + if (!(reg0_flags & IORESOURCE_MEM) || + !(reg2_flags & IORESOURCE_MEM) || !(reg4_flags & IORESOURCE_MEM)) { + printk(KERN_ERR PFX "PCI regions not an MMIO resource\n"); + ret = 
-ENODEV; + goto bail1; + } + + /* Check for weird/broken PCI region reporting */ + if ((reg0_len < C2_REG0_SIZE) || + (reg2_len < C2_REG2_SIZE) || (reg4_len < C2_REG4_SIZE)) { + printk(KERN_ERR PFX "Invalid PCI region sizes\n"); + ret = -ENODEV; + goto bail1; + } + + /* Reserve PCI I/O and memory resources */ + ret = pci_request_regions(pcidev, DRV_NAME); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to request regions\n", + pci_name(pcidev)); + goto bail1; + } + + if ((sizeof(dma_addr_t) > 4)) { + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "64b DMA configuration failed\n"); + goto bail2; + } + } else { + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "32b DMA configuration failed\n"); + goto bail2; + } + } + + /* Enables bus-mastering on the device */ + pci_set_master(pcidev); + + /* Remap the adapter PCI registers in BAR4 */ + mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + sizeof(struct c2_adapter_pci_regs)); + if (mmio_regs == 0UL) { + printk(KERN_ERR PFX + "Unable to remap adapter PCI registers in BAR4\n"); + ret = -EIO; + goto bail2; + } + + /* Validate PCI regs magic */ + for (i = 0; i < sizeof(c2_magic); i++) { + if (c2_magic[i] != readb(mmio_regs + C2_REGS_MAGIC + i)) { + printk(KERN_ERR PFX "Downlevel Firmware boot loader " + "[%d/%Zd: got 0x%x, exp 0x%x]. 
Use the cc_flash " + "utility to update your boot loader\n", + i + 1, sizeof(c2_magic), + readb(mmio_regs + C2_REGS_MAGIC + i), + c2_magic[i]); + printk(KERN_ERR PFX "Adapter not claimed\n"); + iounmap(mmio_regs); + ret = -EIO; + goto bail2; + } + } + + /* Validate the adapter version */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)) != C2_VERSION) { + printk(KERN_ERR PFX "Version mismatch " + "[fw=%u, c2=%u], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)), + C2_VERSION); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Validate the adapter IVN */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)) != C2_IVN) { + printk(KERN_ERR PFX "Downlevel Firmware level. You should be using " + "the OpenIB device support kit. " + "[fw=0x%x, c2=0x%x], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)), + C2_IVN); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Allocate hardware structure */ + c2dev = (struct c2_dev *) ib_alloc_device(sizeof(*c2dev)); + if (!c2dev) { + printk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", + pci_name(pcidev)); + ret = -ENOMEM; + iounmap(mmio_regs); + goto bail2; + } + + memset(c2dev, 0, sizeof(*c2dev)); + spin_lock_init(&c2dev->lock); + c2dev->pcidev = pcidev; + c2dev->cur_tx = 0; + + /* Get the last RX index */ + c2dev->cur_rx = + (be32_to_cpu(readl(mmio_regs + C2_REGS_HRX_CUR)) - + 0xffffc000) / sizeof(struct c2_rxp_desc); + + /* Request an interrupt line for the driver */ + ret = request_irq(pcidev->irq, c2_interrupt, SA_SHIRQ, DRV_NAME, c2dev); + if (ret) { + printk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", + pci_name(pcidev), pcidev->irq); + iounmap(mmio_regs); + goto bail3; + } + + /* Set driver specific data */ + pci_set_drvdata(pcidev, c2dev); + + /* Initialize network device */ + if ((netdev = c2_devinit(c2dev, mmio_regs)) == NULL) { + ret = -ENOMEM; + iounmap(mmio_regs); + goto bail4; + } + + /* Save off the actual size prior to unmapping mmio_regs */ + kva_map_size 
= be32_to_cpu(readl(mmio_regs + C2_REGS_PCI_WINSIZE)); + + /* Unmap the adapter PCI registers in BAR4 */ + iounmap(mmio_regs); + + /* Register network device */ + ret = register_netdev(netdev); + if (ret) { + printk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", + ret); + goto bail5; + } + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Remap the adapter HRXDQ PA space to kernel VA space */ + c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + C2_RXP_HRXDQ_OFFSET, + C2_RXP_HRXDQ_SIZE); + if (c2dev->mmio_rxp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); + ret = -EIO; + goto bail6; + } + + /* Remap the adapter HTXDQ PA space to kernel VA space */ + c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + C2_TXP_HTXDQ_OFFSET, + C2_TXP_HTXDQ_SIZE); + if (c2dev->mmio_txp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); + ret = -EIO; + goto bail7; + } + + /* Save off the current RX index in the last 4 bytes of the TXP Ring */ + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ + c2dev->regs = ioremap_nocache(reg0_start, reg0_len); + if (c2dev->regs == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR0\n"); + ret = -EIO; + goto bail8; + } + + /* Remap the PCI registers in adapter BAR4 to kernel VA space */ + c2dev->pa = reg4_start + C2_PCI_REGS_OFFSET; + c2dev->kva = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + kva_map_size); + if (c2dev->kva == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR4\n"); + ret = -EIO; + goto bail9; + } + + /* Print out the MAC address */ + c2_print_macaddr(netdev); + + ret = c2_rnic_init(c2dev); + if (ret) { + printk(KERN_ERR PFX "c2_rnic_init failed: %d\n", ret); + goto bail10; + } + + c2_register_device(c2dev); + + return 0; + + bail10: + iounmap(c2dev->kva); + + bail9: + iounmap(c2dev->regs); + + bail8: + iounmap(c2dev->mmio_txp_ring); + + bail7: + iounmap(c2dev->mmio_rxp_ring); + + bail6: + 
unregister_netdev(netdev); + + bail5: + free_netdev(netdev); + + bail4: + free_irq(pcidev->irq, c2dev); + + bail3: + ib_dealloc_device(&c2dev->ibdev); + + bail2: + pci_release_regions(pcidev); + + bail1: + pci_disable_device(pcidev); + + bail0: + return ret; +} + +static void __devexit c2_remove(struct pci_dev *pcidev) +{ + struct c2_dev *c2dev = pci_get_drvdata(pcidev); + struct net_device *netdev = c2dev->netdev; + + /* Unregister with OpenIB */ + c2_unregister_device(c2dev); + + /* Clean up the RNIC resources */ + c2_rnic_term(c2dev); + + /* Remove network device from the kernel */ + unregister_netdev(netdev); + + /* Free network device */ + free_netdev(netdev); + + /* Free the interrupt line */ + free_irq(pcidev->irq, c2dev); + + /* missing: Turn LEDs off here */ + + /* Unmap adapter PA space */ + iounmap(c2dev->kva); + iounmap(c2dev->regs); + iounmap(c2dev->mmio_txp_ring); + iounmap(c2dev->mmio_rxp_ring); + + /* Free the hardware structure */ + ib_dealloc_device(&c2dev->ibdev); + + /* Release reserved PCI I/O and memory resources */ + pci_release_regions(pcidev); + + /* Disable PCI device */ + pci_disable_device(pcidev); + + /* Clear driver specific data */ + pci_set_drvdata(pcidev, NULL); +} + +static struct pci_driver c2_pci_driver = { + .name = DRV_NAME, + .id_table = c2_pci_table, + .probe = c2_probe, + .remove = __devexit_p(c2_remove), +}; + +static int __init c2_init_module(void) +{ + return pci_module_init(&c2_pci_driver); +} + +static void __exit c2_exit_module(void) +{ + pci_unregister_driver(&c2_pci_driver); +} + +module_init(c2_init_module); +module_exit(c2_exit_module); diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h new file mode 100644 index 0000000..3251e8f --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -0,0 +1,555 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef __C2_H +#define __C2_H + +#include +#include +#include +#include +#include +#include + +#include "c2_provider.h" +#include "c2_mq.h" +#include "c2_status.h" + +#define DRV_NAME "c2" +#define DRV_VERSION "1.1" +#define PFX DRV_NAME ": " + +#define BAR_0 0 +#define BAR_2 2 +#define BAR_4 4 + +#define RX_BUF_SIZE (1536 + 8) +#define ETH_JUMBO_MTU 9000 +#define C2_MAGIC "CEPHEUS" +#define C2_VERSION 4 +#define C2_IVN (18 & 0x7fffffff) + +#define C2_REG0_SIZE (16 * 1024) +#define C2_REG2_SIZE (2 * 1024 * 1024) +#define C2_REG4_SIZE (256 * 1024 * 1024) +#define C2_NUM_TX_DESC 341 +#define C2_NUM_RX_DESC 256 +#define C2_PCI_REGS_OFFSET (0x10000) +#define C2_RXP_HRXDQ_OFFSET (((C2_REG4_SIZE)/2)) +#define C2_RXP_HRXDQ_SIZE (4096) +#define C2_TXP_HTXDQ_OFFSET (((C2_REG4_SIZE)/2) + C2_RXP_HRXDQ_SIZE) +#define C2_TXP_HTXDQ_SIZE (4096) +#define C2_TX_TIMEOUT (6*HZ) + +/* CEPHEUS */ +static const u8 c2_magic[] = { + 0x43, 0x45, 0x50, 0x48, 0x45, 0x55, 0x53 +}; + +enum adapter_pci_regs { + C2_REGS_MAGIC = 0x0000, + C2_REGS_VERS = 0x0008, + C2_REGS_IVN = 0x000C, + C2_REGS_PCI_WINSIZE = 0x0010, + C2_REGS_Q0_QSIZE = 0x0014, + C2_REGS_Q0_MSGSIZE = 0x0018, + C2_REGS_Q0_POOLSTART = 0x001C, + C2_REGS_Q0_SHARED = 0x0020, + C2_REGS_Q1_QSIZE = 0x0024, + C2_REGS_Q1_MSGSIZE = 0x0028, + C2_REGS_Q1_SHARED = 0x0030, + C2_REGS_Q2_QSIZE = 0x0034, + C2_REGS_Q2_MSGSIZE = 0x0038, + C2_REGS_Q2_SHARED = 0x0040, + C2_REGS_ENADDR = 0x004C, + C2_REGS_RDMA_ENADDR = 0x0054, + C2_REGS_HRX_CUR = 0x006C, +}; + +struct c2_adapter_pci_regs { + char reg_magic[8]; + u32 version; + u32 ivn; + u32 pci_window_size; + u32 q0_q_size; + u32 q0_msg_size; + u32 q0_pool_start; + u32 q0_shared; + u32 q1_q_size; + u32 q1_msg_size; + u32 q1_pool_start; + u32 q1_shared; + u32 q2_q_size; + u32 q2_msg_size; + u32 q2_pool_start; + u32 q2_shared; + u32 log_start; + u32 log_size; + u8 host_enaddr[8]; + u8 rdma_enaddr[8]; + u32 crash_entry; + u32 crash_ready[2]; + u32 fw_txd_cur; + u32 fw_hrxd_cur; + u32 fw_rxd_cur; 
+}; + +enum pci_regs { + C2_HISR = 0x0000, + C2_DISR = 0x0004, + C2_HIMR = 0x0008, + C2_DIMR = 0x000C, + C2_NISR0 = 0x0010, + C2_NISR1 = 0x0014, + C2_NIMR0 = 0x0018, + C2_NIMR1 = 0x001C, + C2_IDIS = 0x0020, +}; + +enum { + C2_PCI_HRX_INT = 1 << 8, + C2_PCI_HTX_INT = 1 << 17, + C2_PCI_HRX_QUI = 1 << 31, +}; + +/* + * Cepheus registers in BAR0. + */ +struct c2_pci_regs { + u32 hostisr; + u32 dmaisr; + u32 hostimr; + u32 dmaimr; + u32 netisr0; + u32 netisr1; + u32 netimr0; + u32 netimr1; + u32 int_disable; +}; + +/* TXP flags */ +enum c2_txp_flags { + TXP_HTXD_DONE = 0, + TXP_HTXD_READY = 1 << 0, + TXP_HTXD_UNINIT = 1 << 1, +}; + +/* RXP flags */ +enum c2_rxp_flags { + RXP_HRXD_UNINIT = 0, + RXP_HRXD_READY = 1 << 0, + RXP_HRXD_DONE = 1 << 1, +}; + +/* RXP status */ +enum c2_rxp_status { + RXP_HRXD_ZERO = 0, + RXP_HRXD_OK = 1 << 0, + RXP_HRXD_BUF_OV = 1 << 1, +}; + +/* TXP descriptor fields */ +enum txp_desc { + C2_TXP_FLAGS = 0x0000, + C2_TXP_LEN = 0x0002, + C2_TXP_ADDR = 0x0004, +}; + +/* RXP descriptor fields */ +enum rxp_desc { + C2_RXP_FLAGS = 0x0000, + C2_RXP_STATUS = 0x0002, + C2_RXP_COUNT = 0x0004, + C2_RXP_LEN = 0x0006, + C2_RXP_ADDR = 0x0008, +}; + +struct c2_txp_desc { + u16 flags; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_desc { + u16 flags; + u16 status; + u16 count; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_hdr { + u16 flags; + u16 status; + u16 len; + u16 rsvd; +} __attribute__ ((packed)); + +struct c2_tx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_rx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_alloc { + u32 last; + u32 max; + spinlock_t lock; + unsigned long *table; +}; + +struct c2_array { + struct { + void **page; + int used; + } *page_list; +}; + +/* + * The MQ shared pointer pool is organized as a linked list of + * chunks. Each chunk contains a linked list of free shared pointers + * that can be allocated to a given user mode client. 
+ * + */ +struct sp_chunk { + struct sp_chunk *next; + gfp_t gfp_mask; + u16 head; + u16 shared_ptr[0]; +}; + +struct c2_pd_table { + struct c2_alloc alloc; + struct c2_array pd; +}; + +struct c2_qp_table { + struct c2_alloc alloc; + spinlock_t lock; + struct c2_array qp; + struct c2_qp** map; +}; + +struct c2_element { + struct c2_element *next; + void *ht_desc; /* host descriptor */ + void __iomem *hw_desc; /* hardware descriptor */ + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; +}; + +struct c2_ring { + struct c2_element *to_clean; + struct c2_element *to_use; + struct c2_element *start; + unsigned long count; +}; + +struct c2_dev { + struct ib_device ibdev; + void __iomem *regs; + void __iomem *mmio_txp_ring; /* remapped adapter memory for hw rings */ + void __iomem *mmio_rxp_ring; + spinlock_t lock; + struct pci_dev *pcidev; + struct net_device *netdev; + struct net_device *pseudo_netdev; + unsigned int cur_tx; + unsigned int cur_rx; + u32 adapter_handle; + int device_cap_flags; + void __iomem *kva; /* KVA device memory */ + unsigned long pa; /* PA device memory */ + void **qptr_array; + + kmem_cache_t *host_msg_cache; + + struct list_head cca_link; /* adapter list */ + struct list_head eh_wakeup_list; /* event wakeup list */ + wait_queue_head_t req_vq_wo; + + /* Cached RNIC properties */ + struct ib_device_attr props; + + struct c2_pd_table pd_table; + struct c2_qp_table qp_table; + int ports; /* num of GigE ports */ + int devnum; + spinlock_t vqlock; /* sync vbs req MQ */ + + /* Verbs Queues */ + struct c2_mq req_vq; /* Verbs Request MQ */ + struct c2_mq rep_vq; /* Verbs Reply MQ */ + struct c2_mq aeq; /* Async Events MQ */ + + /* Kernel client MQs */ + struct sp_chunk *kern_mqsp_pool; + + /* Device updates these values when posting messages to a host + * target queue */ + u16 req_vq_shared; + u16 rep_vq_shared; + u16 aeq_shared; + u16 irq_claimed; + + /* + * Shared host target pages for user-accessible MQs. 
+ */ + int hthead; /* index of first free entry */ + void *htpages; /* kernel vaddr */ + int htlen; /* length of htpages memory */ + void *htuva; /* user mapped vaddr */ + spinlock_t htlock; /* serialize allocation */ + + u64 adapter_hint_uva; /* access to the activity FIFO */ + + // spinlock_t aeq_lock; + // spinlock_t rnic_lock; + + u16 hint_count; + u16 hints_read; + + int init; /* TRUE if it's ready */ + char ae_cache_name[16]; + char vq_cache_name[16]; +}; + +struct c2_port { + u32 msg_enable; + struct c2_dev *c2dev; + struct net_device *netdev; + + spinlock_t tx_lock; + u32 tx_avail; + struct c2_ring tx_ring; + struct c2_ring rx_ring; + + void *mem; /* PCI memory for host rings */ + dma_addr_t dma; + unsigned long mem_size; + + u32 rx_buf_size; + + struct net_device_stats netstats; +}; + +/* + * Activity FIFO registers in BAR0. + */ +#define PCI_BAR0_HOST_HINT 0x100 +#define PCI_BAR0_ADAPTER_HINT 0x2000 + +/* + * Ammasso PCI vendor id and Cepheus PCI device id. + */ +#define CQ_ARMED 0x01 +#define CQ_WAIT_FOR_DMA 0x80 + +/* + * The format of a hint is as follows: + * Lower 16 bits are the count of hints for the queue. + * Next 15 bits are the qp_index + * Upper most bit depends on who reads it: + * If read by producer, then it means Full (1) or Not-Full (0) + * If read by consumer, then it means Empty (1) or Not-Empty (0) + */ +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + + +/* + * The following defines the offset in SDRAM for the c2_adapter_pci_regs_t + * struct. 
+ */ +#define C2_ADAPTER_PCI_REGS_OFFSET 0x10000 + +#ifndef readq +static inline u64 readq(const void __iomem * addr) +{ + u64 ret = readl(addr + 4); + ret <<= 32; + ret |= readl(addr); + + return ret; +} +#endif + +#ifndef __raw_writeq +static inline void __raw_writeq(u64 val, void __iomem * addr) +{ + __raw_writel((u32) (val), addr); + __raw_writel((u32) (val >> 32), (addr + 4)); +} +#endif + +#define C2_SET_CUR_RX(c2dev, cur_rx) \ + __raw_writel(cpu_to_be32(cur_rx), c2dev->mmio_txp_ring + 4092) + +#define C2_GET_CUR_RX(c2dev) \ + be32_to_cpu(readl(c2dev->mmio_txp_ring + 4092)) + +static inline struct c2_dev *to_c2dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct c2_dev, ibdev); +} + +static inline int c2_errno(void *reply) +{ + switch (c2_wr_get_result(reply)) { + case C2_OK: + return 0; + case CCERR_NO_BUFS: + case CCERR_INSUFFICIENT_RESOURCES: + case CCERR_ZERO_RDMA_READ_RESOURCES: + return -ENOMEM; + case CCERR_MR_IN_USE: + case CCERR_QP_IN_USE: + return -EBUSY; + case CCERR_ADDR_IN_USE: + return -EADDRINUSE; + case CCERR_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + case CCERR_CONN_RESET: + return -ECONNRESET; + case CCERR_NOT_IMPLEMENTED: + case CCERR_INVALID_WQE: + return -ENOSYS; + case CCERR_QP_NOT_PRIVILEGED: + return -EPERM; + case CCERR_STACK_ERROR: + return -EPROTO; + case CCERR_ACCESS_VIOLATION: + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return -EFAULT; + case CCERR_STAG_STATE_NOT_INVALID: + case CCERR_INVALID_ADDRESS: + case CCERR_INVALID_CQ: + case CCERR_INVALID_EP: + case CCERR_INVALID_MODIFIER: + case CCERR_INVALID_MTU: + case CCERR_INVALID_PD_ID: + case CCERR_INVALID_QP: + case CCERR_INVALID_RNIC: + case CCERR_INVALID_STAG: + return -EINVAL; + default: + return -EAGAIN; + } +} + +/* Device */ +extern int c2_register_device(struct c2_dev *c2dev); +extern void c2_unregister_device(struct c2_dev *c2dev); +extern int c2_rnic_init(struct c2_dev *c2dev); +extern void c2_rnic_term(struct c2_dev *c2dev); +extern void 
c2_rnic_interrupt(struct c2_dev *c2dev); +extern int c2_rnic_query(struct c2_dev *c2dev, struct ib_device_attr *props); +extern int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); +extern int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); + +/* QPs */ +extern int c2_alloc_qp(struct c2_dev *c2dev, struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp); +extern void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp); +extern struct ib_qp *c2_get_qp(struct ib_device *device, int qpn); +extern int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, int attr_mask); +extern int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp, + int ord, int ird); +extern int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr); +extern int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr); +extern int __devinit c2_init_qp_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev); +extern void c2_set_qp_state(struct c2_qp *, int); + +/* PDs */ +extern int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd); +extern void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd); +extern int __devinit c2_init_pd_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev); + +/* CQs */ +extern int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq); +extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); +extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); +extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); +extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); + +/* CM */ +extern int c2_llp_connect(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param); +extern int 
c2_llp_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param); +extern int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, + u8 pdata_len); +extern int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog); +extern int c2_llp_service_destroy(struct iw_cm_id *cm_id); + +/* MM */ +extern int c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 off, u64 *va, enum c2_acf acf, + struct c2_mr *mr); +extern int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index); + +/* AE */ +extern void c2_ae_event(struct c2_dev *c2dev, u32 mq_index); + +/* Allocators */ +extern u32 c2_alloc(struct c2_alloc *alloc); +extern void c2_free(struct c2_alloc *alloc, u32 obj); +extern int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 reserved); +extern void c2_alloc_cleanup(struct c2_alloc *alloc); +extern int c2_init_mqsp_pool(gfp_t gfp_mask, struct sp_chunk **root); +extern void c2_free_mqsp_pool(struct sp_chunk *root); +extern u16 *c2_alloc_mqsp(struct sp_chunk *head); +extern void c2_free_mqsp(u16 * mqsp); +extern void c2_array_cleanup(struct c2_array *array, int nent); +extern int c2_array_init(struct c2_array *array, int nent); +extern void c2_array_clear(struct c2_array *array, int index); +extern int c2_array_set(struct c2_array *array, int index, void *value); +extern void *c2_array_get(struct c2_array *array, int index); + +#endif diff --git a/drivers/infiniband/hw/amso1100/c2_ae.c b/drivers/infiniband/hw/amso1100/c2_ae.c new file mode 100644 index 0000000..c979ef6 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.c @@ -0,0 +1,359 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include "c2.h" +#include +#include "c2_status.h" +#include "c2_ae.h" + +static int c2_convert_cm_status(u32 c2_status) +{ + switch (c2_status) { + case C2_CONN_STATUS_SUCCESS: + return 0; + case C2_CONN_STATUS_REJECTED: + return -ENETRESET; + case C2_CONN_STATUS_REFUSED: + return -ECONNREFUSED; + case C2_CONN_STATUS_TIMEDOUT: + return -ETIMEDOUT; + case C2_CONN_STATUS_NETUNREACH: + return -ENETUNREACH; + case C2_CONN_STATUS_HOSTUNREACH: + return -EHOSTUNREACH; + case C2_CONN_STATUS_INVALID_RNIC: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP_STATE: + return -EINVAL; + case C2_CONN_STATUS_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + default: + printk(KERN_ERR PFX + "%s - Unable to convert CM status: %d\n", + __FUNCTION__, c2_status); + return -EIO; + } +} + +#ifdef DEBUG +static const char* to_event_str(int event) +{ + static const char* event_str[] = { + "CCAE_REMOTE_SHUTDOWN", + "CCAE_ACTIVE_CONNECT_RESULTS", + "CCAE_CONNECTION_REQUEST", + "CCAE_LLP_CLOSE_COMPLETE", + "CCAE_TERMINATE_MESSAGE_RECEIVED", + "CCAE_LLP_CONNECTION_RESET", + "CCAE_LLP_CONNECTION_LOST", + "CCAE_LLP_SEGMENT_SIZE_INVALID", + "CCAE_LLP_INVALID_CRC", + "CCAE_LLP_BAD_FPDU", + "CCAE_INVALID_DDP_VERSION", + "CCAE_INVALID_RDMA_VERSION", + "CCAE_UNEXPECTED_OPCODE", + "CCAE_INVALID_DDP_QUEUE_NUMBER", + "CCAE_RDMA_READ_NOT_ENABLED", + "CCAE_RDMA_WRITE_NOT_ENABLED", + "CCAE_RDMA_READ_TOO_SMALL", + "CCAE_NO_L_BIT", + "CCAE_TAGGED_INVALID_STAG", + "CCAE_TAGGED_BASE_BOUNDS_VIOLATION", + "CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION", + "CCAE_TAGGED_INVALID_PD", + "CCAE_WRAP_ERROR", + "CCAE_BAD_CLOSE", + "CCAE_BAD_LLP_CLOSE", + "CCAE_INVALID_MSN_RANGE", + "CCAE_INVALID_MSN_GAP", + "CCAE_IRRQ_OVERFLOW", + "CCAE_IRRQ_MSN_GAP", + "CCAE_IRRQ_MSN_RANGE", + "CCAE_IRRQ_INVALID_STAG", + "CCAE_IRRQ_BASE_BOUNDS_VIOLATION", + "CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION", + "CCAE_IRRQ_INVALID_PD", + "CCAE_IRRQ_WRAP_ERROR", + "CCAE_CQ_SQ_COMPLETION_OVERFLOW", + 
"CCAE_CQ_RQ_COMPLETION_ERROR", + "CCAE_QP_SRQ_WQE_ERROR", + "CCAE_QP_LOCAL_CATASTROPHIC_ERROR", + "CCAE_CQ_OVERFLOW", + "CCAE_CQ_OPERATION_ERROR", + "CCAE_SRQ_LIMIT_REACHED", + "CCAE_QP_RQ_LIMIT_REACHED", + "CCAE_SRQ_CATASTROPHIC_ERROR", + "CCAE_RNIC_CATASTROPHIC_ERROR" + }; + + if (event < CCAE_REMOTE_SHUTDOWN || + event > CCAE_RNIC_CATASTROPHIC_ERROR) + return ""; + + event -= CCAE_REMOTE_SHUTDOWN; + return event_str[event]; +} + +const char *to_qp_state_str(int state) +{ + switch (state) { + case C2_QP_STATE_IDLE: + return "C2_QP_STATE_IDLE"; + case C2_QP_STATE_CONNECTING: + return "C2_QP_STATE_CONNECTING"; + case C2_QP_STATE_RTS: + return "C2_QP_STATE_RTS"; + case C2_QP_STATE_CLOSING: + return "C2_QP_STATE_CLOSING"; + case C2_QP_STATE_TERMINATE: + return "C2_QP_STATE_TERMINATE"; + case C2_QP_STATE_ERROR: + return "C2_QP_STATE_ERROR"; + default: + return ""; + }; +} +#endif + +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_mq *mq = c2dev->qptr_array[mq_index]; + union c2wr *wr; + void *resource_user_context; + struct iw_cm_event cm_event; + struct ib_event ib_event; + enum c2_resource_indicator resource_indicator; + enum c2_event_id event_id; + unsigned long flags; + u8 *pdata = NULL; + int status; + + /* + * retreive the message + */ + wr = c2_mq_consume(mq); + if (!wr) + return; + + memset(&ib_event, 0, sizeof(ib_event)); + memset(&cm_event, 0, sizeof(cm_event)); + + event_id = c2_wr_get_id(wr); + resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); + resource_user_context = + (void *) (unsigned long) wr->ae.ae_generic.user_context; + + status = cm_event.status = c2_convert_cm_status(c2_wr_get_result(wr)); + + pr_debug("event received c2_dev=%p, event_id=%d, " + "resource_indicator=%d, user_context=%p, status = %d\n", + c2dev, event_id, resource_indicator, resource_user_context, + status); + + switch (resource_indicator) { + case C2_RES_IND_QP:{ + + struct c2_qp *qp = (struct c2_qp *)resource_user_context; + struct iw_cm_id 
*cm_id = qp->cm_id; + struct c2wr_ae_active_connect_results *res; + + if (!cm_id) { + pr_debug("event received, but cm_id is NULL, qp=%p!\n", + qp); + goto ignore_it; + } + pr_debug("%s: event = %s, user_context=%llx, " + "resource_type=%x, " + "resource=%x, qp_state=%s\n", + __FUNCTION__, + to_event_str(event_id), + be64_to_cpu(wr->ae.ae_generic.user_context), + be32_to_cpu(wr->ae.ae_generic.resource_type), + be32_to_cpu(wr->ae.ae_generic.resource), + to_qp_state_str(be32_to_cpu(wr->ae.ae_generic.qp_state))); + + c2_set_qp_state(qp, be32_to_cpu(wr->ae.ae_generic.qp_state)); + + switch (event_id) { + case CCAE_ACTIVE_CONNECT_RESULTS: + res = &wr->ae.ae_active_connect_results; + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.local_addr.sin_addr.s_addr = res->laddr; + cm_event.remote_addr.sin_addr.s_addr = res->raddr; + cm_event.local_addr.sin_port = res->lport; + cm_event.remote_addr.sin_port = res->rport; + if (status == 0) { + cm_event.private_data_len = + be32_to_cpu(res->private_data_length); + } else { + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.private_data_len = 0; + cm_event.private_data = NULL; + } + if (cm_event.private_data_len) { + /* copy private data */ + pdata = + kmalloc(cm_event.private_data_len, + GFP_ATOMIC); + if (!pdata) { + /* Ignore the request, maybe the + * remote peer will retry */ + pr_debug("Ignored connect request -- " + "no memory for pdata, " + "private_data_len=%d\n", + cm_event.private_data_len); + goto ignore_it; + } + + memcpy(pdata, res->private_data, + cm_event.private_data_len); + + cm_event.private_data = pdata; + } + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + case CCAE_TERMINATE_MESSAGE_RECEIVED: + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: + ib_event.device = &c2dev->ibdev; + ib_event.element.qp = &qp->ibqp; + ib_event.event = IB_EVENT_QP_REQ_ERR; + + if
(qp->ibqp.event_handler) + qp->ibqp.event_handler(&ib_event, + qp->ibqp. + qp_context); + break; + case CCAE_BAD_CLOSE: + case CCAE_LLP_CLOSE_COMPLETE: + case CCAE_LLP_CONNECTION_RESET: + case CCAE_LLP_CONNECTION_LOST: + BUG_ON(cm_id->event_handler == (void *)0x6b6b6b6b); + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.event = IW_CM_EVENT_CLOSE; + cm_event.status = 0; + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + default: + pr_debug("%s:%d Unexpected event_id=%d on QP=%p, " + "CM_ID=%p\n", + __FUNCTION__, __LINE__, + event_id, qp, cm_id); + BUG_ON(1); + break; + } + break; + } + + case C2_RES_IND_EP:{ + + struct c2wr_ae_connection_request *req = + &wr->ae.ae_connection_request; + struct iw_cm_id *cm_id = + (struct iw_cm_id *)resource_user_context; + + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); + if (event_id != CCAE_CONNECTION_REQUEST) { + pr_debug("%s: Invalid event_id: %d\n", + __FUNCTION__, event_id); + break; + } + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; + cm_event.provider_data = (void*)(unsigned long)req->cr_handle; + cm_event.local_addr.sin_addr.s_addr = req->laddr; + cm_event.remote_addr.sin_addr.s_addr = req->raddr; + cm_event.local_addr.sin_port = req->lport; + cm_event.remote_addr.sin_port = req->rport; + cm_event.private_data_len = + be32_to_cpu(req->private_data_length); + + if (cm_event.private_data_len) { + pdata = + kmalloc(cm_event.private_data_len, + GFP_ATOMIC); + if (!pdata) { + /* Ignore the request, maybe the remote peer + * will retry */ + pr_debug("Ignored connect request -- " + "no memory for pdata, " + "private_data_len=%d\n", + cm_event.private_data_len); + goto ignore_it; + } + memcpy(pdata, + req->private_data, + cm_event.private_data_len); + + cm_event.private_data = pdata; + } + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + } + + case
C2_RES_IND_CQ:{ + struct c2_cq *cq = + (struct c2_cq *) resource_user_context; + + pr_debug("IB_EVENT_CQ_ERR\n"); + ib_event.device = &c2dev->ibdev; + ib_event.element.cq = &cq->ibcq; + ib_event.event = IB_EVENT_CQ_ERR; + + if (cq->ibcq.event_handler) + cq->ibcq.event_handler(&ib_event, + cq->ibcq.cq_context); + break; + } + + default: + printk(KERN_ERR PFX "Bad resource indicator = %d\n", + resource_indicator); + break; + } + + ignore_it: + c2_mq_free(mq); +} diff --git a/drivers/infiniband/hw/amso1100/c2_intr.c b/drivers/infiniband/hw/amso1100/c2_intr.c new file mode 100644 index 0000000..75bb18c --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_intr.c @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include +#include "c2_vq.h" + +static void handle_mq(struct c2_dev *c2dev, u32 mq_index); +static void handle_vq(struct c2_dev *c2dev, u32 mq_index); + +/* + * Handle RNIC interrupts + */ +void c2_rnic_interrupt(struct c2_dev *c2dev) +{ + unsigned int mq_index; + + while (c2dev->hints_read != be16_to_cpu(c2dev->hint_count)) { + mq_index = readl(c2dev->regs + PCI_BAR0_HOST_HINT); + if (mq_index & 0x80000000) { + break; + } + + c2dev->hints_read++; + handle_mq(c2dev, mq_index); + } + +} + +/* + * Top level MQ handler + */ +static void handle_mq(struct c2_dev *c2dev, u32 mq_index) +{ + if (c2dev->qptr_array[mq_index] == NULL) { + pr_debug("handle_mq: stray activity for mq_index=%d\n", + mq_index); + return; + } + + switch (mq_index) { + case (0): + /* + * An index of 0 in the activity queue + * indicates the req vq now has messages + * available... + * + * Wake up any waiters waiting on req VQ + * message availability. + */ + wake_up(&c2dev->req_vq_wo); + break; + case (1): + handle_vq(c2dev, mq_index); + break; + case (2): + /* We have to purge the VQ in case there are pending + * accept reply requests that would result in the + * generation of an ESTABLISHED event. If we don't + * generate these first, a CLOSE event could end up + * being delivered before the ESTABLISHED event. + */ + handle_vq(c2dev, 1); + + c2_ae_event(c2dev, mq_index); + break; + default: + /* There is no event synchronization between CQ events + * and AE or CM events. In fact, CQE could be + * delivered for all of the I/O up to and including the + * FLUSH for a peer disconnect prior to the ESTABLISHED + * event being delivered to the app.
The reason for this + * is that CM events are delivered on a thread, while AE + * and CQ events are delivered on interrupt context. + */ + c2_cq_event(c2dev, mq_index); + break; + } + + return; +} + +/* + * Handles verbs WR replies. + */ +static void handle_vq(struct c2_dev *c2dev, u32 mq_index) +{ + void *adapter_msg, *reply_msg; + struct c2wr_hdr *host_msg; + struct c2wr_hdr tmp; + struct c2_mq *reply_vq; + struct c2_vq_req *req; + struct iw_cm_event cm_event; + int err; + + reply_vq = (struct c2_mq *) c2dev->qptr_array[mq_index]; + + /* + * get next msg from mq_index into adapter_msg. + * don't free it yet. + */ + adapter_msg = c2_mq_consume(reply_vq); + if (adapter_msg == NULL) { + return; + } + + host_msg = vq_repbuf_alloc(c2dev); + + /* + * If we can't get a host buffer, then we'll still + * wakeup the waiter, we just won't give him the msg. + * It is assumed the waiter will deal with this... + */ + if (!host_msg) { + pr_debug("handle_vq: no repbufs!\n"); + + /* + * just copy the WR header into a local variable. + * this allows us to still demux on the context + */ + host_msg = &tmp; + memcpy(host_msg, adapter_msg, sizeof(tmp)); + reply_msg = NULL; + } else { + memcpy(host_msg, adapter_msg, reply_vq->msg_size); + reply_msg = host_msg; + } + + /* + * consume the msg from the MQ + */ + c2_mq_free(reply_vq); + + /* + * wakeup the waiter. + */ + req = (struct c2_vq_req *) (unsigned long) host_msg->context; + if (req == NULL) { + /* + * We should never get here, as the adapter should + * never send us a reply that we're not expecting.
+ */ + vq_repbuf_free(c2dev, host_msg); + pr_debug("handle_vq: UNEXPECTEDLY got NULL req\n"); + return; + } + + err = c2_errno(reply_msg); + if (!err) switch (req->event) { + case IW_CM_EVENT_ESTABLISHED: + c2_set_qp_state(req->qp, + C2_QP_STATE_RTS); + case IW_CM_EVENT_CLOSE: + + /* + * Move the QP to RTS if this is + * the established event + */ + cm_event.event = req->event; + cm_event.status = 0; + cm_event.local_addr = req->cm_id->local_addr; + cm_event.remote_addr = req->cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + req->cm_id->event_handler(req->cm_id, &cm_event); + break; + default: + break; + } + + req->reply_msg = (u64) (unsigned long) (reply_msg); + atomic_set(&req->reply_ready, 1); + wake_up(&req->wait_object); + + /* + * If the request was cancelled, then this put will + * free the vq_req memory...and reply_msg!!! + */ + vq_req_put(c2dev, req); +} diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c new file mode 100644 index 0000000..49645a9 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c @@ -0,0 +1,631 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include +#include "c2.h" +#include "c2_vq.h" + +/* Device capabilities */ +#define C2_MIN_PAGESIZE 1024 + +#define C2_MAX_MRS 32768 +#define C2_MAX_QPS 16000 +#define C2_MAX_WQE_SZ 256 +#define C2_MAX_QP_WR ((128*1024)/C2_MAX_WQE_SZ) +#define C2_MAX_SGES 4 +#define C2_MAX_SGE_RD 1 +#define C2_MAX_CQS 32768 +#define C2_MAX_CQES 4096 +#define C2_MAX_PDS 16384 + +/* + * Send the adapter INIT message to the amso1100 + */ +static int c2_adapter_init(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + int err; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_INIT); + wr.hdr.context = 0; + wr.hint_count = cpu_to_be64(__pa(&c2dev->hint_count)); + wr.q0_host_shared = cpu_to_be64(__pa(c2dev->req_vq.shared)); + wr.q1_host_shared = cpu_to_be64(__pa(c2dev->rep_vq.shared)); + wr.q1_host_msg_pool = cpu_to_be64(__pa(c2dev->rep_vq.msg_pool.host)); + wr.q2_host_shared = cpu_to_be64(__pa(c2dev->aeq.shared)); + wr.q2_host_msg_pool = cpu_to_be64(__pa(c2dev->aeq.msg_pool.host)); + + /* Post the init message */ + err = vq_send_wr(c2dev, (union 
c2wr *) & wr); + + return err; +} + +/* + * Send the adapter TERM message to the amso1100 + */ +static void c2_adapter_term(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_TERM); + wr.hdr.context = 0; + + /* Post the term message */ + vq_send_wr(c2dev, (union c2wr *) & wr); + c2dev->init = 0; + + return; +} + +/* + * Query the adapter + */ +int c2_rnic_query(struct c2_dev *c2dev, + struct ib_device_attr *props) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_query_req wr; + struct c2wr_rnic_query_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_RNIC_QUERY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_query_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + if (err) + goto bail2; + + props->fw_ver = + ((u64)be32_to_cpu(reply->fw_ver_major) << 32) | + ((be32_to_cpu(reply->fw_ver_minor) & 0xFFFF) << 16) | + (be32_to_cpu(reply->fw_ver_patch) & 0xFFFF); + memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); + props->max_mr_size = 0xFFFFFFFF; + props->page_size_cap = ~(C2_MIN_PAGESIZE-1); + props->vendor_id = be32_to_cpu(reply->vendor_id); + props->vendor_part_id = be32_to_cpu(reply->part_number); + props->hw_ver = be32_to_cpu(reply->hw_version); + props->max_qp = be32_to_cpu(reply->max_qps); + props->max_qp_wr = be32_to_cpu(reply->max_qp_depth); + props->device_cap_flags = c2dev->device_cap_flags; + props->max_sge = C2_MAX_SGES; + props->max_sge_rd = C2_MAX_SGE_RD; + props->max_cq = be32_to_cpu(reply->max_cqs); + props->max_cqe = be32_to_cpu(reply->max_cq_depth); + props->max_mr =
be32_to_cpu(reply->max_mrs); + props->max_pd = be32_to_cpu(reply->max_pds); + props->max_qp_rd_atom = be32_to_cpu(reply->max_qp_ird); + props->max_ee_rd_atom = 0; + props->max_res_rd_atom = be32_to_cpu(reply->max_global_ird); + props->max_qp_init_rd_atom = be32_to_cpu(reply->max_qp_ord); + props->max_ee_init_rd_atom = 0; + props->atomic_cap = IB_ATOMIC_NONE; + props->max_ee = 0; + props->max_rdd = 0; + props->max_mw = be32_to_cpu(reply->max_mws); + props->max_raw_ipv6_qp = 0; + props->max_raw_ethy_qp = 0; + props->max_mcast_grp = 0; + props->max_mcast_qp_attach = 0; + props->max_total_mcast_qp_attach = 0; + props->max_ah = 0; + props->max_fmr = 0; + props->max_map_per_fmr = 0; + props->max_srq = 0; + props->max_srq_wr = 0; + props->max_srq_sge = 0; + props->max_pkeys = 0; + props->local_ca_ack_delay = 0; + + bail2: + vq_repbuf_free(c2dev, reply); + + bail1: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Add an IP address to the RNIC interface + */ +int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_ADD_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err 
= -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Delete an IP address from the RNIC interface + */ +int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_DEL_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Open a single RNIC instance to use with all + * low level openib calls + */ +static int c2_rnic_open(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_open_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_OPEN); + wr.rnic_open.req.hdr.context = (unsigned long) (vq_req); + wr.rnic_open.req.flags = cpu_to_be16(RNIC_PRIV_MODE); + wr.rnic_open.req.port_num = cpu_to_be16(0); + wr.rnic_open.req.user_context = 
(unsigned long) c2dev; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_open_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = reply->rnic_handle; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Close the RNIC instance + */ +static int c2_rnic_close(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_close_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_CLOSE); + wr.rnic_close.req.hdr.context = (unsigned long) vq_req; + wr.rnic_close.req.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_close_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Called by c2_probe to initialize the RNIC. This principally + * involves initializing the various limits and resource pools that + * comprise the RNIC instance.
+ */ +int c2_rnic_init(struct c2_dev *c2dev) +{ + int err; + u32 qsize, msgsize; + void *q1_pages; + void *q2_pages; + void __iomem *mmio_regs; + + /* Device capabilities */ + c2dev->device_cap_flags = + (IB_DEVICE_RESIZE_MAX_WR | + IB_DEVICE_CURR_QP_STATE_MOD | + IB_DEVICE_SYS_IMAGE_GUID | + IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + /* Allocate the qptr_array */ + c2dev->qptr_array = vmalloc(C2_MAX_CQS * sizeof(void *)); + if (!c2dev->qptr_array) { + return -ENOMEM; + } + + /* Initialize the qptr_array */ + memset(c2dev->qptr_array, 0, C2_MAX_CQS * sizeof(void *)); + c2dev->qptr_array[0] = (void *) &c2dev->req_vq; + c2dev->qptr_array[1] = (void *) &c2dev->rep_vq; + c2dev->qptr_array[2] = (void *) &c2dev->aeq; + + /* Initialize data structures */ + init_waitqueue_head(&c2dev->req_vq_wo); + spin_lock_init(&c2dev->vqlock); + spin_lock_init(&c2dev->lock); + + /* Allocate MQ shared pointer pool for kernel clients. User + * mode client pools are hung off the user context + */ + err = c2_init_mqsp_pool(GFP_KERNEL, &c2dev->kern_mqsp_pool); + if (err) { + goto bail0; + } + + /* Allocate shared pointers for Q0, Q1, and Q2 from + * the shared pointer pool.
+ */ + c2dev->req_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + c2dev->rep_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + c2dev->aeq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!c2dev->req_vq.shared || + !c2dev->rep_vq.shared || !c2dev->aeq.shared) { + err = -ENOMEM; + goto bail1; + } + + mmio_regs = c2dev->kva; + /* Initialize the Verbs Request Queue */ + c2_mq_req_init(&c2dev->req_vq, 0, + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_QSIZE)), + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_MSGSIZE)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_POOLSTART)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_SHARED)), + C2_MQ_ADAPTER_TARGET); + + /* Initialize the Verbs Reply Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_MSGSIZE)); + q1_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q1_pages) { + err = -ENOMEM; + goto bail1; + } + c2_mq_rep_init(&c2dev->rep_vq, + 1, + qsize, + msgsize, + q1_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the Asynchronous Event Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_MSGSIZE)); + q2_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q2_pages) { + err = -ENOMEM; + goto bail2; + } + c2_mq_rep_init(&c2dev->aeq, + 2, + qsize, + msgsize, + q2_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the verbs request allocator */ + err = vq_init(c2dev); + if (err) + goto bail3; + + /* Enable interrupts on the adapter */ + writel(0, c2dev->regs + C2_IDIS); + + /* create the WR init message */ + err = c2_adapter_init(c2dev); + if (err) + goto bail4; + c2dev->init++; + + /* open an adapter instance */ + err = c2_rnic_open(c2dev); + if (err) + goto bail4; + + /* Initialize the cached adapter limits */ + if (c2_rnic_query(c2dev,
&c2dev->props)) + goto bail4; + + /* Initialize the PD pool */ + err = c2_init_pd_table(c2dev); + if (err) + goto bail5; + + /* Initialize the QP pool */ + err = c2_init_qp_table(c2dev); + if (err) + goto bail6; + return 0; + + bail6: + c2_cleanup_pd_table(c2dev); + bail5: + c2_rnic_close(c2dev); + bail4: + vq_term(c2dev); + bail3: + kfree(q2_pages); + bail2: + kfree(q1_pages); + bail1: + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); + bail0: + vfree(c2dev->qptr_array); + + return err; +} + +/* + * Called by c2_remove to clean up the RNIC resources. + */ +void c2_rnic_term(struct c2_dev *c2dev) +{ + + /* Close the open adapter instance */ + c2_rnic_close(c2dev); + + /* Send the TERM message to the adapter */ + c2_adapter_term(c2dev); + + /* Disable interrupts on the adapter */ + writel(1, c2dev->regs + C2_IDIS); + + /* Free the QP pool */ + c2_cleanup_qp_table(c2dev); + + /* Free the PD pool */ + c2_cleanup_pd_table(c2dev); + + /* Free the verbs request allocator */ + vq_term(c2dev); + + /* Free the asynchronous event queue */ + kfree(c2dev->aeq.msg_pool.host); + + /* Free the verbs reply queue */ + kfree(c2dev->rep_vq.msg_pool.host); + + /* Free the MQ shared pointer pool */ + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); + + /* Free the qptr_array */ + vfree(c2dev->qptr_array); + + return; +} From swise at opengridcomputing.com Wed Jun 7 13:06:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:53 -0500 Subject: [openib-general] [PATCH v2 3/7] AMSO1100 OpenFabrics Provider.
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200653.9259.31696.stgit@stevo-desktop> Review Changes: sizeof -> sizeof() dprintk() -> pr_debug() assert() -> BUG_ON() C2_DEBUG -> DEBUG --- drivers/infiniband/hw/amso1100/c2_cm.c | 452 ++++++++++++ drivers/infiniband/hw/amso1100/c2_cq.c | 423 +++++++++++ drivers/infiniband/hw/amso1100/c2_pd.c | 71 ++ drivers/infiniband/hw/amso1100/c2_provider.c | 867 +++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_provider.h | 182 +++++ drivers/infiniband/hw/amso1100/c2_qp.c | 975 ++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_user.h | 82 ++ 7 files changed, 3052 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_cm.c b/drivers/infiniband/hw/amso1100/c2_cm.c new file mode 100644 index 0000000..018d11f --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cm.c @@ -0,0 +1,452 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_wr.h" +#include "c2_vq.h" +#include + +int c2_llp_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct ib_qp *ibqp; + struct c2_qp *qp; + struct c2wr_qp_connect_req *wr; /* variable size needs a malloc. */ + struct c2_vq_req *vq_req; + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Associate QP <--> CM_ID */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + /* + * only support the max private_data length + */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail0; + } + /* + * Set the rdma read limits + */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* + * Create and send a WR_QP_CONNECT... + */ + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + c2_wr_set_id(wr, CCWR_QP_CONNECT); + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->qp_handle = qp->adapter_handle; + + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; + wr->remote_port = cm_id->remote_addr.sin_port; + + /* + * Move any private data from the caller's buf into + * the WR.
+ */ + if (iw_param->private_data) { + wr->private_data_length = + cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], iw_param->private_data, + iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* + * Send WR to adapter. NOTE: There is no synch reply from + * the adapter. + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + vq_req_free(c2dev, vq_req); + + bail1: + kfree(wr); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog) +{ + struct c2_dev *c2dev; + struct c2wr_ep_listen_create_req wr; + struct c2wr_ep_listen_create_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; + wr.local_port = cm_id->local_addr.sin_port; + wr.backlog = cpu_to_be32(backlog); + wr.user_context = (u64) (unsigned long) cm_id; + + /* + * Reference the request struct. Dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = + (struct c2wr_ep_listen_create_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + if ((err = c2_errno(reply)) != 0) + goto bail1; + + /* + * Keep the adapter handle. 
Used in subsequent destroy + */ + cm_id->provider_data = (void*)(unsigned long) reply->ep_handle; + + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + return 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + + +int c2_llp_service_destroy(struct iw_cm_id *cm_id) +{ + + struct c2_dev *c2dev; + struct c2wr_ep_listen_destroy_req wr; + struct c2wr_ep_listen_destroy_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32)(unsigned long)cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply=(struct c2wr_ep_listen_destroy_rep *)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + if ((err = c2_errno(reply)) != 0) + goto bail1; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_llp_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct c2_qp *qp; + struct ib_qp *ibqp; + struct c2wr_cr_accept_req *wr; /* variable length WR */ + struct c2_vq_req *vq_req; + struct c2wr_cr_accept_rep *reply; /* VQ Reply msg ptr. 
*/ + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Set the RDMA read limits */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* Allocate verbs request. */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + vq_req->qp = qp; + vq_req->cm_id = cm_id; + vq_req->event = IW_CM_EVENT_ESTABLISHED; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail2; + } + + /* Build the WR */ + c2_wr_set_id(wr, CCWR_CR_ACCEPT); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->ep_handle = (u32) (unsigned long) cm_id->provider_data; + wr->qp_handle = qp->adapter_handle; + + /* Replace the cr_handle with the QP after accept */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + cm_id->provider_data = qp; + + /* Validate private_data length */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail2; + } + + if (iw_param->private_data) { + wr->private_data_length = cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], + iw_param->private_data, iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* Reference the request struct. Dereferenced in the int handler. 
*/ + vq_req_get(c2dev, vq_req); + + /* Send WR to adapter */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail2; + } + + /* Wait for reply from adapter */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + /* Check that reply is present */ + reply = (struct c2wr_cr_accept_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + if (!err) + c2_set_qp_state(qp, C2_QP_STATE_RTS); + bail2: + kfree(wr); + bail1: + vq_req_free(c2dev, vq_req); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + struct c2_dev *c2dev; + struct c2wr_cr_reject_req wr; + struct c2_vq_req *vq_req; + struct c2wr_cr_reject_rep *reply; + int err; + + c2dev = to_c2dev(cm_id->device); + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_CR_REJECT); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32) (unsigned long) cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = (struct c2wr_cr_reject_rep *) (unsigned long) + vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + err = c2_errno(reply); + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + + bail0: + vq_req_free(c2dev, vq_req); + return err; +} diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c new file mode 100644 index 0000000..71128ff --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -0,0 +1,423 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_vq.h" +#include "c2_status.h" + +#define C2_CQ_MSG_SIZE ((sizeof(struct c2wr_ce) + 32-1) & ~(32-1)) + +struct c2_cq *c2_cq_get(struct c2_dev *c2dev, int cqn) +{ + struct c2_cq *cq; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + cq = c2dev->qptr_array[cqn]; + if (!cq) { + spin_unlock_irqrestore(&c2dev->lock, flags); + return NULL; + } + atomic_inc(&cq->refcount); + spin_unlock_irqrestore(&c2dev->lock, flags); + return cq; +} + +void c2_cq_put(struct c2_cq *cq) +{ + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void c2_cq_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_cq *cq; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) { + printk("discarding events on destroyed CQN=%d\n", mq_index); + return; + } + + (*cq->ibcq.comp_handler) (&cq->ibcq, cq->ibcq.cq_context); + c2_cq_put(cq); +} + +void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index) +{ + struct c2_cq *cq; + struct c2_mq *q; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) + return; + + spin_lock_irq(&cq->lock); + q = &cq->mq; + if (q && !c2_mq_empty(q)) { + u16 priv = q->priv; + struct c2wr_ce *msg; + + while (priv != be16_to_cpu(*q->shared)) { + msg = (struct c2wr_ce *) + (q->msg_pool.host + 
priv * q->msg_size); + if (msg->qp_user_context == (u64) (unsigned long) qp) { + msg->qp_user_context = (u64) 0; + } + priv = (priv + 1) % q->q_size; + } + } + spin_unlock_irq(&cq->lock); + c2_cq_put(cq); +} + +static inline enum ib_wc_status c2_cqe_status_to_openib(u8 status) +{ + switch (status) { + case C2_OK: + return IB_WC_SUCCESS; + case CCERR_FLUSHED: + return IB_WC_WR_FLUSH_ERR; + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return IB_WC_LOC_PROT_ERR; + case CCERR_ACCESS_VIOLATION: + return IB_WC_LOC_ACCESS_ERR; + case CCERR_TOTAL_LENGTH_TOO_BIG: + return IB_WC_LOC_LEN_ERR; + case CCERR_INVALID_WINDOW: + return IB_WC_MW_BIND_ERR; + default: + return IB_WC_GENERAL_ERR; + } +} + + +static inline int c2_poll_one(struct c2_dev *c2dev, + struct c2_cq *cq, struct ib_wc *entry) +{ + struct c2wr_ce *ce; + struct c2_qp *qp; + int is_recv = 0; + + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) { + return -EAGAIN; + } + + /* + * if the qp returned is null then this qp has already + * been freed and we are unable to process the completion.
+ * try pulling the next message + */ + while ((qp = + (struct c2_qp *) (unsigned long) ce->qp_user_context) == NULL) { + c2_mq_free(&cq->mq); + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) + return -EAGAIN; + } + + entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce)); + entry->wr_id = ce->hdr.context; + entry->qp_num = ce->handle; + entry->wc_flags = 0; + entry->slid = 0; + entry->sl = 0; + entry->src_qp = 0; + entry->dlid_path_bits = 0; + entry->pkey_index = 0; + + switch (c2_wr_get_id(ce)) { + case C2_WR_TYPE_SEND: + entry->opcode = IB_WC_SEND; + break; + case C2_WR_TYPE_RDMA_WRITE: + entry->opcode = IB_WC_RDMA_WRITE; + break; + case C2_WR_TYPE_RDMA_READ: + entry->opcode = IB_WC_RDMA_READ; + break; + case C2_WR_TYPE_BIND_MW: + entry->opcode = IB_WC_BIND_MW; + break; + case C2_WR_TYPE_RECV: + entry->byte_len = be32_to_cpu(ce->bytes_rcvd); + entry->opcode = IB_WC_RECV; + is_recv = 1; + break; + default: + break; + } + + /* consume the WQEs */ + if (is_recv) + c2_mq_lconsume(&qp->rq_mq, 1); + else + c2_mq_lconsume(&qp->sq_mq, + be32_to_cpu(c2_wr_get_wqe_count(ce)) + 1); + + /* free the message */ + c2_mq_free(&cq->mq); + + return 0; +} + +int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) +{ + struct c2_dev *c2dev = to_c2dev(ibcq->device); + struct c2_cq *cq = to_c2cq(ibcq); + unsigned long flags; + int npolled, err; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + + err = c2_poll_one(c2dev, cq, entry + npolled); + if (err) + break; + } + + spin_unlock_irqrestore(&cq->lock, flags); + + return npolled; +} + +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct c2_mq_shared __iomem *shared; + struct c2_cq *cq; + + cq = to_c2cq(ibcq); + shared = cq->mq.peer; + + if (notify == IB_CQ_NEXT_COMP) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT, &shared->notification_type); + else if (notify == IB_CQ_SOLICITED) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT_SE, 
&shared->notification_type); + else + return -EINVAL; + + writeb(CQ_WAIT_FOR_DMA | CQ_ARMED, &shared->armed); + + /* + * Now read back shared->armed to make the PCI + * write synchronous. This is necessary for + * correct cq notification semantics. + */ + readb(&shared->armed); + + return 0; +} + +static void c2_free_cq_buf(struct c2_mq *mq) +{ + free_pages((unsigned long) mq->msg_pool.host, + get_order(mq->q_size * mq->msg_size)); +} + +static int c2_alloc_cq_buf(struct c2_mq *mq, int q_size, int msg_size) +{ + unsigned long pool_start; + + pool_start = __get_free_pages(GFP_KERNEL, + get_order(q_size * msg_size)); + if (!pool_start) + return -ENOMEM; + + c2_mq_rep_init(mq, + 0, /* index (currently unknown) */ + q_size, + msg_size, + (u8 *) pool_start, + NULL, /* peer (currently unknown) */ + C2_MQ_HOST_TARGET); + + return 0; +} + +int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq) +{ + struct c2wr_cq_create_req wr; + struct c2wr_cq_create_rep *reply; + unsigned long peer_pa; + struct c2_vq_req *vq_req; + int err; + + might_sleep(); + + cq->ibcq.cqe = entries - 1; + cq->is_kernel = !ctx; + + /* Allocate a shared pointer */ + cq->mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!cq->mq.shared) + return -ENOMEM; + + /* Allocate pages for the message pool */ + err = c2_alloc_cq_buf(&cq->mq, entries + 1, C2_CQ_MSG_SIZE); + if (err) + goto bail0; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_CREATE); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.msg_size = cpu_to_be32(cq->mq.msg_size); + wr.depth = cpu_to_be32(cq->mq.q_size); + wr.shared_ht = cpu_to_be64(__pa(cq->mq.shared)); + wr.msg_pool = cpu_to_be64(__pa(cq->mq.msg_pool.host)); + wr.user_context = (u64) (unsigned long) (cq); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + 
vq_req_put(c2dev, vq_req); + goto bail2; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + reply = (struct c2wr_cq_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + if ((err = c2_errno(reply)) != 0) + goto bail3; + + cq->adapter_handle = reply->cq_handle; + cq->mq.index = be32_to_cpu(reply->mq_index); + + peer_pa = c2dev->pa + be32_to_cpu(reply->adapter_shared); + cq->mq.peer = ioremap_nocache(peer_pa, PAGE_SIZE); + if (!cq->mq.peer) { + err = -ENOMEM; + goto bail3; + } + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + /* + * Use the MQ index allocated by the adapter to + * store the CQ in the qptr_array + */ + cq->cqn = cq->mq.index; + c2dev->qptr_array[cq->cqn] = cq; + + return 0; + + bail3: + vq_repbuf_free(c2dev, reply); + bail2: + vq_req_free(c2dev, vq_req); + bail1: + c2_free_cq_buf(&cq->mq); + bail0: + c2_free_mqsp(cq->mq.shared); + + return err; +} + +void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq) +{ + int err; + struct c2_vq_req *vq_req; + struct c2wr_cq_destroy_req wr; + struct c2wr_cq_destroy_rep *reply; + + might_sleep(); + + /* Clear CQ from the qptr array */ + spin_lock_irq(&c2dev->lock); + c2dev->qptr_array[cq->mq.index] = NULL; + atomic_dec(&cq->refcount); + spin_unlock_irq(&c2dev->lock); + + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + goto bail0; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.cq_handle = cq->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = (struct c2wr_cq_destroy_rep *) 
(unsigned long) (vq_req->reply_msg);
+
+	vq_repbuf_free(c2dev, reply);
+      bail1:
+	vq_req_free(c2dev, vq_req);
+      bail0:
+	if (cq->is_kernel) {
+		c2_free_cq_buf(&cq->mq);
+	}
+
+	return;
+}
diff --git a/drivers/infiniband/hw/amso1100/c2_pd.c b/drivers/infiniband/hw/amso1100/c2_pd.c
new file mode 100644
index 0000000..27459b8
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_pd.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * Copyright (c) 2005 Cisco Systems. All rights reserved.
+ * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/init.h>
+#include <linux/errno.h>
+
+#include "c2.h"
+#include "c2_provider.h"
+
+int c2_pd_alloc(struct c2_dev *dev, int privileged, struct c2_pd *pd)
+{
+	int err = 0;
+
+	might_sleep();
+
+	atomic_set(&pd->sqp_count, 0);
+	pd->pd_id = c2_alloc(&dev->pd_table.alloc);
+	if (pd->pd_id == -1)
+		return -ENOMEM;
+
+	return err;
+}
+
+void c2_pd_free(struct c2_dev *dev, struct c2_pd *pd)
+{
+	might_sleep();
+	c2_free(&dev->pd_table.alloc, pd->pd_id);
+}
+
+int __devinit c2_init_pd_table(struct c2_dev *dev)
+{
+	return c2_alloc_init(&dev->pd_table.alloc, dev->props.max_pd, 0);
+}
+
+void __devexit c2_cleanup_pd_table(struct c2_dev *dev)
+{
+	/* XXX check if any PDs are still allocated? */
+	c2_alloc_cleanup(&dev->pd_table.alloc);
+}
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
new file mode 100644
index 0000000..eaf786e
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -0,0 +1,867 @@
+/*
+ * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/pci.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/delay.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <linux/if_vlan.h>
+#include <linux/crc32.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/init.h>
+#include <linux/dma-mapping.h>
+#include <linux/if_arp.h>
+
+#include <asm/io.h>
+#include <asm/irq.h>
+#include <asm/byteorder.h>
+
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+#include "c2.h"
+#include "c2_provider.h"
+#include "c2_user.h"
+
+static int c2_query_device(struct ib_device *ibdev,
+			   struct ib_device_attr *props)
+{
+	struct c2_dev *c2dev = to_c2dev(ibdev);
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	*props = c2dev->props;
+	return 0;
+}
+
+static int c2_query_port(struct ib_device *ibdev,
+			 u8 port, struct ib_port_attr *props)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	props->max_mtu = IB_MTU_4096;
+	props->lid = 0;
+	props->lmc = 0;
+	props->sm_lid = 0;
+	props->sm_sl = 0;
+	props->state = IB_PORT_ACTIVE;
+	props->phys_state = 0;
+	props->port_cap_flags =
+	    IB_PORT_CM_SUP |
+	    IB_PORT_REINIT_SUP |
+	    IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP;
+	props->gid_tbl_len = 1;
+	props->pkey_tbl_len = 1;
+	props->qkey_viol_cntr = 0;
+	props->active_width = 1;
+	props->active_speed = 1;
+
+	return 0;
+}
+
+static int c2_modify_port(struct ib_device *ibdev,
+			  u8 port, int port_modify_mask,
+			  struct ib_port_modify *props)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return 0;
+}
+
+static int c2_query_pkey(struct ib_device *ibdev,
+			 u8 port, u16 index, u16 * pkey)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	*pkey = 0;
+	return 0;
+}
+
+static int c2_query_gid(struct ib_device *ibdev, u8 port,
+			int index, union ib_gid *gid)
+{
+	struct c2_dev *c2dev = to_c2dev(ibdev);
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	memset(&(gid->raw[0]), 0, sizeof(gid->raw));
+	memcpy(&(gid->raw[0]), c2dev->pseudo_netdev->dev_addr, 6);
+
+	return 0;
+}
+
+/* Allocate the user context data structure. This keeps track
+ * of all objects associated with a particular user-mode client.
+ */
+static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev,
+					     struct ib_udata *udata)
+{
+	struct c2_ucontext *context;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	context = kmalloc(sizeof(*context), GFP_KERNEL);
+	if (!context)
+		return ERR_PTR(-ENOMEM);
+
+	return &context->ibucontext;
+}
+
+static int c2_dealloc_ucontext(struct ib_ucontext *context)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	kfree(context);
+	return 0;
+}
+
+static int c2_mmap_uar(struct ib_ucontext *context, struct vm_area_struct *vma)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static struct ib_pd *c2_alloc_pd(struct ib_device *ibdev,
+				 struct ib_ucontext *context,
+				 struct ib_udata *udata)
+{
+	struct c2_pd *pd;
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	pd = kmalloc(sizeof(*pd), GFP_KERNEL);
+	if (!pd)
+		return ERR_PTR(-ENOMEM);
+
+	err = c2_pd_alloc(to_c2dev(ibdev), !context, pd);
+	if (err) {
+		kfree(pd);
+		return ERR_PTR(err);
+	}
+
+	if (context) {
+		if (ib_copy_to_udata(udata, &pd->pd_id, sizeof(__u32))) {
+			c2_pd_free(to_c2dev(ibdev), pd);
+			kfree(pd);
+			return ERR_PTR(-EFAULT);
+		}
+	}
+
+	return &pd->ibpd;
+}
+
+static int c2_dealloc_pd(struct ib_pd *pd)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	c2_pd_free(to_c2dev(pd->device), to_c2pd(pd));
+	kfree(pd);
+
+	return 0;
+}
+
+static struct ib_ah *c2_ah_create(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return ERR_PTR(-ENOSYS);
+}
+
+static int c2_ah_destroy(struct ib_ah *ah)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static void c2_add_ref(struct ib_qp *ibqp)
+{
+	struct c2_qp *qp;
+	BUG_ON(!ibqp);
+	qp = to_c2qp(ibqp);
+	atomic_inc(&qp->refcount);
+}
+
+static void c2_rem_ref(struct ib_qp *ibqp)
+{
+	struct c2_qp *qp;
+	BUG_ON(!ibqp);
+	qp = to_c2qp(ibqp);
+	if (atomic_dec_and_test(&qp->refcount))
+		wake_up(&qp->wait);
+}
+
+struct ib_qp *c2_get_qp(struct ib_device *device, int qpn)
+{
+	struct c2_dev* c2dev = to_c2dev(device);
+	struct c2_qp *qp;
+
+	qp = c2dev->qp_table.map[qpn];
+	pr_debug("%s Returning QP=%p for QPN=%d, device=%p, refcount=%d\n",
+		__FUNCTION__, qp, qpn, device,
+		(qp?atomic_read(&qp->refcount):0));
+
+	return (qp?&qp->ibqp:NULL);
+}
+
+static struct ib_qp *c2_create_qp(struct ib_pd *pd,
+				  struct ib_qp_init_attr *init_attr,
+				  struct ib_udata *udata)
+{
+	struct c2_qp *qp;
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	switch (init_attr->qp_type) {
+	case IB_QPT_RC:
+		qp = kzalloc(sizeof(*qp), GFP_KERNEL);
+		if (!qp) {
+			pr_debug("%s: Unable to allocate QP\n", __FUNCTION__);
+			return ERR_PTR(-ENOMEM);
+		}
+		spin_lock_init(&qp->lock);
+		if (pd->uobject) {
+			/* XXX userspace specific */
+		}
+
+		err = c2_alloc_qp(to_c2dev(pd->device),
+				  to_c2pd(pd), init_attr, qp);
+
+		if (err && pd->uobject) {
+			/* XXX userspace specific */
+		}
+
+		break;
+	default:
+		pr_debug("%s: Invalid QP type: %d\n", __FUNCTION__,
+			init_attr->qp_type);
+		return ERR_PTR(-EINVAL);
+		break;
+	}
+
+	if (err) {
+		kfree(qp);
+		return ERR_PTR(err);
+	}
+
+	return &qp->ibqp;
+}
+
+static int c2_destroy_qp(struct ib_qp *ib_qp)
+{
+	struct c2_qp *qp = to_c2qp(ib_qp);
+
+	pr_debug("%s:%u qp=%p,qp->state=%d\n",
+		__FUNCTION__, __LINE__,ib_qp,qp->state);
+	c2_free_qp(to_c2dev(ib_qp->device), qp);
+	kfree(qp);
+	return 0;
+}
+
+static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries,
+				  struct ib_ucontext *context,
+				  struct ib_udata *udata)
+{
+	struct c2_cq *cq;
+	int err;
+
+	cq = kmalloc(sizeof(*cq), GFP_KERNEL);
+	if (!cq) {
+		pr_debug("%s: Unable to allocate CQ\n", __FUNCTION__);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq);
+	if (err) {
+		pr_debug("%s: error initializing CQ\n", __FUNCTION__);
+		kfree(cq);
+		return ERR_PTR(err);
+	}
+
+	return &cq->ibcq;
+}
+
+static int c2_destroy_cq(struct ib_cq *ib_cq)
+{
+	struct c2_cq *cq = to_c2cq(ib_cq);
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	c2_free_cq(to_c2dev(ib_cq->device), cq);
+	kfree(cq);
+
+	return 0;
+}
+
+static inline u32 c2_convert_access(int acc)
+{
+	return (acc & IB_ACCESS_REMOTE_WRITE ? C2_ACF_REMOTE_WRITE : 0) |
+	    (acc & IB_ACCESS_REMOTE_READ ? C2_ACF_REMOTE_READ : 0) |
+	    (acc & IB_ACCESS_LOCAL_WRITE ? C2_ACF_LOCAL_WRITE : 0) |
+	    C2_ACF_LOCAL_READ | C2_ACF_WINDOW_BIND;
+}
+
+static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd,
+				    struct ib_phys_buf *buffer_list,
+				    int num_phys_buf, int acc, u64 * iova_start)
+{
+	struct c2_mr *mr;
+	u64 *page_list;
+	u32 total_len;
+	int err, i, j, k, page_shift, pbl_depth;
+
+	pbl_depth = 0;
+	total_len = 0;
+
+	page_shift = PAGE_SHIFT;
+	/*
+	 * If there is only 1 buffer we assume this could
+	 * be a map of all phy mem...use a 32k page_shift.
+	 */
+	if (num_phys_buf == 1)
+		page_shift += 3;	/* XXX */
+
+	for (i = 0; i < num_phys_buf; i++) {
+
+		if (buffer_list[i].addr & ~PAGE_MASK) {
+			pr_debug("Unaligned Memory Buffer: 0x%x\n",
+				(unsigned int) buffer_list[i].addr);
+			return ERR_PTR(-EINVAL);
+		}
+
+		if (!buffer_list[i].size) {
+			pr_debug("Invalid Buffer Size\n");
+			return ERR_PTR(-EINVAL);
+		}
+
+		total_len += buffer_list[i].size;
+		pbl_depth += ALIGN(buffer_list[i].size,
+				   (1 << page_shift)) >> page_shift;
+	}
+
+	page_list = vmalloc(sizeof(u64) * pbl_depth);
+	if (!page_list) {
+		pr_debug("couldn't vmalloc page_list of size %zd\n",
+			(sizeof(u64) * pbl_depth));
+		return ERR_PTR(-ENOMEM);
+	}
+
+	for (i = 0, j = 0; i < num_phys_buf; i++) {
+
+		int naddrs;
+
+		naddrs = ALIGN(buffer_list[i].size,
+			       (1 << page_shift)) >> page_shift;
+		for (k = 0; k < naddrs; k++)
+			page_list[j++] = (buffer_list[i].addr +
+					  (k << page_shift));
+	}
+
+	mr = kmalloc(sizeof(*mr), GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	mr->pd = to_c2pd(ib_pd);
+	pr_debug("%s - page shift %d, pbl_depth %d, total_len %u, "
+		"*iova_start %llx, first pa %llx, last pa %llx\n",
+		__FUNCTION__, page_shift, pbl_depth, total_len,
+		*iova_start, page_list[0], page_list[pbl_depth-1]);
+	err = c2_nsmr_register_phys_kern(to_c2dev(ib_pd->device), page_list,
+					 (1 << page_shift), pbl_depth,
+					 total_len, 0, iova_start,
+					 c2_convert_access(acc), mr);
+	vfree(page_list);
+	if (err) {
+		kfree(mr);
+		return ERR_PTR(err);
+	}
+
+	return &mr->ibmr;
+}
+
+static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct ib_phys_buf bl;
+	u64 kva = 0;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	/* AMSO1100 limit */
+	bl.size = 0xffffffff;
+	bl.addr = 0;
+	return c2_reg_phys_mr(pd, &bl, 1, acc, &kva);
+}
+
+static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region,
+				    int acc, struct ib_udata *udata)
+{
+	u64 *pages;
+	u64 kva = 0;
+	int shift, n, len;
+	int i, j, k;
+	int err = 0;
+	struct ib_umem_chunk *chunk;
+	struct c2_pd *c2pd = to_c2pd(pd);
+	struct c2_mr *c2mr;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	shift = ffs(region->page_size) - 1;
+
+	c2mr = kmalloc(sizeof(*c2mr), GFP_KERNEL);
+	if (!c2mr)
+		return ERR_PTR(-ENOMEM);
+	c2mr->pd = c2pd;
+
+	n = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		n += chunk->nents;
+
+	pages = kmalloc(n * sizeof(u64), GFP_KERNEL);
+	if (!pages) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	i = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list) {
+		for (j = 0; j < chunk->nmap; ++j) {
+			len = sg_dma_len(&chunk->page_list[j]) >> shift;
+			for (k = 0; k < len; ++k) {
+				pages[i++] =
+				    sg_dma_address(&chunk->page_list[j]) +
+				    (region->page_size * k);
+			}
+		}
+	}
+
+	kva = (u64)region->virt_base;
+	err = c2_nsmr_register_phys_kern(to_c2dev(pd->device),
+					 pages,
+					 region->page_size,
+					 i,
+					 region->length,
+					 region->offset,
+					 &kva,
+					 c2_convert_access(acc),
+					 c2mr);
+	kfree(pages);
+	if (err) {
+		kfree(c2mr);
+		return ERR_PTR(err);
+	}
+	return &c2mr->ibmr;
+
+err:
+	kfree(c2mr);
+	return ERR_PTR(err);
+}
+
+static int c2_dereg_mr(struct ib_mr *ib_mr)
+{
+	struct c2_mr *mr = to_c2mr(ib_mr);
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey);
+	if (err)
+		pr_debug("c2_stag_dealloc failed: %d\n", err);
+	else
+		kfree(mr);
+
+	return err;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+	struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev);
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return sprintf(buf, "%x\n", dev->props.hw_ver);
+}
+
+static ssize_t show_fw_ver(struct class_device *cdev, char *buf)
+{
+	struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev);
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return sprintf(buf, "%x.%x.%x\n",
+		       (int) (dev->props.fw_ver >> 32),
+		       (int) (dev->props.fw_ver >> 16) & 0xffff,
+		       (int) (dev->props.fw_ver & 0xffff));
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return sprintf(buf, "AMSO1100\n");
+}
+
+static ssize_t show_board(struct class_device *cdev, char *buf)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID");
+}
+
+static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
+static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL);
+static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL);
+
+static struct class_device_attribute *c2_class_attributes[] = {
+	&class_device_attr_hw_rev,
+	&class_device_attr_fw_ver,
+	&class_device_attr_hca_type,
+	&class_device_attr_board_id
+};
+
+static int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+			int attr_mask)
+{
+	int err;
+
+	err =
+	    c2_qp_modify(to_c2dev(ibqp->device), to_c2qp(ibqp), attr,
+			 attr_mask);
+
+	return err;
+}
+
+static int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static int c2_process_mad(struct ib_device *ibdev,
+			  int mad_flags,
+			  u8 port_num,
+			  struct ib_wc *in_wc,
+			  struct ib_grh *in_grh,
+			  struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static int c2_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	/* Request a connection */
+	return c2_llp_connect(cm_id, iw_param);
+}
+
+static int c2_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	/* Accept the new connection */
+	return c2_llp_accept(cm_id, iw_param);
+}
+
+static int c2_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len)
+{
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	err = c2_llp_reject(cm_id, pdata, pdata_len);
+	return err;
+}
+
+static int c2_service_create(struct iw_cm_id *cm_id, int backlog)
+{
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	err = c2_llp_service_create(cm_id, backlog);
+	pr_debug("%s:%u err=%d\n",
+		__FUNCTION__, __LINE__,
+		err);
+	return err;
+}
+
+static int c2_service_destroy(struct iw_cm_id *cm_id)
+{
+	int err;
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	err = c2_llp_service_destroy(cm_id);
+
+	return err;
+}
+
+static int c2_pseudo_up(struct net_device *netdev)
+{
+	struct in_device *ind;
+	struct c2_dev *c2dev = netdev->priv;
+
+	ind = in_dev_get(netdev);
+	if (!ind)
+		return 0;
+
+	pr_debug("adding...\n");
+	for_ifa(ind) {
+#ifdef DEBUG
+		u8 *ip = (u8 *) & ifa->ifa_address;
+
+		pr_debug("%s: %d.%d.%d.%d\n",
+		       ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]);
+#endif
+		c2_add_addr(c2dev, ifa->ifa_address, ifa->ifa_mask);
+	}
+	endfor_ifa(ind);
+	in_dev_put(ind);
+
+	return 0;
+}
+
+static int c2_pseudo_down(struct net_device *netdev)
+{
+	struct in_device *ind;
+	struct c2_dev *c2dev = netdev->priv;
+
+	ind = in_dev_get(netdev);
+	if (!ind)
+		return 0;
+
+	pr_debug("deleting...\n");
+	for_ifa(ind) {
+#ifdef DEBUG
+		u8 *ip = (u8 *) & ifa->ifa_address;
+
+		pr_debug("%s: %d.%d.%d.%d\n",
+		       ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]);
+#endif
+		c2_del_addr(c2dev, ifa->ifa_address, ifa->ifa_mask);
+	}
+	endfor_ifa(ind);
+	in_dev_put(ind);
+
+	return 0;
+}
+
+static int c2_pseudo_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+	kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int c2_pseudo_change_mtu(struct net_device *netdev, int new_mtu)
+{
+	int ret = 0;
+
+	if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU)
+		return -EINVAL;
+
+	netdev->mtu = new_mtu;
+
+	/* XXX tell rnic about new rmda interface mtu */
+	return ret;
+}
+
+static void setup(struct net_device *netdev)
+{
+	SET_MODULE_OWNER(netdev);
+	netdev->open = c2_pseudo_up;
+	netdev->stop = c2_pseudo_down;
+	netdev->hard_start_xmit = c2_pseudo_xmit_frame;
+	netdev->get_stats = NULL;
+	netdev->tx_timeout = NULL;
+	netdev->set_mac_address = NULL;
+	netdev->change_mtu = c2_pseudo_change_mtu;
+	netdev->watchdog_timeo = 0;
+	netdev->type = ARPHRD_ETHER;
+	netdev->mtu = 1500;
+	netdev->hard_header_len = ETH_HLEN;
+	netdev->addr_len = ETH_ALEN;
+	netdev->tx_queue_len = 0;
+	netdev->flags |= IFF_NOARP;
+	return;
+}
+
+static struct net_device *c2_pseudo_netdev_init(struct c2_dev *c2dev)
+{
+	char name[IFNAMSIZ];
+	struct net_device *netdev;
+
+	/* change ethxxx to iwxxx */
+	strcpy(name, "iw");
+	strcat(name, &c2dev->netdev->name[3]);
+	netdev = alloc_netdev(sizeof(*netdev), name, setup);
+	if (!netdev) {
+		printk(KERN_ERR PFX "%s - etherdev alloc failed",
+			__FUNCTION__);
+		return NULL;
+	}
+
+	netdev->priv = c2dev;
+
+	SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev);
+
+	memcpy_fromio(netdev->dev_addr, c2dev->kva + C2_REGS_RDMA_ENADDR, 6);
+
+	/* Print out the MAC address */
+	pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X\n",
+		netdev->name,
+		netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2],
+		netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5]);
+
+	/* Disable network packets */
+	netif_stop_queue(netdev);
+	return netdev;
+}
+
+int c2_register_device(struct c2_dev *dev)
+{
+	int ret;
+	int i;
+
+	/* Register pseudo network device */
+	dev->pseudo_netdev = c2_pseudo_netdev_init(dev);
+	if (dev->pseudo_netdev) {
+		ret = register_netdev(dev->pseudo_netdev);
+		if (ret) {
+			printk(KERN_ERR PFX
+				"Unable to register netdev, ret = %d\n", ret);
+			free_netdev(dev->pseudo_netdev);
+			return ret;
+		}
+	}
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX);
+	dev->ibdev.owner = THIS_MODULE;
+	dev->ibdev.uverbs_cmd_mask =
+	    (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+	    (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_REG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+	    (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POST_SEND) |
+	    (1ull << IB_USER_VERBS_CMD_POST_RECV);
+
+	dev->ibdev.node_type = RDMA_NODE_RNIC;
+	memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
+	memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6);
+	dev->ibdev.phys_port_cnt = 1;
+	dev->ibdev.dma_device = &dev->pcidev->dev;
+	dev->ibdev.class_dev.dev = &dev->pcidev->dev;
+	dev->ibdev.query_device = c2_query_device;
+	dev->ibdev.query_port = c2_query_port;
+	dev->ibdev.modify_port = c2_modify_port;
+	dev->ibdev.query_pkey = c2_query_pkey;
+	dev->ibdev.query_gid = c2_query_gid;
+	dev->ibdev.alloc_ucontext = c2_alloc_ucontext;
+	dev->ibdev.dealloc_ucontext = c2_dealloc_ucontext;
+	dev->ibdev.mmap = c2_mmap_uar;
+	dev->ibdev.alloc_pd = c2_alloc_pd;
+	dev->ibdev.dealloc_pd = c2_dealloc_pd;
+	dev->ibdev.create_ah = c2_ah_create;
+	dev->ibdev.destroy_ah = c2_ah_destroy;
+	dev->ibdev.create_qp = c2_create_qp;
+	dev->ibdev.modify_qp = c2_modify_qp;
+	dev->ibdev.destroy_qp = c2_destroy_qp;
+	dev->ibdev.create_cq = c2_create_cq;
+	dev->ibdev.destroy_cq = c2_destroy_cq;
+	dev->ibdev.poll_cq = c2_poll_cq;
+	dev->ibdev.get_dma_mr = c2_get_dma_mr;
+	dev->ibdev.reg_phys_mr = c2_reg_phys_mr;
+	dev->ibdev.reg_user_mr = c2_reg_user_mr;
+	dev->ibdev.dereg_mr = c2_dereg_mr;
+
+	dev->ibdev.alloc_fmr = NULL;
+	dev->ibdev.unmap_fmr = NULL;
+	dev->ibdev.dealloc_fmr = NULL;
+	dev->ibdev.map_phys_fmr = NULL;
+
+	dev->ibdev.attach_mcast = c2_multicast_attach;
+	dev->ibdev.detach_mcast = c2_multicast_detach;
+	dev->ibdev.process_mad = c2_process_mad;
+
+	dev->ibdev.req_notify_cq = c2_arm_cq;
+	dev->ibdev.post_send = c2_post_send;
+	dev->ibdev.post_recv = c2_post_receive;
+
+	dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL);
+	dev->ibdev.iwcm->add_ref = c2_add_ref;
+	dev->ibdev.iwcm->rem_ref = c2_rem_ref;
+	dev->ibdev.iwcm->get_qp = c2_get_qp;
+	dev->ibdev.iwcm->connect = c2_connect;
+	dev->ibdev.iwcm->accept = c2_accept;
+	dev->ibdev.iwcm->reject = c2_reject;
+	dev->ibdev.iwcm->create_listen = c2_service_create;
+	dev->ibdev.iwcm->destroy_listen = c2_service_destroy;
+
+	ret = ib_register_device(&dev->ibdev);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) {
+		ret = class_device_create_file(&dev->ibdev.class_dev,
+					       c2_class_attributes[i]);
+		if (ret) {
+			unregister_netdev(dev->pseudo_netdev);
+			free_netdev(dev->pseudo_netdev);
+			ib_unregister_device(&dev->ibdev);
+			return ret;
+		}
+	}
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return 0;
+}
+
+void c2_unregister_device(struct c2_dev *dev)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	unregister_netdev(dev->pseudo_netdev);
+	free_netdev(dev->pseudo_netdev);
+	ib_unregister_device(&dev->ibdev);
+}
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.h b/drivers/infiniband/hw/amso1100/c2_provider.h
new file mode 100644
index 0000000..05c4ab6
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_provider.h
@@ -0,0 +1,182 @@
+/*
+ * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef C2_PROVIDER_H
+#define C2_PROVIDER_H
+#include <linux/inetdevice.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_pack.h>
+
+#include "c2_mq.h"
+#include <rdma/iw_cm.h>
+
+#define C2_MPT_FLAG_ATOMIC (1 << 14)
+#define C2_MPT_FLAG_REMOTE_WRITE (1 << 13)
+#define C2_MPT_FLAG_REMOTE_READ (1 << 12)
+#define C2_MPT_FLAG_LOCAL_WRITE (1 << 11)
+#define C2_MPT_FLAG_LOCAL_READ (1 << 10)
+
+struct c2_buf_list {
+	void *buf;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+};
+
+
+/* The user context keeps track of objects allocated for a
+ * particular user-mode client. */
+struct c2_ucontext {
+	struct ib_ucontext ibucontext;
+};
+
+struct c2_mtt;
+
+/* All objects associated with a PD are kept in the
+ * associated user context if present.
+ */
+struct c2_pd {
+	struct ib_pd ibpd;
+	u32 pd_id;
+	atomic_t sqp_count;
+};
+
+struct c2_mr {
+	struct ib_mr ibmr;
+	struct c2_pd *pd;
+};
+
+struct c2_av;
+
+enum c2_ah_type {
+	C2_AH_ON_HCA,
+	C2_AH_PCI_POOL,
+	C2_AH_KMALLOC
+};
+
+struct c2_ah {
+	struct ib_ah ibah;
+};
+
+struct c2_cq {
+	struct ib_cq ibcq;
+	spinlock_t lock;
+	atomic_t refcount;
+	int cqn;
+	int is_kernel;
+	wait_queue_head_t wait;
+
+	u32 adapter_handle;
+	struct c2_mq mq;
+};
+
+struct c2_wq {
+	spinlock_t lock;
+};
+struct iw_cm_id;
+struct c2_qp {
+	struct ib_qp ibqp;
+	struct iw_cm_id *cm_id;
+	spinlock_t lock;
+	atomic_t refcount;
+	wait_queue_head_t wait;
+	int qpn;
+
+	u32 adapter_handle;
+	u32 send_sgl_depth;
+	u32 recv_sgl_depth;
+	u32 rdma_write_sgl_depth;
+	u8 state;
+
+	struct c2_mq sq_mq;
+	struct c2_mq rq_mq;
+};
+
+struct c2_cr_query_attrs {
+	u32 local_addr;
+	u32 remote_addr;
+	u16 local_port;
+	u16 remote_port;
+};
+
+static inline struct c2_pd *to_c2pd(struct ib_pd *ibpd)
+{
+	return container_of(ibpd, struct c2_pd, ibpd);
+}
+
+static inline struct c2_ucontext *to_c2ucontext(struct ib_ucontext *ibucontext)
+{
+	return container_of(ibucontext, struct c2_ucontext, ibucontext);
+}
+
+static inline struct c2_mr *to_c2mr(struct ib_mr *ibmr)
+{
+	return container_of(ibmr, struct c2_mr, ibmr);
+}
+
+
+static inline struct c2_ah *to_c2ah(struct ib_ah *ibah)
+{
+	return container_of(ibah, struct c2_ah, ibah);
+}
+
+static inline struct c2_cq *to_c2cq(struct ib_cq *ibcq)
+{
+	return container_of(ibcq, struct c2_cq, ibcq);
+}
+
+static inline struct c2_qp *to_c2qp(struct ib_qp *ibqp)
+{
+	return container_of(ibqp, struct c2_qp, ibqp);
+}
+
+static inline int is_rnic_addr(struct net_device *netdev, u32 addr)
+{
+	struct in_device *ind;
+	int ret = 0;
+
+	ind = in_dev_get(netdev);
+	if (!ind)
+		return 0;
+
+	for_ifa(ind) {
+		if (ifa->ifa_address == addr) {
+			ret = 1;
+			break;
+		}
+	}
+	endfor_ifa(ind);
+	in_dev_put(ind);
+	return ret;
+}
+#endif				/* C2_PROVIDER_H */
diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c
new file mode 100644
index 0000000..6071cf0
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -0,0 +1,975 @@
+/*
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * Copyright (c) 2005 Cisco Systems. All rights reserved.
+ * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2004 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include "c2.h"
+#include "c2_vq.h"
+#include "c2_status.h"
+
+#define C2_MAX_ORD_PER_QP 128
+#define C2_MAX_IRD_PER_QP 128
+
+#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count)
+#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16)
+#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF)
+
+#define NO_SUPPORT -1
+static const u8 c2_opcode[] = {
+	[IB_WR_SEND] = C2_WR_TYPE_SEND,
+	[IB_WR_SEND_WITH_IMM] = NO_SUPPORT,
+	[IB_WR_RDMA_WRITE] = C2_WR_TYPE_RDMA_WRITE,
+	[IB_WR_RDMA_WRITE_WITH_IMM] = NO_SUPPORT,
+	[IB_WR_RDMA_READ] = C2_WR_TYPE_RDMA_READ,
+	[IB_WR_ATOMIC_CMP_AND_SWP] = NO_SUPPORT,
+	[IB_WR_ATOMIC_FETCH_AND_ADD] = NO_SUPPORT,
+};
+
+static int to_c2_state(enum ib_qp_state ib_state)
+{
+	switch (ib_state) {
+	case IB_QPS_RESET:
+		return C2_QP_STATE_IDLE;
+	case IB_QPS_RTS:
+		return C2_QP_STATE_RTS;
+	case IB_QPS_SQD:
+		return C2_QP_STATE_CLOSING;
+	case IB_QPS_SQE:
+		return C2_QP_STATE_CLOSING;
+	case IB_QPS_ERR:
+		return C2_QP_STATE_ERROR;
+	default:
+		return -1;
+	}
+}
+
+int to_ib_state(enum c2_qp_state c2_state)
+{
+	switch (c2_state) {
+	case C2_QP_STATE_IDLE:
+		return IB_QPS_RESET;
+	case C2_QP_STATE_CONNECTING:
+		return IB_QPS_RTR;
+	case C2_QP_STATE_RTS:
+		return IB_QPS_RTS;
+	case C2_QP_STATE_CLOSING:
+		return IB_QPS_SQD;
+	case C2_QP_STATE_ERROR:
+		return IB_QPS_ERR;
+	case C2_QP_STATE_TERMINATE:
+		return IB_QPS_SQE;
+	default:
+		return -1;
+	}
+}
+
+const char *to_ib_state_str(int ib_state)
+{
+	static const char *state_str[] = {
+		"IB_QPS_RESET",
+		"IB_QPS_INIT",
+		"IB_QPS_RTR",
+		"IB_QPS_RTS",
+		"IB_QPS_SQD",
+		"IB_QPS_SQE",
+		"IB_QPS_ERR"
+	};
+	if (ib_state < IB_QPS_RESET ||
+	    ib_state > IB_QPS_ERR)
+		return "<invalid IB QP state>";
+
+	ib_state -= IB_QPS_RESET;
+	return state_str[ib_state];
+}
+
+void c2_set_qp_state(struct c2_qp *qp, int c2_state)
+{
+	int new_state = to_ib_state(c2_state);
+
+	pr_debug("%s: qp[%p] state modify %s --> %s\n",
+		__FUNCTION__,
+		qp,
+		to_ib_state_str(qp->state),
+		to_ib_state_str(new_state));
+	qp->state = new_state;
+}
+
+#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF
+
+int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp,
+		 struct ib_qp_attr *attr, int attr_mask)
+{
+	struct c2wr_qp_modify_req wr;
+	struct c2wr_qp_modify_rep *reply;
+	struct c2_vq_req *vq_req;
+	unsigned long flags;
+	u8 next_state;
+	int err;
+
+	pr_debug("%s:%d qp=%p, %s --> %s\n",
+		__FUNCTION__, __LINE__,
+		qp,
+		to_ib_state_str(qp->state),
+		to_ib_state_str(attr->qp_state));
+
+	vq_req = vq_req_alloc(c2dev);
+	if (!vq_req)
+		return -ENOMEM;
+
+	c2_wr_set_id(&wr, CCWR_QP_MODIFY);
+	wr.hdr.context = (unsigned long) vq_req;
+	wr.rnic_handle = c2dev->adapter_handle;
+	wr.qp_handle = qp->adapter_handle;
+	wr.ord = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.ird = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+
+	if (attr_mask & IB_QP_STATE) {
+		/* Ensure the state is valid */
+		if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR)
+			return -EINVAL;
+
+		wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state));
+
+		if (attr->qp_state == IB_QPS_ERR) {
+			spin_lock_irqsave(&qp->lock, flags);
+			if (qp->cm_id && qp->state == IB_QPS_RTS) {
+				pr_debug("Generating CLOSE event for QP-->ERR, "
+					"qp=%p, cm_id=%p\n",qp,qp->cm_id);
+				/* Generate an CLOSE event */
+				vq_req->cm_id = qp->cm_id;
+				vq_req->event = IW_CM_EVENT_CLOSE;
+			}
+			spin_unlock_irqrestore(&qp->lock, flags);
+		}
+		next_state = attr->qp_state;
+
+	} else if (attr_mask & IB_QP_CUR_STATE) {
+
+		if (attr->cur_qp_state != IB_QPS_RTR &&
+		    attr->cur_qp_state != IB_QPS_RTS &&
+		    attr->cur_qp_state != IB_QPS_SQD &&
+		    attr->cur_qp_state != IB_QPS_SQE)
+			return -EINVAL;
+		else
+			wr.next_qp_state =
+			    cpu_to_be32(to_c2_state(attr->cur_qp_state));
+
+		next_state = attr->cur_qp_state;
+
+	} else {
+		err = 0;
+		goto bail0;
+	}
+
+	/* reference the request struct */
+	vq_req_get(c2dev, vq_req);
+
+	err = vq_send_wr(c2dev, (union c2wr *) & wr);
+	if (err) {
+		vq_req_put(c2dev, vq_req);
+		goto bail0;
+	}
+
+	err = vq_wait_for_reply(c2dev, vq_req);
+	if (err)
+		goto bail0;
+
+	reply = (struct c2wr_qp_modify_rep *) (unsigned long) vq_req->reply_msg;
+	if (!reply) {
+		err = -ENOMEM;
+		goto bail0;
+	}
+
+	err = c2_errno(reply);
+	if (!err)
+		qp->state = next_state;
+#ifdef DEBUG
+	else
+		pr_debug("%s: c2_errno=%d\n", __FUNCTION__, err);
+#endif
+	/*
+	 * If we're going to error and generating the event here, then
+	 * we need to remove the reference because there will be no
+	 * close event generated by the adapter
+	 */
+	spin_lock_irqsave(&qp->lock, flags);
+	if (vq_req->event==IW_CM_EVENT_CLOSE && qp->cm_id) {
+		qp->cm_id->rem_ref(qp->cm_id);
+		qp->cm_id = NULL;
+	}
+	spin_unlock_irqrestore(&qp->lock, flags);
+
+	vq_repbuf_free(c2dev, reply);
+      bail0:
+	vq_req_free(c2dev, vq_req);
+
+	pr_debug("%s:%d qp=%p, cur_state=%s\n",
+		__FUNCTION__, __LINE__,
+		qp,
+		to_ib_state_str(qp->state));
+	return err;
+}
+
+int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp,
+			  int ord, int ird)
+{
+	struct c2wr_qp_modify_req wr;
+	struct c2wr_qp_modify_rep *reply;
+	struct c2_vq_req *vq_req;
+	int err;
+
+	vq_req = vq_req_alloc(c2dev);
+	if (!vq_req)
+		return -ENOMEM;
+
+	c2_wr_set_id(&wr, CCWR_QP_MODIFY);
+	wr.hdr.context = (unsigned long) vq_req;
+	wr.rnic_handle = c2dev->adapter_handle;
+	wr.qp_handle = qp->adapter_handle;
+	wr.ord = cpu_to_be32(ord);
+	wr.ird = cpu_to_be32(ird);
+	wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.next_qp_state = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+
+	/* reference the request struct */
+	vq_req_get(c2dev, vq_req);
+
+	err = vq_send_wr(c2dev, (union c2wr *) & wr);
+	if (err) {
+		vq_req_put(c2dev, vq_req);
+		goto bail0;
+	}
+
+	err = vq_wait_for_reply(c2dev, vq_req);
+	if (err)
+		goto bail0;
+
+	reply = (struct c2wr_qp_modify_rep *) (unsigned long)
+		vq_req->reply_msg;
+	if (!reply) {
+		err = -ENOMEM;
+		goto bail0;
+	}
+
+	err = c2_errno(reply);
+	vq_repbuf_free(c2dev, reply);
+      bail0:
+	vq_req_free(c2dev, vq_req);
+	return err;
+}
+
+static int destroy_qp(struct c2_dev *c2dev, struct c2_qp *qp)
+{
+	struct c2_vq_req *vq_req;
+	struct c2wr_qp_destroy_req wr;
+	struct c2wr_qp_destroy_rep *reply;
+	unsigned long flags;
+	int err;
+
+	/*
+	 * Allocate a verb request message
+	 */
+	vq_req = vq_req_alloc(c2dev);
+	if (!vq_req) {
+		return -ENOMEM;
+	}
+
+	/*
+	 * Initialize the WR
+	 */
+	c2_wr_set_id(&wr, CCWR_QP_DESTROY);
+	wr.hdr.context = (unsigned long) vq_req;
+	wr.rnic_handle = c2dev->adapter_handle;
+	wr.qp_handle = qp->adapter_handle;
+
+	/*
+	 * reference the request struct. dereferenced in the int handler.
+	 */
+	vq_req_get(c2dev, vq_req);
+
+	spin_lock_irqsave(&qp->lock, flags);
+	if (qp->cm_id && qp->state == IB_QPS_RTS) {
+		pr_debug("destroy_qp: generating CLOSE event for QP-->ERR, "
+			"qp=%p, cm_id=%p\n",qp,qp->cm_id);
+		/* Generate an CLOSE event */
+		vq_req->qp = qp;
+		vq_req->cm_id = qp->cm_id;
+		vq_req->event = IW_CM_EVENT_CLOSE;
+	}
+	spin_unlock_irqrestore(&qp->lock, flags);
+
+	/*
+	 * Send WR to adapter
+	 */
+	err = vq_send_wr(c2dev, (union c2wr *) & wr);
+	if (err) {
+		vq_req_put(c2dev, vq_req);
+		goto bail0;
+	}
+
+	/*
+	 * Wait for reply from adapter
+	 */
+	err = vq_wait_for_reply(c2dev, vq_req);
+	if (err) {
+		goto bail0;
+	}
+
+	/*
+	 * Process reply
+	 */
+	reply = (struct c2wr_qp_destroy_rep *) (unsigned long) (vq_req->reply_msg);
+	if (!reply) {
+		err = -ENOMEM;
+		goto bail0;
+	}
+
+	spin_lock_irqsave(&qp->lock, flags);
+	if (qp->cm_id) {
+		qp->cm_id->rem_ref(qp->cm_id);
+		qp->cm_id = NULL;
+	}
+	spin_unlock_irqrestore(&qp->lock, flags);
+
+	vq_repbuf_free(c2dev, reply);
+      bail0:
+	vq_req_free(c2dev, vq_req);
+	return err;
+}
+
+int c2_alloc_qp(struct c2_dev *c2dev,
+		struct c2_pd *pd,
+		struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp)
+{
+	struct c2wr_qp_create_req wr;
+	struct c2wr_qp_create_rep *reply;
+	struct c2_vq_req *vq_req;
+	struct c2_cq *send_cq = 
to_c2cq(qp_attrs->send_cq); + struct c2_cq *recv_cq = to_c2cq(qp_attrs->recv_cq); + unsigned long peer_pa; + u32 q_size, msg_size, mmap_size; + void __iomem *mmap; + int err; + + qp->qpn = c2_alloc(&c2dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + qp->ibqp.qp_num = qp->qpn; + qp->ibqp.qp_type = IB_QPT_RC; + + /* Allocate the SQ and RQ shared pointers */ + qp->sq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!qp->sq_mq.shared) { + err = -ENOMEM; + goto bail0; + } + + qp->rq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!qp->rq_mq.shared) { + err = -ENOMEM; + goto bail1; + } + + /* Allocate the verbs request */ + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + err = -ENOMEM; + goto bail2; + } + + /* Initialize the work request */ + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_QP_CREATE); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.sq_cq_handle = send_cq->adapter_handle; + wr.rq_cq_handle = recv_cq->adapter_handle; + wr.sq_depth = cpu_to_be32(qp_attrs->cap.max_send_wr + 1); + wr.rq_depth = cpu_to_be32(qp_attrs->cap.max_recv_wr + 1); + wr.srq_handle = 0; + wr.flags = cpu_to_be32(QP_RDMA_READ | QP_RDMA_WRITE | QP_MW_BIND | + QP_ZERO_STAG | QP_RDMA_READ_RESPONSE); + wr.send_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + wr.recv_sgl_depth = cpu_to_be32(qp_attrs->cap.max_recv_sge); + wr.rdma_write_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + // XXX no write depth? 
+ wr.shared_sq_ht = cpu_to_be64(__pa(qp->sq_mq.shared)); + wr.shared_rq_ht = cpu_to_be64(__pa(qp->rq_mq.shared)); + wr.ord = cpu_to_be32(C2_MAX_ORD_PER_QP); + wr.ird = cpu_to_be32(C2_MAX_IRD_PER_QP); + wr.pd_id = pd->pd_id; + wr.user_context = (unsigned long) qp; + + vq_req_get(c2dev, vq_req); + + /* Send the WR to the adapter */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail3; + } + + /* Wait for the verb reply */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail3; + } + + /* Process the reply */ + reply = (struct c2wr_qp_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail3; + } + + if ((err = c2_wr_get_result(reply)) != 0) { + goto bail4; + } + + /* Fill in the kernel QP struct */ + atomic_set(&qp->refcount, 1); + qp->adapter_handle = reply->qp_handle; + qp->state = IB_QPS_RESET; + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; + qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; + + /* Initialize the SQ MQ */ + q_size = be32_to_cpu(reply->sq_depth); + msg_size = be32_to_cpu(reply->sq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->sq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail5; + } + + c2_mq_req_init(&qp->sq_mq, + be32_to_cpu(reply->sq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + /* Initialize the RQ mq */ + q_size = be32_to_cpu(reply->rq_depth); + msg_size = be32_to_cpu(reply->rq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->rq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail6; + } + + c2_mq_req_init(&qp->rq_mq, + 
be32_to_cpu(reply->rq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + spin_lock_irq(&c2dev->qp_table.lock); + c2_array_set(&c2dev->qp_table.qp, qp->qpn & (c2dev->props.max_qp - 1), qp); + c2dev->qp_table.map[qp->qpn] = qp; + spin_unlock_irq(&c2dev->qp_table.lock); + + return 0; + + bail6: + iounmap(qp->sq_mq.peer); + bail5: + destroy_qp(c2dev, qp); + bail4: + vq_repbuf_free(c2dev, reply); + bail3: + vq_req_free(c2dev, vq_req); + bail2: + c2_free_mqsp(qp->rq_mq.shared); + bail1: + c2_free_mqsp(qp->sq_mq.shared); + bail0: + c2_free(&c2dev->qp_table.alloc, qp->qpn); + return err; +} + +void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp) +{ + struct c2_cq *send_cq; + struct c2_cq *recv_cq; + + send_cq = to_c2cq(qp->ibqp.send_cq); + recv_cq = to_c2cq(qp->ibqp.recv_cq); + + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. + */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + spin_lock(&c2dev->qp_table.lock); + c2_array_clear(&c2dev->qp_table.qp, qp->qpn & (c2dev->props.max_qp - 1)); + c2dev->qp_table.map[qp->qpn] = NULL; + spin_unlock(&c2dev->qp_table.lock); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + + /* + * Destroy the qp in the rnic... + */ + destroy_qp(c2dev, qp); + + /* + * Mark any unreaped CQEs as null and void. + */ + c2_cq_clean(c2dev, qp, send_cq->cqn); + if (send_cq != recv_cq) + c2_cq_clean(c2dev, qp, recv_cq->cqn); + /* + * Unmap the MQs and return the shared pointers + * to the message pool. 
+ */ + iounmap(qp->sq_mq.peer); + iounmap(qp->rq_mq.peer); + c2_free_mqsp(qp->sq_mq.shared); + c2_free_mqsp(qp->rq_mq.shared); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + c2_free(&c2dev->qp_table.alloc, qp->qpn); +} + +/* + * Function: move_sgl + * + * Description: + * Move an SGL from the user's work request struct into a CCIL Work Request + * message, swapping to WR byte order and ensuring the total length doesn't + * overflow. + * + * IN: + * dst - ptr to CCIL Work Request message SGL memory. + * src - ptr to the consumer's SGL memory. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int +move_sgl(struct c2_data_addr * dst, struct ib_sge *src, int count, u32 * p_len, + u8 * actual_count) +{ + u32 tot = 0; /* running total */ + u8 acount = 0; /* running total non-0 len sge's */ + + while (count > 0) { + /* + * If the addition of this SGE causes the + * total SGL length to exceed 2^32-1, then + * fail-n-bail. + * + * If the current total plus the next element length + * wraps, then it will go negative and be less than the + * current total... + */ + if ((tot + src->length) < tot) { + return -EINVAL; + } + /* + * Bug: 1456 (as well as 1498 & 1643) + * Skip over any sge's supplied with len=0 + */ + if (src->length) { + tot += src->length; + dst->stag = cpu_to_be32(src->lkey); + dst->to = cpu_to_be64(src->addr); + dst->length = cpu_to_be32(src->length); + dst++; + acount++; + } + src++; + count--; + } + + if (acount == 0) { + /* + * Bug: 1476 (as well as 1498, 1456 and 1643) + * Setup the SGL in the WR to make it easier for the RNIC. + * This way, the FW doesn't have to deal with special cases. + * Setting length=0 should be sufficient. + */ + dst->stag = 0; + dst->to = 0; + dst->length = 0; + } + + *p_len = tot; + *actual_count = acount; + return 0; +} + +/* + * Function: c2_activity (private function) + * + * Description: + * Post an mq index to the host->adapter activity fifo. 
+ * + * IN: + * c2dev - ptr to c2dev structure + * mq_index - mq index to post + * shared - value most recently written to shared + * + * OUT: + * + * Return: + * none + */ +static inline void c2_activity(struct c2_dev *c2dev, u32 mq_index, u16 shared) +{ + /* + * First read the register to see if the FIFO is full, and if so, + * spin until it's not. This isn't perfect -- there is no + * synchronization among the clients of the register, but in + * practice it prevents multiple CPU from hammering the bus + * with PCI RETRY. Note that when this does happen, the card + * cannot get on the bus and the card and system hang in a + * deadlock -- thus the need for this code. [TOT] + */ + while (readl(c2dev->regs + PCI_BAR0_ADAPTER_HINT) & 0x80000000) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(0); + } + + __raw_writel(C2_HINT_MAKE(mq_index, shared), + c2dev->regs + PCI_BAR0_ADAPTER_HINT); +} + +/* + * Function: qp_wr_post + * + * Description: + * This in-line function allocates a MQ msg, then moves the host-copy of + * the completed WR into msg. Then it posts the message. + * + * IN: + * q - ptr to user MQ. + * wr - ptr to host-copy of the WR. + * qp - ptr to user qp + * size - Number of bytes to post. Assumed to be divisible by 4. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int qp_wr_post(struct c2_mq *q, union c2wr * wr, struct c2_qp *qp, u32 size) +{ + union c2wr *msg; + + msg = c2_mq_alloc(q); + if (msg == NULL) { + return -EINVAL; + } +#ifdef CCMSGMAGIC + ((c2wr_hdr_t *) wr)->magic = cpu_to_be32(CCWR_MAGIC); +#endif + + /* + * Since all header fields in the WR are the same as the + * CQE, set the following so the adapter need not. 
+ */ + c2_wr_set_result(wr, CCERR_PENDING); + + /* + * Copy the wr down to the adapter + */ + memcpy((void *) msg, (void *) wr, size); + + c2_mq_produce(q); + return 0; +} + + +int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + u32 flags; + u32 tot_len; + u8 actual_sge_count; + u32 msg_size; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + while (ib_wr) { + + flags = 0; + wr.sqwr.sq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + if (ib_wr->send_flags & IB_SEND_SIGNALED) { + flags |= SQ_SIGNALED; + } + + switch (ib_wr->opcode) { + case IB_WR_SEND: + if (ib_wr->send_flags & IB_SEND_SOLICITED) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); + msg_size = sizeof(struct c2wr_send_req); + } else { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND); + msg_size = sizeof(struct c2wr_send_req); + } + + wr.sqwr.send.remote_stag = 0; + msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; + if (ib_wr->num_sge > qp->send_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + err = move_sgl((struct c2_data_addr *) & (wr.sqwr.send.data), + ib_wr->sg_list, + ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.send.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_WRITE: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_WRITE); + msg_size = sizeof(struct c2wr_rdma_write_req) + + (sizeof(struct c2_data_addr) * ib_wr->num_sge); + if (ib_wr->num_sge > qp->rdma_write_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + wr.sqwr.rdma_write.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_write.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + err = move_sgl((struct c2_data_addr *) + & (wr.sqwr.rdma_write.data), + ib_wr->sg_list, + 
ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.rdma_write.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_READ: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_READ); + msg_size = sizeof(struct c2wr_rdma_read_req); + + /* iWARP only supports 1 sge for RDMA reads */ + if (ib_wr->num_sge > 1) { + err = -EINVAL; + break; + } + + /* + * Move the local and remote stag/to/len into the WR. + */ + wr.sqwr.rdma_read.local_stag = + cpu_to_be32(ib_wr->sg_list->lkey); + wr.sqwr.rdma_read.local_to = + cpu_to_be64(ib_wr->sg_list->addr); + wr.sqwr.rdma_read.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_read.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + wr.sqwr.rdma_read.length = + cpu_to_be32(ib_wr->sg_list->length); + break; + default: + /* error */ + msg_size = 0; + err = -EINVAL; + break; + } + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... + */ + if (err) { + break; + } + + /* + * Store flags + */ + c2_wr_set_flags(&wr, flags); + + /* + * Post the puppy! + */ + err = qp_wr_post(&qp->sq_mq, &wr, qp, msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO. 
+ */ + c2_activity(c2dev, qp->sq_mq.index, qp->sq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + /* + * Try and post each work request + */ + while (ib_wr) { + u32 tot_len; + u8 actual_sge_count; + + if (ib_wr->num_sge > qp->recv_sgl_depth) { + err = -EINVAL; + break; + } + + /* + * Create local host-copy of the WR + */ + wr.rqwr.rq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + c2_wr_set_id(&wr, CCWR_RECV); + c2_wr_set_flags(&wr, 0); + + /* sge_count is limited to eight bits. */ + BUG_ON(ib_wr->num_sge >= 256); + err = move_sgl((struct c2_data_addr *) & (wr.rqwr.data), + ib_wr->sg_list, + ib_wr->num_sge, &tot_len, &actual_sge_count); + c2_wr_set_sge_count(&wr, actual_sge_count); + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... 
+ */ + if (err) { + break; + } + + err = qp_wr_post(&qp->rq_mq, &wr, qp, qp->rq_mq.msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO + */ + c2_activity(c2dev, qp->rq_mq.index, qp->rq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +int __devinit c2_init_qp_table(struct c2_dev *c2dev) +{ + int err; + + spin_lock_init(&c2dev->qp_table.lock); + + err = c2_alloc_init(&c2dev->qp_table.alloc, + c2dev->props.max_qp, 1); + if (err) + return err; + + err = c2_array_init(&c2dev->qp_table.qp, c2dev->props.max_qp); + if (err) { + c2_alloc_cleanup(&c2dev->qp_table.alloc); + return err; + } + + c2dev->qp_table.map = vmalloc(sizeof(struct c2_qp *) * c2dev->props.max_qp); + if (!c2dev->qp_table.map) { + pr_debug("Could not allocate QPN <-> QP map\n"); + c2_alloc_cleanup(&c2dev->qp_table.alloc); + c2_array_cleanup(&c2dev->qp_table.qp, c2dev->props.max_qp); + return -ENOMEM; + } + + return 0; +} + +void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev) +{ + c2_alloc_cleanup(&c2dev->qp_table.alloc); + c2_array_cleanup(&c2dev->qp_table.qp, c2dev->props.max_qp); +} diff --git a/drivers/infiniband/hw/amso1100/c2_user.h b/drivers/infiniband/hw/amso1100/c2_user.h new file mode 100644 index 0000000..7e9e7ad --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_user.h @@ -0,0 +1,82 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef C2_USER_H +#define C2_USER_H + +#include + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. 
+ */ + +struct c2_alloc_ucontext_resp { + __u32 qp_tab_size; + __u32 uarc_size; +}; + +struct c2_alloc_pd_resp { + __u32 pdn; + __u32 reserved; +}; + +struct c2_create_cq { + __u32 lkey; + __u32 pdn; + __u64 arm_db_page; + __u64 set_db_page; + __u32 arm_db_index; + __u32 set_db_index; +}; + +struct c2_create_cq_resp { + __u32 cqn; + __u32 reserved; +}; + +struct c2_create_qp { + __u32 lkey; + __u32 reserved; + __u64 sq_db_page; + __u64 rq_db_page; + __u32 sq_db_index; + __u32 rq_db_index; +}; + +#endif /* C2_USER_H */ From swise at opengridcomputing.com Wed Jun 7 13:06:51 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:51 -0500 Subject: [openib-general] [PATCH v2 2/7] AMSO1100 WR / Event Definitions. In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200651.9259.73654.stgit@stevo-desktop> Review Changes: C2_DEBUG -> DEBUG --- drivers/infiniband/hw/amso1100/c2_ae.h | 108 ++ drivers/infiniband/hw/amso1100/c2_status.h | 158 +++ drivers/infiniband/hw/amso1100/c2_wr.h | 1523 ++++++++++++++++++++++++++++ 3 files changed, 1789 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_ae.h b/drivers/infiniband/hw/amso1100/c2_ae.h new file mode 100644 index 0000000..3a065c3 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.h @@ -0,0 +1,108 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_AE_H_ +#define _C2_AE_H_ + +/* + * WARNING: If you change this file, also bump C2_IVN_BASE + * in common/include/clustercore/c2_ivn.h. + */ + +/* + * Asynchronous Event Identifiers + * + * These start at 0x80 only so it's obvious from inspection that + * they are not work-request statuses. This isn't critical. + * + * NOTE: these event id's must fit in eight bits. 
+ */ +enum c2_event_id { + CCAE_REMOTE_SHUTDOWN = 0x80, + CCAE_ACTIVE_CONNECT_RESULTS, + CCAE_CONNECTION_REQUEST, + CCAE_LLP_CLOSE_COMPLETE, + CCAE_TERMINATE_MESSAGE_RECEIVED, + CCAE_LLP_CONNECTION_RESET, + CCAE_LLP_CONNECTION_LOST, + CCAE_LLP_SEGMENT_SIZE_INVALID, + CCAE_LLP_INVALID_CRC, + CCAE_LLP_BAD_FPDU, + CCAE_INVALID_DDP_VERSION, + CCAE_INVALID_RDMA_VERSION, + CCAE_UNEXPECTED_OPCODE, + CCAE_INVALID_DDP_QUEUE_NUMBER, + CCAE_RDMA_READ_NOT_ENABLED, + CCAE_RDMA_WRITE_NOT_ENABLED, + CCAE_RDMA_READ_TOO_SMALL, + CCAE_NO_L_BIT, + CCAE_TAGGED_INVALID_STAG, + CCAE_TAGGED_BASE_BOUNDS_VIOLATION, + CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION, + CCAE_TAGGED_INVALID_PD, + CCAE_WRAP_ERROR, + CCAE_BAD_CLOSE, + CCAE_BAD_LLP_CLOSE, + CCAE_INVALID_MSN_RANGE, + CCAE_INVALID_MSN_GAP, + CCAE_IRRQ_OVERFLOW, + CCAE_IRRQ_MSN_GAP, + CCAE_IRRQ_MSN_RANGE, + CCAE_IRRQ_INVALID_STAG, + CCAE_IRRQ_BASE_BOUNDS_VIOLATION, + CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION, + CCAE_IRRQ_INVALID_PD, + CCAE_IRRQ_WRAP_ERROR, + CCAE_CQ_SQ_COMPLETION_OVERFLOW, + CCAE_CQ_RQ_COMPLETION_ERROR, + CCAE_QP_SRQ_WQE_ERROR, + CCAE_QP_LOCAL_CATASTROPHIC_ERROR, + CCAE_CQ_OVERFLOW, + CCAE_CQ_OPERATION_ERROR, + CCAE_SRQ_LIMIT_REACHED, + CCAE_QP_RQ_LIMIT_REACHED, + CCAE_SRQ_CATASTROPHIC_ERROR, + CCAE_RNIC_CATASTROPHIC_ERROR +/* WARNING If you add more id's, make sure their values fit in eight bits. */ +}; + +/* + * Resource Indicators and Identifiers + */ +enum c2_resource_indicator { + C2_RES_IND_QP = 1, + C2_RES_IND_EP, + C2_RES_IND_CQ, + C2_RES_IND_SRQ, +}; + +#endif /* _C2_AE_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_status.h b/drivers/infiniband/hw/amso1100/c2_status.h new file mode 100644 index 0000000..6ee4aa9 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_status.h @@ -0,0 +1,158 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef _C2_STATUS_H_ +#define _C2_STATUS_H_ + +/* + * Verbs Status Codes + */ +enum c2_status { + C2_OK = 0, /* This must be zero */ + CCERR_INSUFFICIENT_RESOURCES = 1, + CCERR_INVALID_MODIFIER = 2, + CCERR_INVALID_MODE = 3, + CCERR_IN_USE = 4, + CCERR_INVALID_RNIC = 5, + CCERR_INTERRUPTED_OPERATION = 6, + CCERR_INVALID_EH = 7, + CCERR_INVALID_CQ = 8, + CCERR_CQ_EMPTY = 9, + CCERR_NOT_IMPLEMENTED = 10, + CCERR_CQ_DEPTH_TOO_SMALL = 11, + CCERR_PD_IN_USE = 12, + CCERR_INVALID_PD = 13, + CCERR_INVALID_SRQ = 14, + CCERR_INVALID_ADDRESS = 15, + CCERR_INVALID_NETMASK = 16, + CCERR_INVALID_QP = 17, + CCERR_INVALID_QP_STATE = 18, + CCERR_TOO_MANY_WRS_POSTED = 19, + CCERR_INVALID_WR_TYPE = 20, + CCERR_INVALID_SGL_LENGTH = 21, + CCERR_INVALID_SQ_DEPTH = 22, + CCERR_INVALID_RQ_DEPTH = 23, + CCERR_INVALID_ORD = 24, + CCERR_INVALID_IRD = 25, + CCERR_QP_ATTR_CANNOT_CHANGE = 26, + CCERR_INVALID_STAG = 27, + CCERR_QP_IN_USE = 28, + CCERR_OUTSTANDING_WRS = 29, + CCERR_STAG_IN_USE = 30, + CCERR_INVALID_STAG_INDEX = 31, + CCERR_INVALID_SGL_FORMAT = 32, + CCERR_ADAPTER_TIMEOUT = 33, + CCERR_INVALID_CQ_DEPTH = 34, + CCERR_INVALID_PRIVATE_DATA_LENGTH = 35, + CCERR_INVALID_EP = 36, + CCERR_MR_IN_USE = CCERR_STAG_IN_USE, + CCERR_FLUSHED = 38, + CCERR_INVALID_WQE = 39, + CCERR_LOCAL_QP_CATASTROPHIC_ERROR = 40, + CCERR_REMOTE_TERMINATION_ERROR = 41, + CCERR_BASE_AND_BOUNDS_VIOLATION = 42, + CCERR_ACCESS_VIOLATION = 43, + CCERR_INVALID_PD_ID = 44, + CCERR_WRAP_ERROR = 45, + CCERR_INV_STAG_ACCESS_ERROR = 46, + CCERR_ZERO_RDMA_READ_RESOURCES = 47, + CCERR_QP_NOT_PRIVILEGED = 48, + CCERR_STAG_STATE_NOT_INVALID = 49, + CCERR_INVALID_PAGE_SIZE = 50, + CCERR_INVALID_BUFFER_SIZE = 51, + CCERR_INVALID_PBE = 52, + CCERR_INVALID_FBO = 53, + CCERR_INVALID_LENGTH = 54, + CCERR_INVALID_ACCESS_RIGHTS = 55, + CCERR_PBL_TOO_BIG = 56, + CCERR_INVALID_VA = 57, + CCERR_INVALID_REGION = 58, + CCERR_INVALID_WINDOW = 59, + CCERR_TOTAL_LENGTH_TOO_BIG = 60, + CCERR_INVALID_QP_ID = 61, + CCERR_ADDR_IN_USE = 
62, + CCERR_ADDR_NOT_AVAIL = 63, + CCERR_NET_DOWN = 64, + CCERR_NET_UNREACHABLE = 65, + CCERR_CONN_ABORTED = 66, + CCERR_CONN_RESET = 67, + CCERR_NO_BUFS = 68, + CCERR_CONN_TIMEDOUT = 69, + CCERR_CONN_REFUSED = 70, + CCERR_HOST_UNREACHABLE = 71, + CCERR_INVALID_SEND_SGL_DEPTH = 72, + CCERR_INVALID_RECV_SGL_DEPTH = 73, + CCERR_INVALID_RDMA_WRITE_SGL_DEPTH = 74, + CCERR_INSUFFICIENT_PRIVILEGES = 75, + CCERR_STACK_ERROR = 76, + CCERR_INVALID_VERSION = 77, + CCERR_INVALID_MTU = 78, + CCERR_INVALID_IMAGE = 79, + CCERR_PENDING = 98, /* not an error; used internally by adapter */ + CCERR_DEFER = 99, /* not an error; used internally by adapter */ + CCERR_FAILED_WRITE = 100, + CCERR_FAILED_ERASE = 101, + CCERR_FAILED_VERIFICATION = 102, + CCERR_NOT_FOUND = 103, + +}; + +/* + * CCAE_ACTIVE_CONNECT_RESULTS status result codes. + */ +enum c2_connect_status { + C2_CONN_STATUS_SUCCESS = C2_OK, + C2_CONN_STATUS_NO_MEM = CCERR_INSUFFICIENT_RESOURCES, + C2_CONN_STATUS_TIMEDOUT = CCERR_CONN_TIMEDOUT, + C2_CONN_STATUS_REFUSED = CCERR_CONN_REFUSED, + C2_CONN_STATUS_NETUNREACH = CCERR_NET_UNREACHABLE, + C2_CONN_STATUS_HOSTUNREACH = CCERR_HOST_UNREACHABLE, + C2_CONN_STATUS_INVALID_RNIC = CCERR_INVALID_RNIC, + C2_CONN_STATUS_INVALID_QP = CCERR_INVALID_QP, + C2_CONN_STATUS_INVALID_QP_STATE = CCERR_INVALID_QP_STATE, + C2_CONN_STATUS_REJECTED = CCERR_CONN_RESET, + C2_CONN_STATUS_ADDR_NOT_AVAIL = CCERR_ADDR_NOT_AVAIL, +}; + +/* + * Flash programming status codes. + */ +enum c2_flash_status { + C2_FLASH_STATUS_SUCCESS = 0x0000, + C2_FLASH_STATUS_VERIFY_ERR = 0x0002, + C2_FLASH_STATUS_IMAGE_ERR = 0x0004, + C2_FLASH_STATUS_ECLBS = 0x0400, + C2_FLASH_STATUS_PSLBS = 0x0800, + C2_FLASH_STATUS_VPENS = 0x1000, +}; + +#endif /* _C2_STATUS_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_wr.h b/drivers/infiniband/hw/amso1100/c2_wr.h new file mode 100644 index 0000000..9d6468d --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_wr.h @@ -0,0 +1,1523 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. 
All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_WR_H_ +#define _C2_WR_H_ + +#ifdef CCDEBUG +#define CCWR_MAGIC 0xb07700b0 +#endif + +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +/* Maximum allowed size in bytes of private_data exchange + * on connect. + */ +#define C2_MAX_PRIVATE_DATA_SIZE 200 + +/* + * These types are shared among the adapter, host, and CCIL consumer. 
+ */ +enum c2_cq_notification_type { + C2_CQ_NOTIFICATION_TYPE_NONE = 1, + C2_CQ_NOTIFICATION_TYPE_NEXT, + C2_CQ_NOTIFICATION_TYPE_NEXT_SE +}; + +enum c2_setconfig_cmd { + C2_CFG_ADD_ADDR = 1, + C2_CFG_DEL_ADDR = 2, + C2_CFG_ADD_ROUTE = 3, + C2_CFG_DEL_ROUTE = 4 +}; + +enum c2_getconfig_cmd { + C2_GETCONFIG_ROUTES = 1, + C2_GETCONFIG_ADDRS +}; + +/* + * CCIL Work Request Identifiers + */ +enum c2wr_ids { + CCWR_RNIC_OPEN = 1, + CCWR_RNIC_QUERY, + CCWR_RNIC_SETCONFIG, + CCWR_RNIC_GETCONFIG, + CCWR_RNIC_CLOSE, + CCWR_CQ_CREATE, + CCWR_CQ_QUERY, + CCWR_CQ_MODIFY, + CCWR_CQ_DESTROY, + CCWR_QP_CONNECT, + CCWR_PD_ALLOC, + CCWR_PD_DEALLOC, + CCWR_SRQ_CREATE, + CCWR_SRQ_QUERY, + CCWR_SRQ_MODIFY, + CCWR_SRQ_DESTROY, + CCWR_QP_CREATE, + CCWR_QP_QUERY, + CCWR_QP_MODIFY, + CCWR_QP_DESTROY, + CCWR_NSMR_STAG_ALLOC, + CCWR_NSMR_REGISTER, + CCWR_NSMR_PBL, + CCWR_STAG_DEALLOC, + CCWR_NSMR_REREGISTER, + CCWR_SMR_REGISTER, + CCWR_MR_QUERY, + CCWR_MW_ALLOC, + CCWR_MW_QUERY, + CCWR_EP_CREATE, + CCWR_EP_GETOPT, + CCWR_EP_SETOPT, + CCWR_EP_DESTROY, + CCWR_EP_BIND, + CCWR_EP_CONNECT, + CCWR_EP_LISTEN, + CCWR_EP_SHUTDOWN, + CCWR_EP_LISTEN_CREATE, + CCWR_EP_LISTEN_DESTROY, + CCWR_EP_QUERY, + CCWR_CR_ACCEPT, + CCWR_CR_REJECT, + CCWR_CONSOLE, + CCWR_TERM, + CCWR_FLASH_INIT, + CCWR_FLASH, + CCWR_BUF_ALLOC, + CCWR_BUF_FREE, + CCWR_FLASH_WRITE, + CCWR_INIT, /* WARNING: Don't move this ever again! */ + + + + /* Add new IDs here */ + + + + /* + * WARNING: CCWR_LAST must always be the last verbs id defined! + * All the preceding IDs are fixed, and must not change. + * You can add new IDs, but must not remove or reorder + * any IDs. If you do, YOU will ruin any hope of + * compatibility between versions. + */ + CCWR_LAST, + + /* + * Start over at 1 so that arrays indexed by user wr id's + * begin at 1. This is OK since the verbs and user wr id's + * are always used on disjoint sets of queues. 
+ */ + /* + * The order of the CCWR_SEND_XX verbs must + * match the order of the RDMA_OPs + */ + CCWR_SEND = 1, + CCWR_SEND_INV, + CCWR_SEND_SE, + CCWR_SEND_SE_INV, + CCWR_RDMA_WRITE, + CCWR_RDMA_READ, + CCWR_RDMA_READ_INV, + CCWR_MW_BIND, + CCWR_NSMR_FASTREG, + CCWR_STAG_INVALIDATE, + CCWR_RECV, + CCWR_NOP, + CCWR_UNIMPL, +/* WARNING: This must always be the last user wr id defined! */ +}; +#define RDMA_SEND_OPCODE_FROM_WR_ID(x) (x+2) + +/* + * SQ/RQ Work Request Types + */ +enum c2_wr_type { + C2_WR_TYPE_SEND = CCWR_SEND, + C2_WR_TYPE_SEND_SE = CCWR_SEND_SE, + C2_WR_TYPE_SEND_INV = CCWR_SEND_INV, + C2_WR_TYPE_SEND_SE_INV = CCWR_SEND_SE_INV, + C2_WR_TYPE_RDMA_WRITE = CCWR_RDMA_WRITE, + C2_WR_TYPE_RDMA_READ = CCWR_RDMA_READ, + C2_WR_TYPE_RDMA_READ_INV_STAG = CCWR_RDMA_READ_INV, + C2_WR_TYPE_BIND_MW = CCWR_MW_BIND, + C2_WR_TYPE_FASTREG_NSMR = CCWR_NSMR_FASTREG, + C2_WR_TYPE_INV_STAG = CCWR_STAG_INVALIDATE, + C2_WR_TYPE_RECV = CCWR_RECV, + C2_WR_TYPE_NOP = CCWR_NOP, +}; + +struct c2_netaddr { + u32 ip_addr; + u32 netmask; + u32 mtu; +}; + +struct c2_route { + u32 ip_addr; /* 0 indicates the default route */ + u32 netmask; /* netmask associated with dst */ + u32 flags; + union { + u32 ipaddr; /* address of the nexthop interface */ + u8 enaddr[6]; + } nexthop; +}; + +/* + * A Scatter Gather Entry. + */ +struct c2_data_addr { + u32 stag; + u32 length; + u64 to; +}; + +/* + * MR and MW flags used by the consumer, RI, and RNIC. + */ +enum c2_mm_flags { + MEM_REMOTE = 0x0001, /* allow mw binds with remote access. 
*/ + MEM_VA_BASED = 0x0002, /* Not Zero-based */ + MEM_PBL_COMPLETE = 0x0004, /* PBL array is complete in this msg */ + MEM_LOCAL_READ = 0x0008, /* allow local reads */ + MEM_LOCAL_WRITE = 0x0010, /* allow local writes */ + MEM_REMOTE_READ = 0x0020, /* allow remote reads */ + MEM_REMOTE_WRITE = 0x0040, /* allow remote writes */ + MEM_WINDOW_BIND = 0x0080, /* binds allowed */ + MEM_SHARED = 0x0100, /* set if MR is shared */ + MEM_STAG_VALID = 0x0200 /* set if STAG is in valid state */ +}; + +/* + * CCIL API ACF flags defined in terms of the low level mem flags. + * This minimizes the translation needed in the user API. + */ +enum c2_acf { + C2_ACF_LOCAL_READ = MEM_LOCAL_READ, + C2_ACF_LOCAL_WRITE = MEM_LOCAL_WRITE, + C2_ACF_REMOTE_READ = MEM_REMOTE_READ, + C2_ACF_REMOTE_WRITE = MEM_REMOTE_WRITE, + C2_ACF_WINDOW_BIND = MEM_WINDOW_BIND +}; + +/* + * Image types of objects written to flash + */ +#define C2_FLASH_IMG_BITFILE 1 +#define C2_FLASH_IMG_OPTION_ROM 2 +#define C2_FLASH_IMG_VPD 3 + +/* + * To fix bug 1815 we define the max allowable size of the + * terminate message (per the IETF spec). Refer to the IETF + * protocol specification, section 12.1.6, page 64. + * The message is prefixed by 20 bytes of DDP info. + * + * Then the message has 6 bytes for the terminate control + * and DDP segment length info plus a DDP header (either + * 14 or 18 bytes) plus 28 bytes for the RDMA header. + * Thus the max size is: + * 20 + (6 + 18 + 28) = 72 + */ +#define C2_MAX_TERMINATE_MESSAGE_SIZE (72) + +/* + * Build String Length. It must be the same as C2_BUILD_STR_LEN in ccil_api.h + */ +#define WR_BUILD_STR_LEN 64 + +/* + * WARNING: All of these structs need to align any 64bit types on + * 64 bit boundaries! 64bit types include u64. + */ + +/* + * Clustercore Work Request Header. Be sensitive to field layout + * and alignment. + */ +struct c2wr_hdr { + /* wqe_count is part of the cqe.
It is put here so the + * adapter can write to it while the wr is pending without + * clobbering part of the wr. This word need not be dma'd + * from the host to adapter by libccil, but we copy it anyway + * to make the memcpy to the adapter better aligned. + */ + u32 wqe_count; + + /* Put these fields next so that later 32- and 64-bit + * quantities are naturally aligned. + */ + u8 id; + u8 result; /* adapter -> host */ + u8 sge_count; /* host -> adapter */ + u8 flags; /* host -> adapter */ + + u64 context; +#ifdef CCMSGMAGIC + u32 magic; + u32 pad; +#endif +} __attribute__((packed)); + +/* + *------------------------ RNIC ------------------------ + */ + +/* + * WR_RNIC_OPEN + */ + +/* + * Flags for the RNIC WRs + */ +enum c2_rnic_flags { + RNIC_IRD_STATIC = 0x0001, + RNIC_ORD_STATIC = 0x0002, + RNIC_QP_STATIC = 0x0004, + RNIC_SRQ_SUPPORTED = 0x0008, + RNIC_PBL_BLOCK_MODE = 0x0010, + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, + RNIC_CQ_OVF_DETECTED = 0x0040, + RNIC_PRIV_MODE = 0x0080 +}; + +struct c2wr_rnic_open_req { + struct c2wr_hdr hdr; + u64 user_context; + u16 flags; /* See enum c2_rnic_flags */ + u16 port_num; +} __attribute__((packed)); + +struct c2wr_rnic_open_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +union c2wr_rnic_open { + struct c2wr_rnic_open_req req; + struct c2wr_rnic_open_rep rep; +} __attribute__((packed)); + +struct c2wr_rnic_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +/* + * WR_RNIC_QUERY + */ +struct c2wr_rnic_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 vendor_id; + u32 part_number; + u32 hw_version; + u32 fw_ver_major; + u32 fw_ver_minor; + u32 fw_ver_patch; + char fw_ver_build_str[WR_BUILD_STR_LEN]; + u32 max_qps; + u32 max_qp_depth; + u32 max_srq_depth; + u32 max_send_sgl_depth; + u32 max_rdma_sgl_depth; + u32 max_cqs; + u32 max_cq_depth; + u32 max_cq_event_handlers; + u32 max_mrs; + u32 max_pbl_depth; + u32 max_pds; + u32 max_global_ird; + u32 
max_global_ord; + u32 max_qp_ird; + u32 max_qp_ord; + u32 flags; + u32 max_mws; + u32 pbe_range_low; + u32 pbe_range_high; + u32 max_srqs; + u32 page_size; +} __attribute__((packed)); + +union c2wr_rnic_query { + struct c2wr_rnic_query_req req; + struct c2wr_rnic_query_rep rep; +} __attribute__((packed)); + +/* + * WR_RNIC_GETCONFIG + */ + +struct c2wr_rnic_getconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* see c2_getconfig_cmd_t */ + u64 reply_buf; + u32 reply_buf_len; +} __attribute__((packed)) ; + +struct c2wr_rnic_getconfig_rep { + struct c2wr_hdr hdr; + u32 option; /* see c2_getconfig_cmd_t */ + u32 count_len; /* length of the number of addresses configured */ +} __attribute__((packed)) ; + +union c2wr_rnic_getconfig { + struct c2wr_rnic_getconfig_req req; + struct c2wr_rnic_getconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_SETCONFIG + */ +struct c2wr_rnic_setconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* See c2_setconfig_cmd_t */ + /* variable data and pad. 
See c2_netaddr and c2_route */ + u8 data[0]; +} __attribute__((packed)) ; + +struct c2wr_rnic_setconfig_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_setconfig { + struct c2wr_rnic_setconfig_req req; + struct c2wr_rnic_setconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_CLOSE + */ +struct c2wr_rnic_close_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)) ; + +struct c2wr_rnic_close_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_close { + struct c2wr_rnic_close_req req; + struct c2wr_rnic_close_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ CQ ------------------------ + */ +struct c2wr_cq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u64 msg_pool; + u32 rnic_handle; + u32 msg_size; + u32 depth; +} __attribute__((packed)) ; + +struct c2wr_cq_create_rep { + struct c2wr_hdr hdr; + u32 mq_index; + u32 adapter_shared; + u32 cq_handle; +} __attribute__((packed)) ; + +union c2wr_cq_create { + struct c2wr_cq_create_req req; + struct c2wr_cq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; + u32 new_depth; + u64 new_msg_pool; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_modify { + struct c2wr_cq_modify_req req; + struct c2wr_cq_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_destroy { + struct c2wr_cq_destroy_req req; + struct c2wr_cq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ PD ------------------------ + */ +struct c2wr_pd_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} 
__attribute__((packed)) ; + +struct c2wr_pd_alloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_alloc { + struct c2wr_pd_alloc_req req; + struct c2wr_pd_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_dealloc { + struct c2wr_pd_dealloc_req req; + struct c2wr_pd_dealloc_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ SRQ ------------------------ + */ +struct c2wr_srq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u32 rnic_handle; + u32 srq_depth; + u32 srq_limit; + u32 sgl_depth; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_srq_create_rep { + struct c2wr_hdr hdr; + u32 srq_depth; + u32 sgl_depth; + u32 msg_size; + u32 mq_index; + u32 mq_start; + u32 srq_handle; +} __attribute__((packed)) ; + +union c2wr_srq_create { + struct c2wr_srq_create_req req; + struct c2wr_srq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 srq_handle; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_srq_destroy { + struct c2wr_srq_destroy_req req; + struct c2wr_srq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ QP ------------------------ + */ +enum c2wr_qp_flags { + QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */ + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ + QP_MW_BIND = 0x00000004, /* MWs enabled */ + QP_ZERO_STAG = 0x00000008, /* enabled? */ + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */ + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ + /* enabled? 
*/ +}; + +struct c2wr_qp_create_req { + struct c2wr_hdr hdr; + u64 shared_sq_ht; + u64 shared_rq_ht; + u64 user_context; + u32 rnic_handle; + u32 sq_cq_handle; + u32 rq_cq_handle; + u32 sq_depth; + u32 rq_depth; + u32 srq_handle; + u32 srq_limit; + u32 flags; /* see enum c2wr_qp_flags */ + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_qp_create_rep { + struct c2wr_hdr hdr; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; + u32 qp_handle; +} __attribute__((packed)) ; + +union c2wr_qp_create { + struct c2wr_qp_create_req req; + struct c2wr_qp_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 rnic_handle; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 rdma_write_sgl_depth; + u32 recv_sgl_depth; + u32 ord; + u32 ird; + u16 qp_state; + u16 flags; /* see c2wr_qp_flags_t */ + u32 qp_id; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; + u32 terminate_msg_length; /* 0 if not present */ + u8 data[0]; + /* Terminate Message in-line here. 
*/ +} __attribute__((packed)) ; + +union c2wr_qp_query { + struct c2wr_qp_query_req req; + struct c2wr_qp_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_req { + struct c2wr_hdr hdr; + u64 stream_msg; + u32 stream_msg_length; + u32 rnic_handle; + u32 qp_handle; + u32 next_qp_state; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 llp_ep_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_rep { + struct c2wr_hdr hdr; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; +} __attribute__((packed)) ; + +union c2wr_qp_modify { + struct c2wr_qp_modify_req req; + struct c2wr_qp_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_qp_destroy { + struct c2wr_qp_destroy_req req; + struct c2wr_qp_destroy_rep rep; +} __attribute__((packed)) ; + +/* + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can + * only be posted when a QP is in IDLE state. After the connect request is + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state. + * No synchronous reply from adapter to this WR. The results of + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS + * See c2wr_ae_active_connect_results_t + */ +struct c2wr_qp_connect_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; + u32 remote_addr; + u16 remote_port; + u16 pad; + u32 private_data_length; + u8 private_data[0]; /* Private data in-line. */ +} __attribute__((packed)) ; + +struct c2wr_qp_connect { + struct c2wr_qp_connect_req req; + /* no synchronous reply. 
*/ +} __attribute__((packed)) ; + + +/* + *------------------------ MM ------------------------ + */ + +struct c2wr_nsmr_stag_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pbl_depth; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +struct c2wr_nsmr_stag_alloc_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_stag_alloc { + struct c2wr_nsmr_stag_alloc_req req; + struct c2wr_nsmr_stag_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_register { + struct c2wr_nsmr_register_req req; + struct c2wr_nsmr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 flags; + u32 stag_index; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_nsmr_pbl { + struct c2wr_nsmr_pbl_req req; + struct c2wr_nsmr_pbl_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mr_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_mr_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; + u32 pbl_depth; +} __attribute__((packed)) ; + +union c2wr_mr_query { + struct c2wr_mr_query_req req; + struct c2wr_mr_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 
stag_index; +} __attribute__((packed)) ; + +struct c2wr_mw_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +union c2wr_mw_query { + struct c2wr_mw_query_req req; + struct c2wr_mw_query_rep rep; +} __attribute__((packed)) ; + + +struct c2wr_stag_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_stag_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_stag_dealloc { + struct c2wr_stag_dealloc_req req; + struct c2wr_stag_dealloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + u32 pad1; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_reregister { + struct c2wr_nsmr_reregister_req req; + struct c2wr_nsmr_reregister_rep rep; +} __attribute__((packed)) ; + +struct c2wr_smr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_smr_register_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_smr_register { + struct c2wr_smr_register_req req; + struct c2wr_smr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_mw_alloc { + struct c2wr_mw_alloc_req req; + struct c2wr_mw_alloc_rep rep; +} 
__attribute__((packed)) ; + +/* + *------------------------ WRs ----------------------- + */ + +struct c2wr_user_hdr { + struct c2wr_hdr hdr; /* Has status and WR Type */ +} __attribute__((packed)) ; + +enum c2_qp_state { + C2_QP_STATE_IDLE = 0x01, + C2_QP_STATE_CONNECTING = 0x02, + C2_QP_STATE_RTS = 0x04, + C2_QP_STATE_CLOSING = 0x08, + C2_QP_STATE_TERMINATE = 0x10, + C2_QP_STATE_ERROR = 0x20, +}; + +/* Completion queue entry. */ +struct c2wr_ce { + struct c2wr_hdr hdr; /* Has status and WR Type */ + u64 qp_user_context; /* c2_user_qp_t * */ + u32 qp_state; /* Current QP State */ + u32 handle; /* QPID or EP Handle */ + u32 bytes_rcvd; /* valid for RECV WCs */ + u32 stag; +} __attribute__((packed)) ; + + +/* + * Flags used for all post-sq WRs. These must fit in the flags + * field of the struct c2wr_hdr (eight bits). + */ +enum { + SQ_SIGNALED = 0x01, + SQ_READ_FENCE = 0x02, + SQ_FENCE = 0x04, +}; + +/* + * Common fields for all post-sq WRs. Namely the standard header and a + * secondary header with fields common to all post-sq WRs. + */ +struct c2_sq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * Same as above but for post-rq WRs. + */ +struct c2_rq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * use the same struct for all sends. 
+ */ +struct c2wr_send_req { + struct c2_sq_hdr sq_hdr; + u32 sge_len; + u32 remote_stag; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); +/* XXX c2wr_send_req_t, c2wr_send_se_req_t, c2wr_send_inv_req_t, + c2wr_send_se_inv_req_t;*/ + +union c2wr_send { + struct c2wr_send_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_write_req { + struct c2_sq_hdr sq_hdr; + u64 remote_to; + u32 remote_stag; + u32 sge_len; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); + +union c2wr_rdma_write { + struct c2wr_rdma_write_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_read_req { + struct c2_sq_hdr sq_hdr; + u64 local_to; + u64 remote_to; + u32 local_stag; + u32 remote_stag; + u32 length; +} __attribute__((packed)); + +union c2wr_rdma_read { + struct c2wr_rdma_read_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_mw_bind_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 mw_stag_index; + u32 mr_stag_index; + u32 length; + u32 flags; +} __attribute__((packed)); + +union c2wr_mw_bind { + struct c2wr_mw_bind_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_nsmr_fastreg_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 stag_index; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)); + +union c2wr_nsmr_fastreg { + struct c2wr_nsmr_fastreg_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_stag_invalidate_req { + struct c2_sq_hdr sq_hdr; + u8 stag_key; + u8 pad[3]; + u32 stag_index; +} __attribute__((packed)); + +union c2wr_stag_invalidate { + struct c2wr_stag_invalidate_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +union c2wr_sqwr { + struct c2_sq_hdr sq_hdr; + struct c2wr_send_req send; + struct c2wr_send_req send_se; + struct c2wr_send_req send_inv; + 
struct c2wr_send_req send_se_inv; + struct c2wr_rdma_write_req rdma_write; + struct c2wr_rdma_read_req rdma_read; + struct c2wr_mw_bind_req mw_bind; + struct c2wr_nsmr_fastreg_req nsmr_fastreg; + struct c2wr_stag_invalidate_req stag_inv; +} __attribute__((packed)); + + +/* + * RQ WRs + */ +struct c2wr_rqwr { + struct c2_rq_hdr rq_hdr; + u8 data[0]; /* array of SGEs */ +} __attribute__((packed)); +/* XXX c2wr_rqwr_t, c2wr_recv_req_t; */ + +union c2wr_recv { + struct c2wr_rqwr req; + struct c2wr_ce rep; +} __attribute__((packed)); + +/* + * All AEs start with this header. Most AEs only need to convey the + * information in the header. Some, like LLP connection events, need + * more info. The union typedef c2wr_ae_t has all the possible AEs. + * + * hdr.context is the user_context from the rnic_open WR. NULL if this + * is not affiliated with an RNIC. + * + * hdr.id is the AE identifier (e.g. CCAE_REMOTE_SHUTDOWN, + * CCAE_LLP_CLOSE_COMPLETE) + * + * resource_type is one of: C2_RES_IND_QP, C2_RES_IND_CQ, C2_RES_IND_SRQ + * + * user_context is the context passed down when the host created the resource. + */ +struct c2wr_ae_hdr { + struct c2wr_hdr hdr; + u64 user_context; /* user context for this res. */ + u32 resource_type; /* see enum c2_resource_indicator */ + u32 resource; /* handle for resource */ + u32 qp_state; /* current QP State */ +} __attribute__((packed)); + +/* + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, + * the adapter moves the QP into RTS state. + */ +struct c2wr_ae_active_connect_results { + struct c2wr_ae_hdr ae_hdr; + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +/* + * When connections are established by the stack (and the private data + * MPA frame is received), the adapter will generate an event to the host.
+ * The details of the connection, any private data, and the new connection + * request handle are passed up via the CCAE_CONNECTION_REQUEST msg on the + * AE queue: + */ +struct c2wr_ae_connection_request { + struct c2wr_ae_hdr ae_hdr; + u32 cr_handle; /* connreq handle (sock ptr) */ + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +union c2wr_ae { + struct c2wr_ae_hdr ae_generic; + struct c2wr_ae_active_connect_results ae_active_connect_results; + struct c2wr_ae_connection_request ae_connection_request; +} __attribute__((packed)); + +struct c2wr_init_req { + struct c2wr_hdr hdr; + u64 hint_count; + u64 q0_host_shared; + u64 q1_host_shared; + u64 q1_host_msg_pool; + u64 q2_host_shared; + u64 q2_host_msg_pool; +} __attribute__((packed)); + +struct c2wr_init_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_init { + struct c2wr_init_req req; + struct c2wr_init_rep rep; +} __attribute__((packed)); + +/* + * For upgrading flash.
+ */ + +struct c2wr_flash_init_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +struct c2wr_flash_init_rep { + struct c2wr_hdr hdr; + u32 adapter_flash_buf_offset; + u32 adapter_flash_len; +} __attribute__((packed)); + +union c2wr_flash_init { + struct c2wr_flash_init_req req; + struct c2wr_flash_init_rep rep; +} __attribute__((packed)); + +struct c2wr_flash_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 len; +} __attribute__((packed)); + +struct c2wr_flash_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash { + struct c2wr_flash_req req; + struct c2wr_flash_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 size; +} __attribute__((packed)); + +struct c2wr_buf_alloc_rep { + struct c2wr_hdr hdr; + u32 offset; /* 0 if mem not available */ + u32 size; /* 0 if mem not available */ +} __attribute__((packed)); + +union c2wr_buf_alloc { + struct c2wr_buf_alloc_req req; + struct c2wr_buf_alloc_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_free_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; /* Must match value from alloc */ + u32 size; /* Must match value from alloc */ +} __attribute__((packed)); + +struct c2wr_buf_free_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_buf_free { + struct c2wr_buf_free_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_flash_write_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; + u32 size; + u32 type; + u32 flags; +} __attribute__((packed)); + +struct c2wr_flash_write_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash_write { + struct c2wr_flash_write_req req; + struct c2wr_flash_write_rep rep; +} __attribute__((packed)); + +/* + * Messages for LLP connection setup. + */ + +/* + * Listen Request. This allocates a listening endpoint to allow passive + * connection setup. 
Newly established LLP connections are passed up + * via an AE. See c2wr_ae_connection_request_t + */ +struct c2wr_ep_listen_create_req { + struct c2wr_hdr hdr; + u64 user_context; /* returned in AEs. */ + u32 rnic_handle; + u32 local_addr; /* local addr, or 0 */ + u16 local_port; /* 0 means "pick one" */ + u16 pad; + u32 backlog; /* traditional TCP listen backlog */ +} __attribute__((packed)); + +struct c2wr_ep_listen_create_rep { + struct c2wr_hdr hdr; + u32 ep_handle; /* handle to new listening ep */ + u16 local_port; /* resulting port... */ + u16 pad; +} __attribute__((packed)); + +union c2wr_ep_listen_create { + struct c2wr_ep_listen_create_req req; + struct c2wr_ep_listen_create_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_ep_listen_destroy { + struct c2wr_ep_listen_destroy_req req; + struct c2wr_ep_listen_destroy_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_query_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +} __attribute__((packed)); + +union c2wr_ep_query { + struct c2wr_ep_query_req req; + struct c2wr_ep_query_rep rep; +} __attribute__((packed)); + + +/* + * The host passes this down to indicate acceptance of a pending iWARP + * connection. The cr_handle was obtained from the CONNECTION_REQUEST + * AE passed up by the adapter. See c2wr_ae_connection_request_t. + */ +struct c2wr_cr_accept_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; /* QP to bind to this LLP conn */ + u32 ep_handle; /* LLP handle to accept */ + u32 private_data_length; + u8 private_data[0]; /* data in-line in msg.
*/ +} __attribute__((packed)); + +/* + * adapter sends reply when private data is successfully submitted to + * the LLP. + */ +struct c2wr_cr_accept_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_accept { + struct c2wr_cr_accept_req req; + struct c2wr_cr_accept_rep rep; +} __attribute__((packed)); + +/* + * The host sends this down if a given iWARP connection request was + * rejected by the consumer. The cr_handle was obtained from a + * previous c2wr_ae_connection_request_t AE sent by the adapter. + */ +struct c2wr_cr_reject_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; /* LLP handle to reject */ +} __attribute__((packed)); + +/* + * Dunno if this is needed, but we'll add it for now. The adapter will + * send the reject_reply after the LLP endpoint has been destroyed. + */ +struct c2wr_cr_reject_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_reject { + struct c2wr_cr_reject_req req; + struct c2wr_cr_reject_rep rep; +} __attribute__((packed)); + +/* + * console command. Used to implement a debug console over the verbs + * request and reply queues. + */ + +/* + * Console request message. It contains: + * - message hdr with id = CCWR_CONSOLE + * - the physaddr/len of host memory to be used for the reply. + * - the command string. eg: "netstat -s" or "zoneinfo" + */ +struct c2wr_console_req { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u64 reply_buf; /* pinned host buf for reply */ + u32 reply_buf_len; /* length of reply buffer */ + u8 command[0]; /* NUL terminated ascii string */ + /* containing the command req */ +} __attribute__((packed)); + +/* + * flags used in the console reply. + */ +enum c2_console_flags { + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ +} __attribute__((packed)); + +/* + * Console reply message. + * hdr.result contains the c2_status_t error if the reply was _not_ generated, + * or C2_OK if the reply was generated. 
+ */ +struct c2wr_console_rep { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u32 flags; +} __attribute__((packed)); + +union c2wr_console { + struct c2wr_console_req req; + struct c2wr_console_rep rep; +} __attribute__((packed)); + + +/* + * Giant union with all WRs. Makes life easier... + */ +union c2wr { + struct c2wr_hdr hdr; + struct c2wr_user_hdr user_hdr; + union c2wr_rnic_open rnic_open; + union c2wr_rnic_query rnic_query; + union c2wr_rnic_getconfig rnic_getconfig; + union c2wr_rnic_setconfig rnic_setconfig; + union c2wr_rnic_close rnic_close; + union c2wr_cq_create cq_create; + union c2wr_cq_modify cq_modify; + union c2wr_cq_destroy cq_destroy; + union c2wr_pd_alloc pd_alloc; + union c2wr_pd_dealloc pd_dealloc; + union c2wr_srq_create srq_create; + union c2wr_srq_destroy srq_destroy; + union c2wr_qp_create qp_create; + union c2wr_qp_query qp_query; + union c2wr_qp_modify qp_modify; + union c2wr_qp_destroy qp_destroy; + struct c2wr_qp_connect qp_connect; + union c2wr_nsmr_stag_alloc nsmr_stag_alloc; + union c2wr_nsmr_register nsmr_register; + union c2wr_nsmr_pbl nsmr_pbl; + union c2wr_mr_query mr_query; + union c2wr_mw_query mw_query; + union c2wr_stag_dealloc stag_dealloc; + union c2wr_sqwr sqwr; + struct c2wr_rqwr rqwr; + struct c2wr_ce ce; + union c2wr_ae ae; + union c2wr_init init; + union c2wr_ep_listen_create ep_listen_create; + union c2wr_ep_listen_destroy ep_listen_destroy; + union c2wr_cr_accept cr_accept; + union c2wr_cr_reject cr_reject; + union c2wr_console console; + union c2wr_flash_init flash_init; + union c2wr_flash flash; + union c2wr_buf_alloc buf_alloc; + union c2wr_buf_free buf_free; + union c2wr_flash_write flash_write; +} __attribute__((packed)); + + +/* + * Accessors for the wr fields that are packed together tightly to + * reduce the wr message size. The wr arguments are void* so that + * either a struct c2wr*, a struct c2wr_hdr*, or a pointer to any of the types + * in the struct c2wr union can be passed in. 
+ */ +static __inline__ u8 c2_wr_get_id(void *wr) +{ + return ((struct c2wr_hdr *) wr)->id; +} +static __inline__ void c2_wr_set_id(void *wr, u8 id) +{ + ((struct c2wr_hdr *) wr)->id = id; +} +static __inline__ u8 c2_wr_get_result(void *wr) +{ + return ((struct c2wr_hdr *) wr)->result; +} +static __inline__ void c2_wr_set_result(void *wr, u8 result) +{ + ((struct c2wr_hdr *) wr)->result = result; +} +static __inline__ u8 c2_wr_get_flags(void *wr) +{ + return ((struct c2wr_hdr *) wr)->flags; +} +static __inline__ void c2_wr_set_flags(void *wr, u8 flags) +{ + ((struct c2wr_hdr *) wr)->flags = flags; +} +static __inline__ u8 c2_wr_get_sge_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->sge_count; +} +static __inline__ void c2_wr_set_sge_count(void *wr, u8 sge_count) +{ + ((struct c2wr_hdr *) wr)->sge_count = sge_count; +} +static __inline__ u32 c2_wr_get_wqe_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->wqe_count; +} +static __inline__ void c2_wr_set_wqe_count(void *wr, u32 wqe_count) +{ + ((struct c2wr_hdr *) wr)->wqe_count = wqe_count; +} + +#endif /* _C2_WR_H_ */ From swise at opengridcomputing.com Wed Jun 7 13:07:02 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:07:02 -0500 Subject: [openib-general] [PATCH v2 7/7] AMSO1100 Makefiles and Kconfig changes. 
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200702.9259.62339.stgit@stevo-desktop> Review Changes: - C2DEBUG -> DEBUG --- drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/amso1100/Kbuild | 10 ++++++++++ drivers/infiniband/hw/amso1100/Kconfig | 15 +++++++++++++++ drivers/infiniband/hw/amso1100/README | 11 +++++++++++ 5 files changed, 38 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index ba2d650..04e6d4f 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -36,6 +36,7 @@ config INFINIBAND_ADDR_TRANS source "drivers/infiniband/hw/mthca/Kconfig" source "drivers/infiniband/hw/ipath/Kconfig" +source "drivers/infiniband/hw/amso1100/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index eea2732..e2b93f9 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -1,5 +1,6 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ obj-$(CONFIG_IPATH_CORE) += hw/ipath/ +obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ diff --git a/drivers/infiniband/hw/amso1100/Kbuild b/drivers/infiniband/hw/amso1100/Kbuild new file mode 100644 index 0000000..e1f10ab --- /dev/null +++ b/drivers/infiniband/hw/amso1100/Kbuild @@ -0,0 +1,10 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_AMSO1100_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_AMSO1100) += iw_c2.o + +iw_c2-y := c2.o c2_provider.o c2_rnic.o c2_alloc.o c2_mq.o c2_ae.o c2_vq.o \ + c2_intr.o c2_cq.o c2_qp.o c2_cm.o c2_mm.o c2_pd.o diff --git a/drivers/infiniband/hw/amso1100/Kconfig b/drivers/infiniband/hw/amso1100/Kconfig new file mode 100644 index 0000000..809cb14 --- 
/dev/null +++ b/drivers/infiniband/hw/amso1100/Kconfig @@ -0,0 +1,15 @@ +config INFINIBAND_AMSO1100 + tristate "Ammasso 1100 HCA support" + depends on PCI && INET && INFINIBAND + ---help--- + This is a low-level driver for the Ammasso 1100 host + channel adapter (HCA). + +config INFINIBAND_AMSO1100_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_AMSO1100 + default n + ---help--- + This option causes the amso1100 driver to produce a bunch of + debug messages. Select this if you are developing the driver + or trying to diagnose a problem. diff --git a/drivers/infiniband/hw/amso1100/README b/drivers/infiniband/hw/amso1100/README new file mode 100644 index 0000000..1331353 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/README @@ -0,0 +1,11 @@ +This is the OpenFabrics provider driver for the +AMSO1100 1Gb RNIC adapter. + +This adapter is available in limited quantities +for development purposes from Open Grid Computing. + +This driver requires the IWCM and CMA mods necessary +to support iWARP. + +Contact tom at opengridcomputing.com for more information. + From swise at opengridcomputing.com Wed Jun 7 13:06:57 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:57 -0500 Subject: [openib-general] [PATCH v2 5/7] AMSO1100 Message Queues. 
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200657.9259.48820.stgit@stevo-desktop> Review Changes: - remove useless asserts - assert() -> BUG_ON() - C2_DEBUG -> DEBUG --- drivers/infiniband/hw/amso1100/c2_mq.c | 175 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_mq.h | 103 +++++++++++++++++++ 2 files changed, 278 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_mq.c b/drivers/infiniband/hw/amso1100/c2_mq.c new file mode 100644 index 0000000..0b0ab02 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.c @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_mq.h" + +void *c2_mq_alloc(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (c2_mq_full(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = + (struct c2wr_hdr *) (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(~CCWR_MAGIC)); + m->magic = cpu_to_be32(CCWR_MAGIC); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_produce(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (!c2_mq_full(q)) { + q->priv = (q->priv + 1) % q->q_size; + q->hint_count++; + /* Update peer's offset. */ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + +void *c2_mq_consume(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (c2_mq_empty(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = (struct c2wr_hdr *) + (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(CCWR_MAGIC)); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_free(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (!c2_mq_empty(q)) { + +#ifdef CCMSGMAGIC + { + struct c2wr_hdr __iomem *m = (struct c2wr_hdr __iomem *) + (q->msg_pool.adapter + q->priv * q->msg_size); + __raw_writel(cpu_to_be32(~CCWR_MAGIC), &m->magic); + } +#endif + q->priv = (q->priv + 1) % q->q_size; + /* Update peer's offset. 
*/ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + + +void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + while (wqe_count--) { + BUG_ON(c2_mq_empty(q)); + *q->shared = cpu_to_be16((be16_to_cpu(*q->shared)+1) % q->q_size); + } +} + + +u32 c2_mq_count(struct c2_mq *q) +{ + s32 count; + + if (q->type == C2_MQ_HOST_TARGET) { + count = be16_to_cpu(*q->shared) - q->priv; + } else { + count = q->priv - be16_to_cpu(*q->shared); + } + + if (count < 0) { + count += q->q_size; + } + + return (u32) count; +} + +void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! */ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.adapter = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} +void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! */ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.host = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mq.h b/drivers/infiniband/hw/amso1100/c2_mq.h new file mode 100644 index 0000000..de00184 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.h @@ -0,0 +1,103 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
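The fullness and count logic in the MQ code above relies on a private index plus a shared, peer-updated index, compared modulo the queue size. A host-only model of that arithmetic (no adapter, no byte swapping, no I/O memory — purely illustrative, with hypothetical names) looks like:

```c
#include <assert.h>
#include <stdint.h>

/* Host-only model of the MQ index scheme: `priv` is our private index,
 * `shared` stands in for the peer-updated index. One slot is always kept
 * empty to distinguish full from empty, as in c2_mq_full(). */
struct mq_model {
    uint16_t priv;
    uint16_t shared;
    uint32_t q_size;
};

static int mq_empty(const struct mq_model *q)
{
    return q->priv == q->shared;
}

static int mq_full(const struct mq_model *q)
{
    return q->priv == (q->shared + q->q_size - 1) % q->q_size;
}

/* Mirrors the wraparound handling in c2_mq_count(), seen from the
 * producer side: distance between the two indices, wrapped. */
static uint32_t mq_count(const struct mq_model *q)
{
    int32_t count = (int32_t) q->priv - (int32_t) q->shared;
    if (count < 0)
        count += q->q_size;
    return (uint32_t) count;
}

static void mq_produce(struct mq_model *q)
{
    if (!mq_full(q))
        q->priv = (q->priv + 1) % q->q_size;
}
```

With `q_size` slots, at most `q_size - 1` messages can be outstanding; that reserved slot is what lets the same two indices encode both "empty" and "full" without an extra flag.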
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef _C2_MQ_H_ +#define _C2_MQ_H_ +#include +#include "c2_wr.h" + +enum c2_shared_regs { + + C2_SHARED_ARMED = 0x10, + C2_SHARED_NOTIFY = 0x18, + C2_SHARED_SHARED = 0x40, +}; + +struct c2_mq_shared { + u16 unused1; + u8 armed; + u8 notification_type; + u32 unused2; + u16 shared; + /* Pad to 64 bytes. */ + u8 pad[64 - sizeof(u16) - 2 * sizeof(u8) - sizeof(u32) - sizeof(u16)]; +}; + +enum c2_mq_type { + C2_MQ_HOST_TARGET = 1, + C2_MQ_ADAPTER_TARGET = 2, +}; + +/* + * c2_mq_t is for kernel-mode MQs like the VQs and the AEQ. + * c2_user_mq_t (which is the same format) is for user-mode MQs... 
+ */ +#define C2_MQ_MAGIC 0x4d512020 /* 'MQ ' */ +struct c2_mq { + u32 magic; + union { + u8 *host; + u8 __iomem *adapter; + } msg_pool; + u16 hint_count; + u16 priv; + struct c2_mq_shared __iomem *peer; + u16 *shared; + u32 q_size; + u32 msg_size; + u32 index; + enum c2_mq_type type; +}; + +static __inline__ int c2_mq_empty(struct c2_mq *q) +{ + return q->priv == be16_to_cpu(*q->shared); +} + +static __inline__ int c2_mq_full(struct c2_mq *q) +{ + return q->priv == (be16_to_cpu(*q->shared) + q->q_size - 1) % q->q_size; +} + +extern void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count); +extern void *c2_mq_alloc(struct c2_mq *q); +extern void c2_mq_produce(struct c2_mq *q); +extern void *c2_mq_consume(struct c2_mq *q); +extern void c2_mq_free(struct c2_mq *q); +extern u32 c2_mq_count(struct c2_mq *q); +extern void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type); +extern void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type); + +#endif /* _C2_MQ_H_ */ From swise at opengridcomputing.com Wed Jun 7 13:07:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:07:00 -0500 Subject: [openib-general] [PATCH v2 6/7] AMSO1100: Privileged Verbs Queues. In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200659.9259.85242.stgit@stevo-desktop> Review Changes: dprintk() -> pr_debug() --- drivers/infiniband/hw/amso1100/c2_vq.c | 260 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_vq.h | 63 ++++++++ 2 files changed, 323 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c new file mode 100644 index 0000000..445b1ed --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.c @@ -0,0 +1,260 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. 
All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include "c2_vq.h" +#include "c2_provider.h" + +/* + * Verbs Request Objects: + * + * VQ Request Objects are allocated by the kernel verbs handlers. + * They contain a wait object, a refcnt, an atomic bool indicating that the + * adapter has replied, and a copy of the verb reply work request. + * A pointer to the VQ Request Object is passed down in the context + * field of the work request message, and reflected back by the adapter + * in the verbs reply message. 
The function handle_vq() in the interrupt + * path will use this pointer to: + * 1) append a copy of the verbs reply message + * 2) mark that the reply is ready + * 3) wake up the kernel verbs handler blocked awaiting the reply. + * + * + * The kernel verbs handlers do a "get" to put a 2nd reference on the + * VQ Request object. If the kernel verbs handler exits before the adapter + * can respond, this extra reference will keep the VQ Request object around + * until the adapter's reply can be processed. The reason we need this is + * because a pointer to this object is stuffed into the context field of + * the verbs work request message, and reflected back in the reply message. + * It is used in the interrupt handler (handle_vq()) to wake up the appropriate + * kernel verb handler that is blocked awaiting the verb reply. + * So handle_vq() will do a "put" on the object when it's done accessing it. + * NOTE: If we guarantee that the kernel verb handler will never bail before + * getting the reply, then we don't need these refcnts. + * + * + * VQ Request objects are freed by the kernel verbs handlers only + * after the verb has been processed, or when the adapter fails and + * does not reply. + * + * + * Verbs Reply Buffers: + * + * VQ Reply bufs are local host memory copies of an + * outstanding Verb Request reply + * message. They are always allocated by the kernel verbs handlers, and _may_ be + * freed by either the kernel verbs handler -or- the interrupt handler. The + * kernel verbs handler _must_ free the repbuf, then free the vq request object + * in that order. 
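The two-reference scheme described above (the allocator holds one reference, the context pointer handed to the adapter effectively holds the other, and whoever drops the count to zero frees) can be modeled in user-space C with a plain counter. The names here are hypothetical stand-ins; the kernel version uses `atomic_t` and wait queues:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for struct c2_vq_req: a refcount and a reply-buffer slot.
 * The freed_flag pointer is only a test aid so callers can observe the free. */
struct vq_req_model {
    int refcnt;
    void *reply;       /* reply buffer, may stay NULL */
    int *freed_flag;
};

static struct vq_req_model *req_alloc(int *freed_flag)
{
    struct vq_req_model *r = calloc(1, sizeof(*r));
    if (r) {
        r->refcnt = 1;          /* the allocating handler's reference */
        r->freed_flag = freed_flag;
    }
    return r;
}

static void req_get(struct vq_req_model *r)
{
    r->refcnt++;                /* 2nd ref before the ptr goes to the "adapter" */
}

static void req_put(struct vq_req_model *r)
{
    if (--r->refcnt == 0) {
        free(r->reply);         /* free reply buffer first, then the object */
        if (r->freed_flag)
            *r->freed_flag = 1;
        free(r);
    }
}
```

The ordering requirement in the comment (free the reply buffer, then the request object) falls out naturally here because the buffer pointer lives inside the object being freed.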
+ */ + +int vq_init(struct c2_dev *c2dev) +{ + sprintf(c2dev->vq_cache_name, "c2-vq:dev%c", + (char) ('0' + c2dev->devnum)); + c2dev->host_msg_cache = + kmem_cache_create(c2dev->vq_cache_name, c2dev->rep_vq.msg_size, 0, + SLAB_HWCACHE_ALIGN, NULL, NULL); + if (c2dev->host_msg_cache == NULL) { + return -ENOMEM; + } + return 0; +} + +void vq_term(struct c2_dev *c2dev) +{ + kmem_cache_destroy(c2dev->host_msg_cache); +} + +/* vq_req_alloc - allocate a VQ Request Object and initialize it. + * The refcnt is set to 1. + */ +struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev) +{ + struct c2_vq_req *r; + + r = kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL); + if (r) { + init_waitqueue_head(&r->wait_object); + r->reply_msg = (u64) NULL; + r->event = 0; + r->cm_id = NULL; + r->qp = NULL; + atomic_set(&r->refcnt, 1); + atomic_set(&r->reply_ready, 0); + } + return r; +} + + +/* vq_req_free - free the VQ Request Object. It is assumed the verbs handler + * has already freed the VQ Reply Buffer if it existed. + */ +void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + r->reply_msg = (u64) NULL; + if (atomic_dec_and_test(&r->refcnt)) { + kfree(r); + } +} + +/* vq_req_get - reference a VQ Request Object. Done + * only in the kernel verbs handlers. + */ +void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + atomic_inc(&r->refcnt); +} + + +/* vq_req_put - dereference and potentially free a VQ Request Object. + * + * This is only called by handle_vq() on the + * interrupt when it is done processing + * a verb reply message. If the associated + * kernel verbs handler has already bailed, + * then this put will actually free the VQ + * Request object _and_ the VQ Reply Buffer + * if it exists. 
+ */ +void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + if (atomic_dec_and_test(&r->refcnt)) { + if (r->reply_msg != (u64) NULL) + vq_repbuf_free(c2dev, + (void *) (unsigned long) r->reply_msg); + kfree(r); + } +} + + +/* + * vq_repbuf_alloc - allocate a VQ Reply Buffer. + */ +void *vq_repbuf_alloc(struct c2_dev *c2dev) +{ + return kmem_cache_alloc(c2dev->host_msg_cache, SLAB_ATOMIC); +} + +/* + * vq_send_wr - post a verbs request message to the Verbs Request Queue. + * If a message is not available in the MQ, then block until one is available. + * NOTE: handle_mq() in interrupt context will wake up threads blocked here. + * When the adapter drains the Verbs Request Queue, + * it inserts MQ index 0 into the + * adapter->host activity fifo and interrupts the host. + */ +int vq_send_wr(struct c2_dev *c2dev, union c2wr *wr) +{ + void *msg; + wait_queue_t __wait; + + /* + * grab adapter vq lock + */ + spin_lock(&c2dev->vqlock); + + /* + * allocate msg + */ + msg = c2_mq_alloc(&c2dev->req_vq); + + /* + * If we cannot get a msg, then we'll wait. + * When messages are available, the int handler will wake_up() + * any waiters. + */ + while (msg == NULL) { + pr_debug("%s:%d no available msg in VQ, waiting...\n", + __FUNCTION__, __LINE__); + init_waitqueue_entry(&__wait, current); + add_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_unlock(&c2dev->vqlock); + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + if (!c2_mq_full(&c2dev->req_vq)) { + break; + } + if (!signal_pending(current)) { + schedule_timeout(1 * HZ); /* 1 second... 
*/ + continue; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + return -EINTR; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_lock(&c2dev->vqlock); + msg = c2_mq_alloc(&c2dev->req_vq); + } + + /* + * copy wr into adapter msg + */ + memcpy(msg, wr, c2dev->req_vq.msg_size); + + /* + * post msg + */ + c2_mq_produce(&c2dev->req_vq); + + /* + * release adapter vq lock + */ + spin_unlock(&c2dev->vqlock); + return 0; +} + + +/* + * vq_wait_for_reply - block until the adapter posts a Verb Reply Message. + */ +int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req) +{ + if (!wait_event_timeout(req->wait_object, + atomic_read(&req->reply_ready), + 60*HZ)) + return -ETIMEDOUT; + + return 0; +} + +/* + * vq_repbuf_free - Free a Verbs Reply Buffer. + */ +void vq_repbuf_free(struct c2_dev *c2dev, void *reply) +{ + kmem_cache_free(c2dev->host_msg_cache, reply); +} diff --git a/drivers/infiniband/hw/amso1100/c2_vq.h b/drivers/infiniband/hw/amso1100/c2_vq.h new file mode 100644 index 0000000..3380562 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_VQ_H_ +#define _C2_VQ_H_ +#include +#include "c2.h" +#include "c2_wr.h" +#include "c2_provider.h" + +struct c2_vq_req { + u64 reply_msg; /* ptr to reply msg */ + wait_queue_head_t wait_object; /* wait object for vq reqs */ + atomic_t reply_ready; /* set when reply is ready */ + atomic_t refcnt; /* used to cancel WRs... */ + int event; + struct iw_cm_id *cm_id; + struct c2_qp *qp; +}; + +extern int vq_init(struct c2_dev *c2dev); +extern void vq_term(struct c2_dev *c2dev); + +extern struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev); +extern void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *req); +extern void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *req); +extern void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *req); +extern int vq_send_wr(struct c2_dev *c2dev, union c2wr * wr); + +extern void *vq_repbuf_alloc(struct c2_dev *c2dev); +extern void vq_repbuf_free(struct c2_dev *c2dev, void *reply); + +extern int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req); +#endif /* _C2_VQ_H_ */ From swise at opengridcomputing.com Wed Jun 7 13:39:22 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:39:22 -0500 Subject: [openib-general] [PATCH v2 2/7] AMSO1100 WR / Event Definitions. 
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <1149712762.27684.82.camel@stevo-desktop> Resending 2/7 gzipped. linux-kernel and netdev mailing lists didn't forward the plain text patch... If anyone knows how to address this issue, please email me directly cuz I don't know why 2/7 didn't get forwarded. Sorry. Steve. -------------- next part -------------- A non-text attachment was scrubbed... Name: amso1100_wr.gz Type: application/x-gzip Size: 10387 bytes Desc: not available URL: From rdreier at cisco.com Wed Jun 7 13:43:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jun 2006 13:43:08 -0700 Subject: [openib-general] Re: [PATCH v2 2/7] AMSO1100 WR / Event Definitions. In-Reply-To: <1149712762.27684.82.camel@stevo-desktop> (Steve Wise's message of "Wed, 07 Jun 2006 15:39:22 -0500") References: <20060607200646.9259.24588.stgit@stevo-desktop> <1149712762.27684.82.camel@stevo-desktop> Message-ID: I just realized it could be the spam filters. You have some comments with three 'X's in a row which might be getting it blocked. Is that possible? - R. From swise at opengridcomputing.com Wed Jun 7 13:59:32 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:59:32 -0500 Subject: [openib-general] Re: [PATCH v2 2/7] AMSO1100 WR / Event Definitions. In-Reply-To: References: <20060607200646.9259.24588.stgit@stevo-desktop> <1149712762.27684.82.camel@stevo-desktop> Message-ID: <1149713972.27684.97.camel@stevo-desktop> On Wed, 2006-06-07 at 13:43 -0700, Roland Dreier wrote: > I just realized it could be the spam filters. You have some comments > with three 'X's in a row which might be getting it blocked. Is that > possible? There are other files that have comments with 'XXX' like c2_provider.c and c2_qp.c which is in patch 3/7 and it made it through. 
These 'XXX' comments need to be cleaned up anyway, so I'll remove them (or address the issue if there is one) and we'll see next time I post a new version. Steve. From tom at opengridcomputing.com Wed Jun 7 15:13:27 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 07 Jun 2006 17:13:27 -0500 Subject: [openib-general] Re: [PATCH v2 2/2] iWARP Core Changes. In-Reply-To: <20060607200610.9003.54068.stgit@stevo-desktop> References: <20060607200600.9003.56328.stgit@stevo-desktop> <20060607200610.9003.54068.stgit@stevo-desktop> Message-ID: <1149718407.9716.15.camel@trinity.ogc.int> A reference is being taken on an iWARP device that is never getting released. This prevents a participating iWARP netdev device from being unloaded after a connection has been established on the passive side. Search for ip_dev_find below... On Wed, 2006-06-07 at 15:06 -0500, Steve Wise wrote: > This patch contains modifications to the existing rdma header files, > core files, drivers, and ulp files to support iWARP. 
> > Review updates: > > - copy_addr() -> rdma_copy_addr() > > - dst_dev_addr param in rdma_copy_addr to const. > > - various spacing nits with recasting > > - include linux/inetdevice.h to get ip_dev_find() prototype. > --- > > drivers/infiniband/core/Makefile | 4 > drivers/infiniband/core/addr.c | 19 + > drivers/infiniband/core/cache.c | 8 - > drivers/infiniband/core/cm.c | 3 > drivers/infiniband/core/cma.c | 353 +++++++++++++++++++++++--- > drivers/infiniband/core/device.c | 6 > drivers/infiniband/core/mad.c | 11 + > drivers/infiniband/core/sa_query.c | 5 > drivers/infiniband/core/smi.c | 18 + > drivers/infiniband/core/sysfs.c | 18 + > drivers/infiniband/core/ucm.c | 5 > drivers/infiniband/core/user_mad.c | 9 - > drivers/infiniband/hw/ipath/ipath_verbs.c | 2 > drivers/infiniband/hw/mthca/mthca_provider.c | 2 > drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 + > drivers/infiniband/ulp/srp/ib_srp.c | 2 > include/rdma/ib_addr.h | 15 + > include/rdma/ib_verbs.h | 39 +++ > 18 files changed, 435 insertions(+), 92 deletions(-) > > diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile > index 68e73ec..163d991 100644 > --- a/drivers/infiniband/core/Makefile > +++ b/drivers/infiniband/core/Makefile > @@ -1,7 +1,7 @@ > infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o > > obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ > - ib_cm.o $(infiniband-y) > + ib_cm.o iw_cm.o $(infiniband-y) > obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o > obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o > > @@ -14,6 +14,8 @@ ib_sa-y := sa_query.o > > ib_cm-y := cm.o > > +iw_cm-y := iwcm.o > + > rdma_cm-y := cma.o > > ib_addr-y := addr.o > diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c > index d294bbc..83f84ef 100644 > --- a/drivers/infiniband/core/addr.c > +++ b/drivers/infiniband/core/addr.c > @@ -32,6 +32,7 @@ #include > #include > #include > #include > +#include > #include > #include > #include > @@ 
-60,12 +61,15 @@ static LIST_HEAD(req_list); > static DECLARE_WORK(work, process_req, NULL); > static struct workqueue_struct *addr_wq; > > -static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, > - unsigned char *dst_dev_addr) > +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, > + const unsigned char *dst_dev_addr) > { > switch (dev->type) { > case ARPHRD_INFINIBAND: > - dev_addr->dev_type = IB_NODE_CA; > + dev_addr->dev_type = RDMA_NODE_IB_CA; > + break; > + case ARPHRD_ETHER: > + dev_addr->dev_type = RDMA_NODE_RNIC; > break; > default: > return -EADDRNOTAVAIL; > @@ -77,6 +81,7 @@ static int copy_addr(struct rdma_dev_add > memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); > return 0; > } > +EXPORT_SYMBOL(rdma_copy_addr); > > int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) > { > @@ -88,7 +93,7 @@ int rdma_translate_ip(struct sockaddr *a > if (!dev) > return -EADDRNOTAVAIL; > > - ret = copy_addr(dev_addr, dev, NULL); > + ret = rdma_copy_addr(dev_addr, dev, NULL); > dev_put(dev); > return ret; > } > @@ -160,7 +165,7 @@ static int addr_resolve_remote(struct so > > /* If the device does ARP internally, return 'done' */ > if (rt->idev->dev->flags & IFF_NOARP) { > - copy_addr(addr, rt->idev->dev, NULL); > + rdma_copy_addr(addr, rt->idev->dev, NULL); > goto put; > } > > @@ -180,7 +185,7 @@ static int addr_resolve_remote(struct so > src_in->sin_addr.s_addr = rt->rt_src; > } > > - ret = copy_addr(addr, neigh->dev, neigh->ha); > + ret = rdma_copy_addr(addr, neigh->dev, neigh->ha); > release: > neigh_release(neigh); > put: > @@ -244,7 +249,7 @@ static int addr_resolve_local(struct soc > if (ZERONET(src_ip)) { > src_in->sin_family = dst_in->sin_family; > src_in->sin_addr.s_addr = dst_ip; > - ret = copy_addr(addr, dev, dev->dev_addr); > + ret = rdma_copy_addr(addr, dev, dev->dev_addr); > } else if (LOOPBACK(src_ip)) { > ret = rdma_translate_ip((struct sockaddr *)dst_in, addr); > if (!ret) 
> diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c > index e05ca2c..061858c 100644 > --- a/drivers/infiniband/core/cache.c > +++ b/drivers/infiniband/core/cache.c > @@ -32,13 +32,12 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ > + * $Id: cache.c 6885 2006-05-03 18:22:02Z sean.hefty $ > */ > > #include > #include > #include > -#include /* INIT_WORK, schedule_work(), flush_scheduled_work() */ > > #include > > @@ -62,12 +61,13 @@ struct ib_update_work { > > static inline int start_port(struct ib_device *device) > { > - return device->node_type == IB_NODE_SWITCH ? 0 : 1; > + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; > } > > static inline int end_port(struct ib_device *device) > { > - return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; > + return (device->node_type == RDMA_NODE_IB_SWITCH) ? > + 0 : device->phys_port_cnt; > } > > int ib_get_cached_gid(struct ib_device *device, > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index 1c7463b..cf43ccb 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -3253,6 +3253,9 @@ static void cm_add_one(struct ib_device > int ret; > u8 i; > > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * > device->phys_port_cnt, GFP_KERNEL); > if (!cm_dev) > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index 94555d2..414600c 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -35,6 +35,7 @@ #include > #include > #include > #include > +#include > > #include > > @@ -43,6 +44,7 @@ #include > #include > #include > #include > +#include > > MODULE_AUTHOR("Sean Hefty"); > MODULE_DESCRIPTION("Generic RDMA CM Agent"); > @@ -124,6 +126,7 @@ struct rdma_id_private { > int 
query_id; > union { > struct ib_cm_id *ib; > + struct iw_cm_id *iw; > } cm_id; > > u32 seq_num; > @@ -259,13 +262,23 @@ static void cma_detach_from_dev(struct r > id_priv->cma_dev = NULL; > } > > -static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) > +static int cma_acquire_dev(struct rdma_id_private *id_priv) > { > + enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type; > struct cma_device *cma_dev; > union ib_gid *gid; > int ret = -ENODEV; > > - gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); > + switch (rdma_node_get_transport(dev_type)) { > + case RDMA_TRANSPORT_IB: > + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); > + break; > + case RDMA_TRANSPORT_IWARP: > + gid = iw_addr_get_sgid(&id_priv->id.route.addr.dev_addr); > + break; > + default: > + return -ENODEV; > + } > > mutex_lock(&lock); > list_for_each_entry(cma_dev, &dev_list, list) { > @@ -280,16 +293,6 @@ static int cma_acquire_ib_dev(struct rdm > return ret; > } > > -static int cma_acquire_dev(struct rdma_id_private *id_priv) > -{ > - switch (id_priv->id.route.addr.dev_addr.dev_type) { > - case IB_NODE_CA: > - return cma_acquire_ib_dev(id_priv); > - default: > - return -ENODEV; > - } > -} > - > static void cma_deref_id(struct rdma_id_private *id_priv) > { > if (atomic_dec_and_test(&id_priv->refcount)) > @@ -347,6 +350,16 @@ static int cma_init_ib_qp(struct rdma_id > IB_QP_PKEY_INDEX | IB_QP_PORT); > } > > +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) > +{ > + struct ib_qp_attr qp_attr; > + > + qp_attr.qp_state = IB_QPS_INIT; > + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; > + > + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); > +} > + > int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, > struct ib_qp_init_attr *qp_init_attr) > { > @@ -362,10 +375,13 @@ int rdma_create_qp(struct rdma_cm_id *id > if (IS_ERR(qp)) > return PTR_ERR(qp); > > - switch (id->device->node_type) { > - case 
IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = cma_init_ib_qp(id_priv, qp); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_init_iw_qp(id_priv, qp); > + break; > default: > ret = -ENOSYS; > break; > @@ -451,13 +467,17 @@ int rdma_init_qp_attr(struct rdma_cm_id > int ret; > > id_priv = container_of(id, struct rdma_id_private, id); > - switch (id_priv->id.device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, > qp_attr_mask); > if (qp_attr->qp_state == IB_QPS_RTR) > qp_attr->rq_psn = id_priv->seq_num; > break; > + case RDMA_TRANSPORT_IWARP: > + ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr, > + qp_attr_mask); > + break; > default: > ret = -ENOSYS; > break; > @@ -590,8 +610,8 @@ static int cma_notify_user(struct rdma_i > > static void cma_cancel_route(struct rdma_id_private *id_priv) > { > - switch (id_priv->id.device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { > + case RDMA_TRANSPORT_IB: > if (id_priv->query) > ib_sa_cancel_query(id_priv->query_id, id_priv->query); > break; > @@ -611,11 +631,15 @@ static void cma_destroy_listen(struct rd > cma_exch(id_priv, CMA_DESTROYING); > > if (id_priv->cma_dev) { > - switch (id_priv->id.device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { > + case RDMA_TRANSPORT_IB: > if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) > ib_destroy_cm_id(id_priv->cm_id.ib); > break; > + case RDMA_TRANSPORT_IWARP: > + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) > + iw_destroy_cm_id(id_priv->cm_id.iw); > + break; > default: > break; > } > @@ -690,11 +714,15 @@ void rdma_destroy_id(struct rdma_cm_id * > cma_cancel_operation(id_priv, state); > > if (id_priv->cma_dev) { > - switch (id->device->node_type) { > - 
case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) > ib_destroy_cm_id(id_priv->cm_id.ib); > break; > + case RDMA_TRANSPORT_IWARP: > + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) > + iw_destroy_cm_id(id_priv->cm_id.iw); > + break; > default: > break; > } > @@ -868,7 +896,7 @@ static struct rdma_id_private *cma_new_i > ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); > ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); > ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); > - rt->addr.dev_addr.dev_type = IB_NODE_CA; > + rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA; > > id_priv = container_of(id, struct rdma_id_private, id); > id_priv->state = CMA_CONNECT; > @@ -897,7 +925,7 @@ static int cma_req_handler(struct ib_cm_ > } > > atomic_inc(&conn_id->dev_remove); > - ret = cma_acquire_ib_dev(conn_id); > + ret = cma_acquire_dev(conn_id); > if (ret) { > ret = -ENODEV; > cma_release_remove(conn_id); > @@ -981,6 +1009,123 @@ static void cma_set_compare_data(enum rd > } > } > > +static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) > +{ > + struct rdma_id_private *id_priv = iw_id->context; > + enum rdma_cm_event_type event = 0; > + struct sockaddr_in *sin; > + int ret = 0; > + > + atomic_inc(&id_priv->dev_remove); > + > + switch (iw_event->event) { > + case IW_CM_EVENT_CLOSE: > + event = RDMA_CM_EVENT_DISCONNECTED; > + break; > + case IW_CM_EVENT_CONNECT_REPLY: > + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > + *sin = iw_event->local_addr; > + sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; > + *sin = iw_event->remote_addr; > + if (iw_event->status) > + event = RDMA_CM_EVENT_REJECTED; > + else > + event = RDMA_CM_EVENT_ESTABLISHED; > + break; > + case IW_CM_EVENT_ESTABLISHED: > + event = RDMA_CM_EVENT_ESTABLISHED; > + break; > + default: > + BUG_ON(1); > + 
} > + > + ret = cma_notify_user(id_priv, event, iw_event->status, > + iw_event->private_data, > + iw_event->private_data_len); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. */ > + id_priv->cm_id.iw = NULL; > + cma_exch(id_priv, CMA_DESTROYING); > + cma_release_remove(id_priv); > + rdma_destroy_id(&id_priv->id); > + return ret; > + } > + > + cma_release_remove(id_priv); > + return ret; > +} > + > +static int iw_conn_req_handler(struct iw_cm_id *cm_id, > + struct iw_cm_event *iw_event) > +{ > + struct rdma_cm_id *new_cm_id; > + struct rdma_id_private *listen_id, *conn_id; > + struct sockaddr_in *sin; > + struct net_device *dev; > + int ret; > + > + listen_id = cm_id->context; > + atomic_inc(&listen_id->dev_remove); > + if (!cma_comp(listen_id, CMA_LISTEN)) { > + ret = -ECONNABORTED; > + goto out; > + } > + > + /* Create a new RDMA id for the new IW CM ID */ > + new_cm_id = rdma_create_id(listen_id->id.event_handler, > + listen_id->id.context, > + RDMA_PS_TCP); > + if (!new_cm_id) { > + ret = -ENOMEM; > + goto out; > + } > + conn_id = container_of(new_cm_id, struct rdma_id_private, id); > + atomic_inc(&conn_id->dev_remove); > + conn_id->state = CMA_CONNECT; [review note: we take a reference on the iWARP device here that we never release] > + dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); > + if (!dev) { > + ret = -EADDRNOTAVAIL; > + rdma_destroy_id(new_cm_id); > + goto out; > + } > + ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); > + if (ret) { > + rdma_destroy_id(new_cm_id); > + goto out; > + } > + > + ret = cma_acquire_dev(conn_id); > + if (ret) { > + rdma_destroy_id(new_cm_id); > + goto out; > + } > + > + conn_id->cm_id.iw = cm_id; > + cm_id->context = conn_id; > + cm_id->cm_handler = cma_iw_handler; > + > + sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr; > + *sin = iw_event->local_addr; > + sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr; > + *sin = iw_event->remote_addr; > + > + ret =
cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, > + iw_event->private_data, > + iw_event->private_data_len); > + if (ret) { > + /* User wants to destroy the CM ID */ > + conn_id->cm_id.iw = NULL; > + cma_exch(conn_id, CMA_DESTROYING); > + cma_release_remove(conn_id); > + rdma_destroy_id(&conn_id->id); > + } > + > +out: [review note: we need to put a dev_put here, or the reference on the device will never get released and you won't be able to remove it after you've had at least one connection. This is my bug.] dev_put(dev); > + cma_release_remove(listen_id); > + return ret; > +} > + > static int cma_ib_listen(struct rdma_id_private *id_priv) > { > struct ib_cm_compare_data compare_data; > @@ -1010,6 +1155,30 @@ static int cma_ib_listen(struct rdma_id_ > return ret; > } > > +static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) > +{ > + int ret; > + struct sockaddr_in *sin; > + > + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, > + iw_conn_req_handler, > + id_priv); > + if (IS_ERR(id_priv->cm_id.iw)) > + return PTR_ERR(id_priv->cm_id.iw); > + > + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > + id_priv->cm_id.iw->local_addr = *sin; > + > + ret = iw_cm_listen(id_priv->cm_id.iw, backlog); > + > + if (ret) { > + iw_destroy_cm_id(id_priv->cm_id.iw); > + id_priv->cm_id.iw = NULL; > + } > + > + return ret; > +} > + > static int cma_listen_handler(struct rdma_cm_id *id, > struct rdma_cm_event *event) > { > @@ -1085,12 +1254,17 @@ int rdma_listen(struct rdma_cm_id *id, i > return -EINVAL; > > if (id->device) { > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = cma_ib_listen(id_priv); > if (ret) > goto err; > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_iw_listen(id_priv, backlog); > + if (ret) > + goto err; > + break; > default: > ret = -ENOSYS; > goto err; > @@ -1229,6 +1403,23 @@ err: > }
EXPORT_SYMBOL(rdma_set_ib_paths); > > +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) > +{ > + struct cma_work *work; > + > + work = kzalloc(sizeof *work, GFP_KERNEL); > + if (!work) > + return -ENOMEM; > + > + work->id = id_priv; > + INIT_WORK(&work->work, cma_work_handler, work); > + work->old_state = CMA_ROUTE_QUERY; > + work->new_state = CMA_ROUTE_RESOLVED; > + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; > + queue_work(cma_wq, &work->work); > + return 0; > +} > + > int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) > { > struct rdma_id_private *id_priv; > @@ -1239,10 +1430,13 @@ int rdma_resolve_route(struct rdma_cm_id > return -EINVAL; > > atomic_inc(&id_priv->refcount); > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = cma_resolve_ib_route(id_priv, timeout_ms); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_resolve_iw_route(id_priv, timeout_ms); > + break; > default: > ret = -ENOSYS; > break; > @@ -1354,8 +1548,8 @@ static int cma_resolve_loopback(struct r > ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr)); > > if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { > - src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; > - dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr; > + src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > + dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; > src_in->sin_family = dst_in->sin_family; > src_in->sin_addr.s_addr = dst_in->sin_addr.s_addr; > } > @@ -1646,6 +1840,47 @@ out: > return ret; > } > > +static int cma_connect_iw(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct iw_cm_id *cm_id; > + struct sockaddr_in* sin; > + int ret; > + struct iw_cm_conn_param iw_param; > + > + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); > + if (IS_ERR(cm_id)) 
{ > + ret = PTR_ERR(cm_id); > + goto out; > + } > + > + id_priv->cm_id.iw = cm_id; > + > + sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr; > + cm_id->local_addr = *sin; > + > + sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr; > + cm_id->remote_addr = *sin; > + > + ret = cma_modify_qp_rtr(&id_priv->id); > + if (ret) { > + iw_destroy_cm_id(cm_id); > + return ret; > + } > + > + iw_param.ord = conn_param->initiator_depth; > + iw_param.ird = conn_param->responder_resources; > + iw_param.private_data = conn_param->private_data; > + iw_param.private_data_len = conn_param->private_data_len; > + if (id_priv->id.qp) > + iw_param.qpn = id_priv->qp_num; > + else > + iw_param.qpn = conn_param->qp_num; > + ret = iw_cm_connect(cm_id, &iw_param); > +out: > + return ret; > +} > + > int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > { > struct rdma_id_private *id_priv; > @@ -1661,10 +1896,13 @@ int rdma_connect(struct rdma_cm_id *id, > id_priv->srq = conn_param->srq; > } > > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = cma_connect_ib(id_priv, conn_param); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_connect_iw(id_priv, conn_param); > + break; > default: > ret = -ENOSYS; > break; > @@ -1705,6 +1943,28 @@ static int cma_accept_ib(struct rdma_id_ > return ib_send_cm_rep(id_priv->cm_id.ib, &rep); > } > > +static int cma_accept_iw(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct iw_cm_conn_param iw_param; > + int ret; > + > + ret = cma_modify_qp_rtr(&id_priv->id); > + if (ret) > + return ret; > + > + iw_param.ord = conn_param->initiator_depth; > + iw_param.ird = conn_param->responder_resources; > + iw_param.private_data = conn_param->private_data; > + iw_param.private_data_len = conn_param->private_data_len; > + if (id_priv->id.qp) { > + iw_param.qpn = id_priv->qp_num; > + } else 
> + iw_param.qpn = conn_param->qp_num; > + > + return iw_cm_accept(id_priv->cm_id.iw, &iw_param); > +} > + > int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > { > struct rdma_id_private *id_priv; > @@ -1720,13 +1980,16 @@ int rdma_accept(struct rdma_cm_id *id, s > id_priv->srq = conn_param->srq; > } > > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > if (conn_param) > ret = cma_accept_ib(id_priv, conn_param); > else > ret = cma_rep_recv(id_priv); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_accept_iw(id_priv, conn_param); > + break; > default: > ret = -ENOSYS; > break; > @@ -1753,12 +2016,16 @@ int rdma_reject(struct rdma_cm_id *id, c > if (!cma_comp(id_priv, CMA_CONNECT)) > return -EINVAL; > > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = ib_send_cm_rej(id_priv->cm_id.ib, > IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, > private_data, private_data_len); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = iw_cm_reject(id_priv->cm_id.iw, > + private_data, private_data_len); > + break; > default: > ret = -ENOSYS; > break; > @@ -1777,16 +2044,18 @@ int rdma_disconnect(struct rdma_cm_id *i > !cma_comp(id_priv, CMA_DISCONNECT)) > return -EINVAL; > > - ret = cma_modify_qp_err(id); > - if (ret) > - goto out; > - > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > + ret = cma_modify_qp_err(id); > + if (ret) > + goto out; > /* Initiate or respond to a disconnect. 
*/ > if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) > ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = iw_cm_disconnect(id_priv->cm_id.iw, 0); > + break; > default: > break; > } > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c > index b2f3cb9..7318fba 100644 > --- a/drivers/infiniband/core/device.c > +++ b/drivers/infiniband/core/device.c > @@ -30,7 +30,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ > + * $Id: device.c 5943 2006-03-22 00:58:04Z roland $ > */ > > #include > @@ -505,7 +505,7 @@ int ib_query_port(struct ib_device *devi > u8 port_num, > struct ib_port_attr *port_attr) > { > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > if (port_num) > return -EINVAL; > } else if (port_num < 1 || port_num > device->phys_port_cnt) > @@ -580,7 +580,7 @@ int ib_modify_port(struct ib_device *dev > u8 port_num, int port_modify_mask, > struct ib_port_modify *port_modify) > { > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > if (port_num) > return -EINVAL; > } else if (port_num < 1 || port_num > device->phys_port_cnt) > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index b38e02a..a928ecf 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. > * Copyright (c) 2005 Intel Corporation. All rights reserved. > * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. > * > @@ -31,7 +31,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. 
> * > - * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ > + * $Id: mad.c 7294 2006-05-17 18:12:30Z roland $ > */ > #include > #include > @@ -2877,7 +2877,10 @@ static void ib_mad_init_device(struct ib > { > int start, end, i; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > start = 0; > end = 0; > } else { > @@ -2924,7 +2927,7 @@ static void ib_mad_remove_device(struct > { > int i, num_ports, cur_port; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > num_ports = 1; > cur_port = 0; > } else { > diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c > index 501cc05..4230277 100644 > --- a/drivers/infiniband/core/sa_query.c > +++ b/drivers/infiniband/core/sa_query.c > @@ -887,7 +887,10 @@ static void ib_sa_add_one(struct ib_devi > struct ib_sa_device *sa_dev; > int s, e, i; > > - if (device->node_type == IB_NODE_SWITCH) > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > + if (device->node_type == RDMA_NODE_IB_SWITCH) > s = e = 0; > else { > s = 1; > diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c > index 35852e7..b81b2b9 100644 > --- a/drivers/infiniband/core/smi.c > +++ b/drivers/infiniband/core/smi.c > @@ -34,7 +34,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. 
> * > - * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ > + * $Id: smi.c 5258 2006-02-01 20:32:40Z sean.hefty $ > */ > > #include > @@ -64,7 +64,7 @@ int smi_handle_dr_smp_send(struct ib_smp > > /* C14-9:2 */ > if (hop_ptr && hop_ptr < hop_cnt) { > - if (node_type != IB_NODE_SWITCH) > + if (node_type != RDMA_NODE_IB_SWITCH) > return 0; > > /* smp->return_path set when received */ > @@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp > if (hop_ptr == hop_cnt) { > /* smp->return_path set when received */ > smp->hop_ptr++; > - return (node_type == IB_NODE_SWITCH || > + return (node_type == RDMA_NODE_IB_SWITCH || > smp->dr_dlid == IB_LID_PERMISSIVE); > } > > @@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp > > /* C14-13:2 */ > if (2 <= hop_ptr && hop_ptr <= hop_cnt) { > - if (node_type != IB_NODE_SWITCH) > + if (node_type != RDMA_NODE_IB_SWITCH) > return 0; > > smp->hop_ptr--; > @@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp > if (hop_ptr == 1) { > smp->hop_ptr--; > /* C14-13:3 -- SMPs destined for SM shouldn't be here */ > - return (node_type == IB_NODE_SWITCH || > + return (node_type == RDMA_NODE_IB_SWITCH || > smp->dr_slid == IB_LID_PERMISSIVE); > } > > @@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp > > /* C14-9:2 -- intermediate hop */ > if (hop_ptr && hop_ptr < hop_cnt) { > - if (node_type != IB_NODE_SWITCH) > + if (node_type != RDMA_NODE_IB_SWITCH) > return 0; > > smp->return_path[hop_ptr] = port_num; > @@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp > smp->return_path[hop_ptr] = port_num; > /* smp->hop_ptr updated when sending */ > > - return (node_type == IB_NODE_SWITCH || > + return (node_type == RDMA_NODE_IB_SWITCH || > smp->dr_dlid == IB_LID_PERMISSIVE); > } > > @@ -175,7 +175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp > > /* C14-13:2 */ > if (2 <= hop_ptr && hop_ptr <= hop_cnt) { > - if (node_type != IB_NODE_SWITCH) > + if (node_type != RDMA_NODE_IB_SWITCH) > return 0; > > /* smp->hop_ptr updated 
when sending */ > @@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp > return 1; > } > /* smp->hop_ptr updated when sending */ > - return (node_type == IB_NODE_SWITCH); > + return (node_type == RDMA_NODE_IB_SWITCH); > } > > /* C14-13:4 -- hop_ptr = 0 -> give to SM */ > diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c > index 21f9282..cfd2c06 100644 > --- a/drivers/infiniband/core/sysfs.c > +++ b/drivers/infiniband/core/sysfs.c > @@ -31,7 +31,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ > + * $Id: sysfs.c 6940 2006-05-04 17:04:55Z roland $ > */ > > #include "core_priv.h" > @@ -589,10 +589,16 @@ static ssize_t show_node_type(struct cla > return -ENODEV; > > switch (dev->node_type) { > - case IB_NODE_CA: return sprintf(buf, "%d: CA\n", dev->node_type); > - case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type); > - case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type); > - default: return sprintf(buf, "%d: \n", dev->node_type); > + case RDMA_NODE_IB_CA: > + return sprintf(buf, "%d: CA\n", dev->node_type); > + case RDMA_NODE_RNIC: > + return sprintf(buf, "%d: RNIC\n", dev->node_type); > + case RDMA_NODE_IB_SWITCH: > + return sprintf(buf, "%d: switch\n", dev->node_type); > + case RDMA_NODE_IB_ROUTER: > + return sprintf(buf, "%d: router\n", dev->node_type); > + default: > + return sprintf(buf, "%d: \n", dev->node_type); > } > } > > @@ -708,7 +714,7 @@ int ib_device_register_sysfs(struct ib_d > if (ret) > goto err_put; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > ret = add_port(device, 0); > if (ret) > goto err_put; > diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c > index 67caf36..ad2e417 100644 > --- a/drivers/infiniband/core/ucm.c > +++ b/drivers/infiniband/core/ucm.c > @@ -30,7 +30,7 @@ > * CONNECTION WITH THE 
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ > + * $Id: ucm.c 7119 2006-05-11 16:40:38Z sean.hefty $ > */ > > #include > @@ -1248,7 +1248,8 @@ static void ib_ucm_add_one(struct ib_dev > { > struct ib_ucm_device *ucm_dev; > > - if (!device->alloc_ucontext) > + if (!device->alloc_ucontext || > + rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > return; > > ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); > diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c > index afe70a5..0cbd692 100644 > --- a/drivers/infiniband/core/user_mad.c > +++ b/drivers/infiniband/core/user_mad.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004 Topspin Communications. All rights reserved. > - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2005-2006 Voltaire, Inc. All rights reserved. > * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -31,7 +31,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. 
> * > - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ > + * $Id: user_mad.c 6041 2006-03-27 21:06:00Z halr $ > */ > > #include > @@ -967,7 +967,10 @@ static void ib_umad_add_one(struct ib_de > struct ib_umad_device *umad_dev; > int s, e, i; > > - if (device->node_type == IB_NODE_SWITCH) > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > + if (device->node_type == RDMA_NODE_IB_SWITCH) > s = e = 0; > else { > s = 1; > diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c > index 28fdbda..e4b45d7 100644 > --- a/drivers/infiniband/hw/ipath/ipath_verbs.c > +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c > @@ -984,7 +984,7 @@ static void *ipath_register_ib_device(in > (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | > (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | > (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); > - dev->node_type = IB_NODE_CA; > + dev->node_type = RDMA_NODE_IB_CA; > dev->phys_port_cnt = 1; > dev->dma_device = ipath_layer_get_device(dd); > dev->class_dev.dev = dev->dma_device; > diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c > index a2eae8a..5c31819 100644 > --- a/drivers/infiniband/hw/mthca/mthca_provider.c > +++ b/drivers/infiniband/hw/mthca/mthca_provider.c > @@ -1273,7 +1273,7 @@ int mthca_register_device(struct mthca_d > (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | > (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | > (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); > - dev->ib_dev.node_type = IB_NODE_CA; > + dev->ib_dev.node_type = RDMA_NODE_IB_CA; > dev->ib_dev.phys_port_cnt = dev->limits.num_ports; > dev->ib_dev.dma_device = &dev->pdev->dev; > dev->ib_dev.class_dev.dev = &dev->pdev->dev; > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 1c6ea1c..262427f 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -1084,13 
+1084,16 @@ static void ipoib_add_one(struct ib_devi > struct ipoib_dev_priv *priv; > int s, e, p; > > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); > if (!dev_list) > return; > > INIT_LIST_HEAD(dev_list); > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > s = 0; > e = 0; > } else { > @@ -1114,6 +1117,9 @@ static void ipoib_remove_one(struct ib_d > struct ipoib_dev_priv *priv, *tmp; > struct list_head *dev_list; > > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > dev_list = ib_get_client_data(device, &ipoib_client); > > list_for_each_entry_safe(priv, tmp, dev_list, list) { > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c > index f1401e1..bba2956 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -1845,7 +1845,7 @@ static void srp_add_one(struct ib_device > if (IS_ERR(srp_dev->fmr_pool)) > srp_dev->fmr_pool = NULL; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > s = 0; > e = 0; > } else { > diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h > index fcb5ba8..d95d3eb 100644 > --- a/include/rdma/ib_addr.h > +++ b/include/rdma/ib_addr.h > @@ -40,7 +40,7 @@ struct rdma_dev_addr { > unsigned char src_dev_addr[MAX_ADDR_LEN]; > unsigned char dst_dev_addr[MAX_ADDR_LEN]; > unsigned char broadcast[MAX_ADDR_LEN]; > - enum ib_node_type dev_type; > + enum rdma_node_type dev_type; > }; > > /** > @@ -72,6 +72,9 @@ int rdma_resolve_ip(struct sockaddr *src > > void rdma_addr_cancel(struct rdma_dev_addr *addr); > > +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, > + const unsigned char *dst_dev_addr); > + > static inline int ip_addr_size(struct sockaddr *addr) > { > return addr->sa_family == AF_INET6 ? 
> @@ -111,4 +114,14 @@ static inline void ib_addr_set_dgid(stru > memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); > } > > +static inline union ib_gid* iw_addr_get_sgid(struct rdma_dev_addr* rda) > +{ > + return (union ib_gid *) rda->src_dev_addr; > +} > + > +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) > +{ > + return (union ib_gid *) rda->dst_dev_addr; > +} > + > #endif /* IB_ADDR_H */ > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index aeb4fcd..eac2d8f 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -35,7 +35,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ > + * $Id: ib_verbs.h 6885 2006-05-03 18:22:02Z sean.hefty $ > */ > > #if !defined(IB_VERBS_H) > @@ -56,12 +56,35 @@ union ib_gid { > } global; > }; > > -enum ib_node_type { > - IB_NODE_CA = 1, > - IB_NODE_SWITCH, > - IB_NODE_ROUTER > +enum rdma_node_type { > + /* IB values map to NodeInfo:NodeType. 
*/ > + RDMA_NODE_IB_CA = 1, > + RDMA_NODE_IB_SWITCH, > + RDMA_NODE_IB_ROUTER, > + RDMA_NODE_RNIC > }; > > +enum rdma_transport_type { > + RDMA_TRANSPORT_IB, > + RDMA_TRANSPORT_IWARP > +}; > + > +static inline enum rdma_transport_type > +rdma_node_get_transport(enum rdma_node_type node_type) > +{ > + switch (node_type) { > + case RDMA_NODE_IB_CA: > + case RDMA_NODE_IB_SWITCH: > + case RDMA_NODE_IB_ROUTER: > + return RDMA_TRANSPORT_IB; > + case RDMA_NODE_RNIC: > + return RDMA_TRANSPORT_IWARP; > + default: > + BUG(); > + return 0; > + } > +} > + > enum ib_device_cap_flags { > IB_DEVICE_RESIZE_MAX_WR = 1, > IB_DEVICE_BAD_PKEY_CNTR = (1<<1), > @@ -78,6 +101,9 @@ enum ib_device_cap_flags { > IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), > IB_DEVICE_SRQ_RESIZE = (1<<13), > IB_DEVICE_N_NOTIFY_CQ = (1<<14), > + IB_DEVICE_ZERO_STAG = (1<<15), > + IB_DEVICE_SEND_W_INV = (1<<16), > + IB_DEVICE_MEM_WINDOW = (1<<17) > }; > > enum ib_atomic_cap { > @@ -830,6 +856,7 @@ struct ib_cache { > u8 *lmc_cache; > }; > > +struct iw_cm_verbs; > struct ib_device { > struct device *dma_device; > > @@ -846,6 +873,8 @@ struct ib_device { > > u32 flags; > > + struct iw_cm_verbs *iwcm; > + > int (*query_device)(struct ib_device *device, > struct ib_device_attr *device_attr); > int (*query_port)(struct ib_device *device, > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From mshefty at ichips.intel.com Wed Jun 7 15:21:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 15:21:27 -0700 Subject: [openib-general] crash in ib_sa_mcmember_rec_callback while probing out ib_sa In-Reply-To: References: Message-ID: <44875167.2090405@ichips.intel.com> Roland Dreier wrote: > Looks like the same crash mst saw related to the multicast module > being unloaded and then having sa call back into it. 
One small clue: > > > esi: f38a5bec edi: f38a5bf4 ebp: fffffffc esp: f599be60 > > ebp is -4, which is -EINTR. So this may be a callback from sa_query's > send_handler() caused by an IB_WC_WR_FLUSH_ERR status. This makes sense given the call trace. When ib_sa is unloading, it unregisters its mad_agent, which results in canceling all outstanding MADs. What doesn't make sense to me is how ib_multicast could have unloaded while there are any outstanding SA queries. All queries hold a reference on a MC group until they complete. And all groups reference a port. The module shouldn't unload until all references are released on all ports. I removed some code that was intended to speed up cleanup, but is unnecessary. We can see if that helps, but I'm skeptical. - Sean From ardavis at ichips.intel.com Wed Jun 7 15:24:46 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 07 Jun 2006 15:24:46 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: Message-ID: <4487522E.7060808@ichips.intel.com> Scott Weitzenkamp (sweitzen) wrote: >Yes, the modules were loaded. > >Each of the 32 hosts had 3 IB ports up. Does Intel MPI or uDAPL use >multiple ports and/or multiple HCAs? > >I shut down all but one port on each host, and now Pallas is running >better on the 32 nodes using Intel MPI 2.0.1. HP MPI 2.2 started >working with Pallas too over uDAPL, so maybe this is a uDAPL issue? > > Can you tell me what adapters are installed (ibstat), how they are configured (ifconfig), and what your dat.conf looks like? It sounds like a device mapping issue during the dat_ia_open() processing. Multiple ports and HCAs should work fine, but some care is required in configuring the dat.conf so you consistently pick up the correct device across the cluster.
Intel MPI will simply open a device based on the provider/device name (example: setenv I_MPI_DAPL_PROVIDER=OpenIB-cma) defined in the dat.conf and query uDAPL for the address to be used for connections. This line in the dat.conf will determine which library to load and which IB device to open and bind to. If you have the same exact configuration on each node and know that ib0, ib1, ib2, etc. will always come up in the same order then you can simply use the same netdev names across the cluster and use the same exact copy of dat.conf on each node. Here are the dat.conf options for OpenIB-cma configurations.
# For cma version you specify as:
# network address, network hostname, or netdev name and 0 for port
#
# Simple (OpenIB-cma) default with netdev name provided first on list
# to enable use of same dat.conf version on all nodes
#
OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
OpenIB-cma-ip u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
OpenIB-cma-name u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
Which type are you using? address, hostname, or netdev names? Also, Intel MPI is sometimes too smart for its own good when opening rdma devices via uDAPL. If the open fails with the first rdma device specified in the dat.conf it will continue onto the next line until one is successful. If all rdma devices fail it will then go onto the static device automatically. This sometimes does more harm than good since one node could be failing over to the second device in your configuration and the other nodes are all on the first device. If they are all on the same subnet then it would work fine but if they are on different subnets then we would not be able to connect. If you send me your configuration, we can set it up here and hopefully duplicate your error case.
-arlin From sweitzen at cisco.com Wed Jun 7 15:44:42 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 7 Jun 2006 15:44:42 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: I have not touched /etc/dat.conf, so I am using whatever comes with OFED 1.0 rc5. For whatever reason, things have improved some. I am now running Intel MPI right after bringing up hosts (previously I was trying MVAPICH, then Open MPI, then HP MPI, then Intel MPI). I've run twice, and see these failures: Run #1 (after rebooting all hosts): rank 13 in job 1 192.168.1.1_34674 caused collective abort of all ranks^M exit status of rank 13: killed by signal 11 ^M ^[_releng at svbu-qaclus-1:/data/home/scott/builds/TopspinOS-2.7.0/build013 /protes\ t/Lk3/060706_123945/intel.intel^[\[releng at svbu-qaclus-1 intel.intel]$ ### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/prot\ est/Lk3/060706_123945/intel.intel/1149709233/IMB_2.3/src/IMB-MPI1 Allreduce : 0\ Run #2 (after rebooting all hosts): rank 6 in job 1 192.168.1.1_33649 caused collective abort of all ranks^M exit status of rank 6: killed by signal 11 ^M ^[_releng at svbu-qaclus-1:/data/home/scott/builds/TopspinOS-2.7.0/build013 /protes\ t/Lk3/060706_145739/intel.intel^[\[releng at svbu-qaclus-1 intel.intel]$ ### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/prot\ est/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Exchange : 0 rank 21 in job 1 192.168.1.1_34734 caused collective abort of all ranks^M exit status of rank 21: killed by signal 11 ^M ^[_releng at svbu-qaclus-1:/data/home/scott/builds/TopspinOS-2.7.0/build013 /protes\ t/Lk3/060706_145739/intel.intel^[\[releng at svbu-qaclus-1 intel.intel]$ ### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/prot\ est/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Allgatherrv -\ multi 1: 0 Scott Weitzenkamp SQA and 
Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Arlin Davis [mailto:ardavis at ichips.intel.com] > Sent: Wednesday, June 07, 2006 3:25 PM > To: Scott Weitzenkamp (sweitzen) > Cc: Davis, Arlin R; Lentini, James; openib-general > Subject: Re: [openib-general] [PATCH] uDAPL openib-cma > provider - add support for IB_CM_REQ_OPTIONS > > Scott Weitzenkamp (sweitzen) wrote: > > >Yes, the modules were loaded. > > > >Each of the 32 hosts had 3 IB ports up. Does Intel MPI or uDAPL use > >multiple ports and/or multiple HCAs? > > > >I shut down all but one port on each host, and now Pallas is running > >better on the 32 nodes using Intel MPI 2.0.1. HP MPI 2.2 started > >working too with Pallas too over uDAPL, so maybe this is a > uDAPL issue? > > > > > Can you tell me what adapters are installed (ibstat), how they are > configured (ifconfig), and what your dat.conf looks like? It sounds > like a device mapping issue during the dat_ia_open() processing. > > Multiple ports and HCAs should work fine but there is some > care required > in configuration of the dat.conf so you consitantly pick up > the correct > device across the cluster. Intel MPI will simply open a > device based on > the provider/device name (example: setenv > I_MPI_DAPL_PROVIDER=OpenIB-cma) defined in the dat.conf and > query dapl > for the address to be used for connections. This line in the dat.conf > will determine which library to load and which IB device to open and > bind too. If you have the same exact configuration on each > node and know > that the ib0,ib1,ib2, etc will always come up in the same > order then you > can simply use the same netdev names across the cluster and > use the same > exact copy of dat.conf on each node. > > Here are the dat.conf options for OpenIB-cma configurations. 
> > # For cma version you specify as: > # network address, network hostname, or netdev name and > 0 for port > # > # Simple (OpenIB-cma) default with netdev name provided first on list > # to enable use of same dat.conf version on all nodes > # > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so > mv_dapl.1.2 > "ib0 0" "" > OpenIB-cma-ip u1.2 nonthreadsafe default /usr/lib/libdaplcma.so > mv_dapl.1.2 "192.168.0.22 0" "" > OpenIB-cma-name u1.2 nonthreadsafe default /usr/lib/libdaplcma.so > mv_dapl.1.2 "svr1-ib0 0" "" > OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/lib/libdaplcma.so > mv_dapl.1.2 "ib0 0" "" > > Which type are you using? address, hostname, or netdev names? > > Also, Intel MPI is sometimes too smart for its own good when opening > rdma devices via uDAPL. If the open fails with the first rdma device > specified in the dat.conf it will continue onto the next line > until one > is successfull. If all rdma devices fail it will then go onto > the static > device automatcally. This sometimes does more harm then good > since one > node could be failing over to the second device in your configuration > and the other nodes are all on the first device. If they are > all on the > same subnet then it would work fine but if they are on > different subnets > then we would not be able to connect. > > If you send me your configuration, we can set it up here and > hopefully > duplicate your error case. 
> > -arlin > From mshefty at ichips.intel.com Wed Jun 7 17:19:37 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 17:19:37 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149706206.4510.292005.camel@hal.voltaire.com> References: <1149024804.4510.1056.camel@hal.voltaire.com> <20060531090817.GQ21266@mellanox.co.il> <447DC8F8.60409@ichips.intel.com> <1149095100.4510.29902.camel@hal.voltaire.com> <447DD2E4.3030709@ichips.intel.com> <44871A04.9010705@ichips.intel.com> <1149706206.4510.292005.camel@hal.voltaire.com> Message-ID: <44876D19.5040205@ichips.intel.com> Hal Rosenstock wrote: >> This >>leads to a race where NonMembers and SendOnlyNonMembers will fail to re-join >>until one of the FullMembers joins. > > Might also be true with joins (not creates) from FullMembers too. I > would presume in such cases, the join would be retried. SendOnlyMembers > (at least for IPoIB) do this if not joined every time a packet is sent. Correct. But all clients trying to rejoin groups must be aware of this, and delay / retry until their groups are recreated. Let me know if I'm off here, but it also appears that clients can't rely on an existing QP attachment or address handle to send to the new group. Even if a group is re-created, there's no guarantee that the SA didn't assign a different MLID to the group. So, the only safe thing to do is for all multicast clients to detach from all multicast groups, destroy all address handles, possibly wait for a new group to be created, and then start all over again. Is this correct? 
- Sean From halr at voltaire.com Wed Jun 7 17:55:27 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Jun 2006 20:55:27 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <44876D19.5040205@ichips.intel.com> References: <1149024804.4510.1056.camel@hal.voltaire.com> <20060531090817.GQ21266@mellanox.co.il> <447DC8F8.60409@ichips.intel.com> <1149095100.4510.29902.camel@hal.voltaire.com> <447DD2E4.3030709@ichips.intel.com> <44871A04.9010705@ichips.intel.com> <1149706206.4510.292005.camel@hal.voltaire.com> <44876D19.5040205@ichips.intel.com> Message-ID: <1149728121.4510.301957.camel@hal.voltaire.com> On Wed, 2006-06-07 at 20:19, Sean Hefty wrote: > Hal Rosenstock wrote: > >> This > >>leads to a race where NonMembers and SendOnlyNonMembers will fail to re-join > >>until one of the FullMembers joins. > > > > Might also be true with joins (not creates) from FullMembers too. I > > would presume in such cases, the join would be retried. SendOnlyMembers > > (at least for IPoIB) do this if not joined every time a packet is sent. > > Correct. But all clients trying to rejoin groups must be aware of this, and > delay / retry until their groups are recreated. I might be missing your point but UD is unreliable so the sends can be dropped. The delay/retry is to make sure the join does occur. > Let me know if I'm off here, but it also appears that clients can't rely on an > existing QP attachment or address handle to send to the new group. Even if a > group is re-created, there's no guarantee that the SA didn't assign a different > MLID to the group. Correct. I have seen this behavior with various dynamic groups. I know there was code in IPoIB to handle local LID changes (adjusting the AH). I'm not sure about whether multicast changes were handled too but I don't recall this. > So, the only safe thing to do is for all multicast clients to detach from all > multicast groups, destroy all address handles, Why all groups ?
> possibly wait for a new group to be created, and then start all over again. Start what all over again ? > Is this correct? I'm not completely following you yet. -- Hal > - Sean From sean.hefty at intel.com Wed Jun 7 19:48:42 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 7 Jun 2006 19:48:42 -0700 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: <1149728121.4510.301957.camel@hal.voltaire.com> Message-ID: >I might be missing your point but UD is unreliable so the sends can be >dropped. The delay/retry is to make sure the join does occur, This is different than a dropped request or reply. In this case, the receiver gets a reply, but it will be a failure from the SA to join the group. For example, a NonMember tries to re-join before a FullMember which would have created the group does. The result is that requests that receive a reply also need to be retried, with the timeout dependent on some remote node in the fabric creating the group. >> So, the only safe thing to do is for all multicast clients to detach from all >> multicast groups, destroy all address handles, > >Why all groups ? Because the SM has lost track that any groups in the fabric existed, so those groups must be recreated, all potentially with different mlids. >> possibly wait for a new group to be created, and then start all over again. > >Start what all over again ? I meant attach the QP to the new group and allocate a new address handle. This is a general comment, and not directed at anyone specific, but is this really the architecture and implementation that we want to aim for? I really think that we need to look at solutions that don't break existing communication, unless the links providing that communication actually go down, even if this means extending the architecture. 
- Sean From zhushisongzhu at yahoo.com Wed Jun 7 22:00:10 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Wed, 7 Jun 2006 22:00:10 -0700 (PDT) Subject: [openib-general] how about sdp progress In-Reply-To: <20060607174343.315582283DE@openib.ca.sandia.gov> Message-ID: <20060608050010.83564.qmail@web36910.mail.mud.yahoo.com> MST, how about sdp progress now? zhu __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From tziporet at mellanox.co.il Wed Jun 7 23:25:55 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 8 Jun 2006 09:25:55 +0300 Subject: [openib-general] OFED-1.0-rc6 is available Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7162@mtlexch01.mtl.com> We sure will. -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Wednesday, June 07, 2006 11:03 PM To: Tziporet Koren; openfabrics-ewg at openib.org; openib-general Subject: Re: [openib-general] OFED-1.0-rc6 is available We also just found a bug in how ibsrpdm discovers Cisco/Topspin FC gateways. The patch is below, and is also checked in to the trunk as svn rev 7803. Please include this in OFED 1.0 final. Thanks, Roland --- srptools/ChangeLog (revision 7796) +++ srptools/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2006-06-07 Roland Dreier + * src/srp-dm.c (do_port): Use correct endianness when comparing + GUID against Topspin OUI. + + * src/srp-dm.c (set_class_port_info): Trivial whitespace fixes. 
+ 2006-05-29 Ishai Rabinovitz * src/srp-dm.c (main): The agent ID array is declared with 0 --- srptools/src/srp-dm.c (revision 7796) +++ srptools/src/srp-dm.c (working copy) @@ -52,8 +52,6 @@ #include "ib_user_mad.h" #include "srp-dm.h" -static const uint8_t topspin_oui[3] = { 0x00, 0x05, 0xad }; - static char *umad_dev = "/dev/infiniband/umad0"; static char *port_sysfs_path; static int timeout_ms = 25000; @@ -249,7 +247,7 @@ static int set_class_port_info(int fd, u init_srp_dm_mad(&out_mad, agent[1], dlid, SRP_DM_ATTR_CLASS_PORT_INFO, 0); - out_dm_mad = (void *) out_mad.data; + out_dm_mad = (void *) out_mad.data; out_dm_mad->method = SRP_DM_METHOD_SET; cpi = (void *) out_dm_mad->data; @@ -266,9 +264,8 @@ static int set_class_port_info(int fd, u return -1; } - for (i = 0; i < 8; ++i) { + for (i = 0; i < 8; ++i) ((uint16_t *) cpi->trap_gid)[i] = htons(strtol(val + i * 5, NULL, 16)); - } if (send_and_get(fd, &out_mad, &in_mad, 0) < 0) return -1; @@ -371,7 +368,10 @@ static int do_port(int fd, uint32_t agen struct srp_dm_svc_entries svc_entries; int i, j, k; - if (!memcmp(&guid, topspin_oui, 3) && + static const uint64_t topspin_oui = 0x0005ad0000000000ull; + static const uint64_t oui_mask = 0xffffff0000000000ull; + + if ((guid & oui_mask) == topspin_oui && set_class_port_info(fd, agent, dlid)) fprintf(stderr, "Warning: set of ClassPortInfo failed\n"); From jackm at mellanox.co.il Wed Jun 7 23:42:48 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 8 Jun 2006 09:42:48 +0300 Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: <200606071639.03787.jackm@mellanox.co.il> Message-ID: <200606080942.48767.jackm@mellanox.co.il> On Wednesday 07 June 2006 18:26, James Lentini wrote: > On Wed, 7 Jun 2006, Jack Morgenstein wrote: > > This (bug fix) can still be included in next-week's release, if you > > think it is important (I have extracted it from the changes checked > > in at svn 7755) > > 
If you are going to make another release anyway, then I would included > it. Do you mean -- include the fix in next week's release -- or -- wait with the fix for the following release? - Jack From akpm at osdl.org Thu Jun 8 00:54:52 2006 From: akpm at osdl.org (Andrew Morton) Date: Thu, 8 Jun 2006 00:54:52 -0700 Subject: [openib-general] Re: [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <20060607200605.9003.25830.stgit@stevo-desktop> References: <20060607200600.9003.56328.stgit@stevo-desktop> <20060607200605.9003.25830.stgit@stevo-desktop> Message-ID: <20060608005452.087b34db.akpm@osdl.org> On Wed, 07 Jun 2006 15:06:05 -0500 Steve Wise wrote: > > This patch provides the new files implementing the iWARP Connection > Manager. > > Review Changes: > > - sizeof -> sizeof() > > - removed printks > > - removed TT debug code > > - cleaned up lock/unlock around switch statements. > > - waitqueue -> completion for destroy path. > > ... > > +/* > + * This function is called on interrupt context. Schedule events on > + * the iwcm_wq thread to allow callback functions to downcall into > + * the CM and/or block. Events are queued to a per-CM_ID > + * work_list. If this is the first event on the work_list, the work > + * element is also queued on the iwcm_wq thread. > + * > + * Each event holds a reference on the cm_id. Until the last posted > + * event has been delivered and processed, the cm_id cannot be > + * deleted. > + */ > +static void cm_event_handler(struct iw_cm_id *cm_id, > + struct iw_cm_event *iw_event) > +{ > + struct iwcm_work *work; > + struct iwcm_id_private *cm_id_priv; > + unsigned long flags; > + > + work = kmalloc(sizeof(*work), GFP_ATOMIC); > + if (!work) > + return; This allocation _will_ fail sometimes. The driver must recover from it. Will it do so? > +EXPORT_SYMBOL(iw_cm_init_qp_attr); This file exports a ton of symbols. It's usual to provide some justifying commentary in the changelog when this happens. 
> +/* > + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#if !defined(IW_CM_PRIVATE_H) > +#define IW_CM_PRIVATE_H We normally use #ifndef here. From akpm at osdl.org Thu Jun 8 01:17:44 2006 From: akpm at osdl.org (Andrew Morton) Date: Thu, 8 Jun 2006 01:17:44 -0700 Subject: [openib-general] Re: [PATCH v2 4/7] AMSO1100 Memory Management. 
In-Reply-To: <20060607200655.9259.90768.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> <20060607200655.9259.90768.stgit@stevo-desktop> Message-ID: <20060608011744.1a66e85a.akpm@osdl.org> On Wed, 07 Jun 2006 15:06:55 -0500 Steve Wise wrote: > > +void c2_free(struct c2_alloc *alloc, u32 obj) > +{ > + spin_lock(&alloc->lock); > + clear_bit(obj, alloc->table); > + spin_unlock(&alloc->lock); > +} The spinlock is unneeded here. What does all the code in this file do, anyway? It looks totally generic (and hence inappropriate for drivers/infiniband/hw/amso1100/) and somewhat similar to idr trees, perhaps. > +int c2_array_set(struct c2_array *array, int index, void *value) > +{ > + int p = (index * sizeof(void *)) >> PAGE_SHIFT; > + > + /* Allocate with GFP_ATOMIC because we'll be called with locks held. */ > + if (!array->page_list[p].page) > + array->page_list[p].page = > + (void **) get_zeroed_page(GFP_ATOMIC); > + > + if (!array->page_list[p].page) > + return -ENOMEM; This _will_ happen under load. What will the result of that be, in the context of this driver? This function is incorrectly designed - it should receive a gfp_t argument. Because you don't *know* that the caller will always hold a spinlock. And GFP_KERNEL is far, far stronger than GFP_ATOMIC. > +static int c2_alloc_mqsp_chunk(gfp_t gfp_mask, struct sp_chunk **head) > +{ > + int i; > + struct sp_chunk *new_head; > + > + new_head = (struct sp_chunk *) __get_free_page(gfp_mask | GFP_DMA); Why is __GFP_DMA in there? Unless you've cornered the ISA bus infiniband market, it's likely to be wrong. From eitan at mellanox.co.il Thu Jun 8 01:40:30 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 08 Jun 2006 11:40:30 +0300 Subject: [openib-general] [PATCH] osm: fix mlx vendor rmpp sender fail to send zero size RMPP Message-ID: <86hd2wkpkx.fsf@mtl066.yok.mtl.com> Hi Hal Run into this by chance.
Some changes introduced lately to the SA queries now sends zero size RMPP (single segment with only headers). It used to send them as non-RMPP responses. Anyway, this broke the mlx vendor code that I use for simulation. This patch resolves this new problem. Eitan Signed-off-by: Eitan Zahavi Index: libvendor/osm_vendor_mlx_sar.c =================================================================== --- libvendor/osm_vendor_mlx_sar.c (revision 7703) +++ libvendor/osm_vendor_mlx_sar.c (working copy) @@ -91,7 +91,7 @@ osmv_rmpp_sar_get_mad_seg( num_segs++; } - if ( seg_idx > num_segs) + if ( (seg_idx > num_segs) && (seg_idx != 1) ) { return IB_NOT_FOUND; } @@ -102,18 +102,14 @@ osmv_rmpp_sar_get_mad_seg( /* attach header */ memcpy(p_buf,p_sar->p_arbt_mad,p_sar->hdr_sz); - /* fill data */ p_seg = (char*)p_sar->p_arbt_mad + p_sar->hdr_sz + ((seg_idx-1) * p_sar->data_sz); sz_left = p_sar->data_len - ((seg_idx -1) * p_sar->data_sz); if (sz_left > p_sar->data_sz) - { memcpy((char*)p_buf+p_sar->hdr_sz,(char*)p_seg,p_sar->data_sz); - } else memcpy((char*)p_buf+ p_sar->hdr_sz, (char*)p_seg, sz_left); - return IB_SUCCESS; } From tziporet at mellanox.co.il Thu Jun 8 01:53:05 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 8 Jun 2006 11:53:05 +0300 Subject: [openib-general] OFED-1.0-rc6 is available Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7173@mtlexch01.mtl.com> Roland did the fix on Trunk and I took it to OFED 1.0 branch. Tziporet -----Original Message----- From: Ramachandra K [mailto:rkuchimanchi at silverstorm.com] Sent: Wednesday, June 07, 2006 8:28 PM To: Tziporet Koren Cc: openfabrics-ewg at openib.org; openib-general; Ramachandra K Subject: Re: [openib-general] OFED-1.0-rc6 is available Tziporet Koren wrote: > Hi All, > > We have prepared OFED 1.0 RC6. 
> From the openib source tar ball in OFED RC6, it looks like the SRP kernel changes (ulp/srp/ib_srp.c) in the trunk for supporting Rev 10 targets have been included in RC6, but the corresponding changes to the userspace srptool--ibsrpdm (userspace/srptools/src/srp-dm.c) for displaying the IO class of the target have not been made part of RC6. The changes to ibsrpdm were committed to the SVN repository trunk in revision number 7758. Will the latest version of ibsrpdm make it to the next OFED release ? Regards, Ram From halr at voltaire.com Thu Jun 8 03:41:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Jun 2006 06:41:58 -0400 Subject: [openib-general] Re: [PATCH] osm: fix mlx vendor rmpp sender fail to send zero size RMPP In-Reply-To: <86hd2wkpkx.fsf@mtl066.yok.mtl.com> References: <86hd2wkpkx.fsf@mtl066.yok.mtl.com> Message-ID: <1149763300.4510.319639.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-06-08 at 04:40, Eitan Zahavi wrote: > Hi Hal > > Run into this by chance. Some changes introduced lately to the SA queries > now sends zero size RMPP (single segment with only headers). It used to send > them as non-RMPP responses. Not sure what that change was. > Anyway, this broke the mlx vendor code that I use > for simulation. > > This patch resolves this new problem. Thanks. Applied to trunk only. Any idea of OFED RC6 has this issue ? -- Hal > Eitan From eitan at mellanox.co.il Thu Jun 8 04:08:53 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 8 Jun 2006 14:08:53 +0300 Subject: [openib-general] RE: [PATCH] osm: fix mlx vendor rmpp sender fail to send zero sizeRMPP Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236880E@mtlexch01.mtl.com> This does not have to get into OFED. I did not see these failures there. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, June 08, 2006 1:42 PM > To: Eitan Zahavi > Cc: OPENIB > Subject: Re: [PATCH] osm: fix mlx vendor rmpp sender fail to send zero sizeRMPP > > Hi Eitan, > > On Thu, 2006-06-08 at 04:40, Eitan Zahavi wrote: > > Hi Hal > > > > Run into this by chance. Some changes introduced lately to the SA queries > > now sends zero size RMPP (single segment with only headers). It used to send > > them as non-RMPP responses. > > Not sure what that change was. > > > Anyway, this broke the mlx vendor code that I use > > for simulation. > > > > This patch resolves this new problem. > > Thanks. Applied to trunk only. Any idea of OFED RC6 has this issue ? > > -- Hal > > > Eitan From halr at voltaire.com Thu Jun 8 04:03:09 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Jun 2006 07:03:09 -0400 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: References: Message-ID: <1149764555.4510.320252.camel@hal.voltaire.com> On Wed, 2006-06-07 at 22:48, Sean Hefty wrote: > >I might be missing your point but UD is unreliable so the sends can be > >dropped. The delay/retry is to make sure the join does occur, > > This is different than a dropped request or reply. In this case, the receiver > gets a reply, but it will be a failure from the SA to join the group. By receiver, I think you are referring to SA requester. Yes, the SA would reject the request with a status ERR_REQ_INSUFFICIENT_COMPONENTS. > For example, a NonMember tries to re-join before a FullMember which would have > created the group does. The result is that requests that receive a reply also > need to be retried, with the timeout dependent on some remote node in the fabric > creating the group. and it is unknown when such a multicast registration (to create the group) would occur. So the proper timeout is unknown. 
That's why IPoIB has a couple of different strategies for handling this depending on the JoinState, > >> So, the only safe thing to do is for all multicast clients to detach from all > >> multicast groups, destroy all address handles, > > > >Why all groups ? > > Because the SM has lost track that any groups in the fabric existed, so those > groups must be recreated, all potentially with different mlids. Yes, in the case of client reregister. > >> possibly wait for a new group to be created, and then start all over again. > > > >Start what all over again ? > > I meant attach the QP to the new group and allocate a new address handle. Couldn't it modify the old one as an alternative strategy ? > This is a general comment, and not directed at anyone specific, Don't worry. I'm not taking it personally. Just want to give you my $0.02 worth on what I think you are saying below: > but is this > really the architecture and implementation that we want to aim for? I really > think that we need to look at solutions that don't break existing communication, > unless the links providing that communication actually go down, even if this > means extending the architecture. If this comment is directed at client reregister mechanism, you should note that when this was brought up there was resistance to it based on the recommendation (probably not a strong enough word for this) that SMs be redundant in the subnet. There was a fair bit of anecdotal evidence that this was not how they were being used at the time but it may have been a chicken and egg problem. -- Hal > - Sean From eitan at mellanox.co.il Thu Jun 8 04:24:03 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 08 Jun 2006 14:24:03 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query Message-ID: <86fyiflwks.fsf@mtl066.yok.mtl.com> Hi Hal I'm working on passing osmtest check. 
Found a bug in the new GUIDInfoRecord query: If you had a physical port with zero guid_cap the code would loop on blocks 0..255 instead of trying the next port. I am still looking for why we might have a guid_cap == 0 on some ports. This patch resolves this new problem. osmtest passes on some arbitrary networks. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_sa_guidinfo_record.c =================================================================== --- opensm/osm_sa_guidinfo_record.c (revision 7703) +++ opensm/osm_sa_guidinfo_record.c (working copy) @@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( continue; p_pi = osm_physp_get_port_info_ptr( p_physp ); + + if ( p_pi->guid_cap == 0 ) + continue; + num_blocks = p_pi->guid_cap / 8; if ( p_pi->guid_cap % 8 ) num_blocks++; From cganapathi at novell.com Thu Jun 8 05:12:14 2006 From: cganapathi at novell.com (CH Ganapathi) Date: Thu, 08 Jun 2006 06:12:14 -0600 Subject: [openib-general] [PATCH] ib_uverbs_get_context does not unlock file->mutex in error path Message-ID: <44886176.6C2D.007B.0@novell.com> Hi, If ibdev->alloc_ucontext(ibdev, &udata) fails then ib_uverbs_get_context does not unlock file->mutex before returning error. Thanks, Ganapathi Novell Inc. 
Signed-off by: Ganapathi CH Index: linux-kernel/infiniband/core/uverbs_cmd.c =================================================================== --- infiniband/core/uverbs_cmd.c 2006-06-08 11:52:29.000000000 +0530 +++ infiniband-fix/core/uverbs_cmd.c 2006-06-08 17:16:10.000000000 +0530 @@ -80,8 +80,10 @@ ssize_t ib_uverbs_get_context(struct ib_ in_len - sizeof cmd, out_len - sizeof resp); ucontext = ibdev->alloc_ucontext(ibdev, &udata); - if (IS_ERR(ucontext)) - return PTR_ERR(file->ucontext); + if (IS_ERR(ucontext)) { + ret = PTR_ERR(file->ucontext); + goto err; + } ucontext->device = ibdev; INIT_LIST_HEAD(&ucontext->pd_list); From halr at voltaire.com Thu Jun 8 05:54:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Jun 2006 08:54:06 -0400 Subject: [openib-general] Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <86fyiflwks.fsf@mtl066.yok.mtl.com> References: <86fyiflwks.fsf@mtl066.yok.mtl.com> Message-ID: <1149771197.4510.323092.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > Hi Hal > > I'm working on passing osmtest check. Found a bug in the new > GUIDInfoRecord query: If you had a physical port with zero guid_cap > the code would loop on blocks 0..255 instead of trying the next port. OK; that's definitely a problem. > I am still looking for why we might have a guid_cap == 0 on some > ports. PortInfo:GuidCap is not used for switch external ports. > This patch resolves this new problem. osmtest passes on some arbitrary > networks. 
>
> Eitan
>
> Signed-off-by: Eitan Zahavi
>
> Index: opensm/osm_sa_guidinfo_record.c
> ===================================================================
> --- opensm/osm_sa_guidinfo_record.c	(revision 7703)
> +++ opensm/osm_sa_guidinfo_record.c	(working copy)
> @@ -255,6 +255,10 @@ __osm_sa_gir_create_gir(
>         continue;
>
>       p_pi = osm_physp_get_port_info_ptr( p_physp );
> +
> +     if ( p_pi->guid_cap == 0 )
> +       continue;
> +

I think the right fix is to detect switch external ports and use the GuidCap from port 0 rather than from the switch external port (unless that concept is broken, in which case it should return 0 records).

-- Hal

>       num_blocks = p_pi->guid_cap / 8;
>       if ( p_pi->guid_cap % 8 )
>         num_blocks++;
>

From jlentini at netapp.com Thu Jun 8 07:23:08 2006
From: jlentini at netapp.com (James Lentini)
Date: Thu, 8 Jun 2006 10:23:08 -0400 (EDT)
Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS
In-Reply-To: <200606080942.48767.jackm@mellanox.co.il>
References: <200606071639.03787.jackm@mellanox.co.il> <200606080942.48767.jackm@mellanox.co.il>
Message-ID: 

On Thu, 8 Jun 2006, Jack Morgenstein wrote:

> On Wednesday 07 June 2006 18:26, James Lentini wrote:
> > On Wed, 7 Jun 2006, Jack Morgenstein wrote:
> > > This (bug fix) can still be included in next-week's release, if you
> > > think it is important (I have extracted it from the changes checked
> > > in at svn 7755)
> >
> > If you are going to make another release anyway, then I would include
> > it.
>
> Do you mean -- include the fix in next week's release -- or -- wait
> with the fix for the following release?

I'd include the fix in the next release, but I wouldn't create a special release just for this fix.
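An editorial aside on the num_blocks arithmetic quoted in the osm patch above: it is a ceiling division of guid_cap by the 8-GUID block size, and a guid_cap of 0 yields 0 blocks, which is why the patch simply skips such ports. A minimal standalone sketch (guid_blocks is a hypothetical helper name for illustration, not OpenSM code):

```c
#include <assert.h>

/* Ceiling division of guid_cap by the 8-GUID block size, mirroring the
 * num_blocks computation in osm_sa_guidinfo_record.c quoted above.
 * guid_blocks() is a hypothetical name; it is not part of OpenSM. */
static unsigned guid_blocks(unsigned guid_cap)
{
	unsigned num_blocks = guid_cap / 8;

	if (guid_cap % 8)
		num_blocks++;
	return num_blocks;
}
```

With guid_cap == 0 this returns 0 blocks, matching the patch's decision to skip such ports entirely.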
From rdreier at cisco.com Thu Jun 8 09:21:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 09:21:58 -0700 Subject: [openib-general] OFED-1.0-rc6 is available In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7162@mtlexch01.mtl.com> (Tziporet Koren's message of "Thu, 8 Jun 2006 09:25:55 +0300") References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7162@mtlexch01.mtl.com> Message-ID: Thanks... one further fix for Cisco gateways: sometimes the IsDM bit is set on switch ports as well, so ibsrpdm should not be limited to just CA ports. Here's the patch, also on the trunk as r7836. --- srptools/ChangeLog (revision 7803) +++ srptools/ChangeLog (working copy) @@ -1,3 +1,10 @@ +2006-06-08 Roland Dreier + + * src/srp-dm.c (get_port_list): In some setups (eg Cisco SFS 3001 + with an FC gateway), there will be switches with the IsDM bit set + on port 0. So the initial get of NodeRecords must retrieve all + records, not just CA ports. + 2006-06-07 Roland Dreier * src/srp-dm.c (do_port): Use correct endianness when comparing GUID against Topspin OUI. 
--- srptools/src/srp-dm.c (revision 7803) +++ srptools/src/srp-dm.c (working copy) @@ -523,11 +523,9 @@ static int get_port_list(int fd, uint32_ out_sa_mad->mgmt_class = SRP_MGMT_CLASS_SA; out_sa_mad->method = SRP_SA_METHOD_GET_TABLE; out_sa_mad->class_version = 2; - out_sa_mad->comp_mask = htonll(1ul << 4); /* node type */ + out_sa_mad->comp_mask = 0; /* Get all end ports */ out_sa_mad->rmpp_version = 1; out_sa_mad->rmpp_type = 1; - node = (void *) out_sa_mad->data; - node->type = 1; /* CA */ len = send_and_get(fd, &out_mad, in_mad, node_table_response_size); if (len < 0) From rdreier at cisco.com Thu Jun 8 09:28:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 09:28:12 -0700 Subject: [openib-general] Re: [PATCH] ib_uverbs_get_context does not unlock file->mutex in error path In-Reply-To: <44886176.6C2D.007B.0@novell.com> (CH Ganapathi's message of "Thu, 08 Jun 2006 06:12:14 -0600") References: <44886176.6C2D.007B.0@novell.com> Message-ID: Good catch. Applied. From sean.hefty at intel.com Thu Jun 8 09:49:35 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 Jun 2006 09:49:35 -0700 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: <1149764555.4510.320252.camel@hal.voltaire.com> Message-ID: >If this comment is directed at client reregister mechanism, you should >note that when this was brought up there was resistance to it based on >the recommendation (probably not a strong enough word for this) that SMs >be redundant in the subnet. There was a fair bit of anecdotal evidence >that this was not how they were being used at the time but it may have >been a chicken and egg problem. Even with redundant SMs, we wouldn't want them to reassign all of the LIDs in the subnet just because of failover. I don't think of MLIDs as being any different. Client reregister support is optional, so what if the node(s) that need to re-create the group doesn't support it? 
What if we started with something like the following compliance statement, and tried to add this to the spec?

   An SM, upon becoming the master, shall respect all existing communication in the fabric, where possible.

- Sean

From bpradip at in.ibm.com Thu Jun 8 10:42:03 2006
From: bpradip at in.ibm.com (Pradipta Kumar Banerjee)
Date: Thu, 08 Jun 2006 23:12:03 +0530
Subject: [openib-general] [ANNOUNCE] New iWARP Branch
In-Reply-To: 
References: 
Message-ID: <4488616B.7030701@in.ibm.com>

Sundeep Narravula wrote:
> Hi,
>
>> I don't see this problem at all. I am using kernel 2.6.16.16, SLES 9 glibc
>> version 2.3.3-98, gcc version 3.3.3 and AMSO1100 RNIC.
>
> The versions I used are glibc 2.3.4, kernel 2.6.16 and gcc 3.4.3 and
> AMSO1100 RNIC.
>
>> Will running it under gdb be of some help ?
>
> I am able to reproduce this error with/without gdb. The glibc error
> disappears with higher number of iterations.
>
> (gdb) r -c -vV -C10 -S10 -a 150.111.111.100 -p 9999

The problem is due to specifying an insufficient size (-S10, -S4) for the buffer. If you look at the following lines from the function rping_test_client in rping.c:

	for (ping = 0; !cb->count || ping < cb->count; ping++) {
		cb->state = RDMA_READ_ADV;

		/* Put some ascii text in the buffer. */
------>		cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping);

From the above it's clear that the minimum size for start_buf should be at least sufficient to hold the string, which in the invocations mentioned here (-S10 or -S4) is not the case. Hence you notice the glibc errors.
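As a sketch of the sizing argument above: rather than hand-counting digits, one can let snprintf measure the worst-case length of the ping string (rping_min_bufsize is a hypothetical helper, not part of rping.c; it assumes ping is a non-negative int as in the loop quoted above):

```c
#include <limits.h>
#include <stdio.h>
#include <assert.h>

/* Smallest buffer that can hold sprintf(buf, "rdma-ping-%d: ", ping)
 * for any non-negative int ping, including the terminating NUL.
 * snprintf with a NULL buffer and size 0 only measures; it writes
 * nothing.  Hypothetical helper name, not part of rping.c. */
static int rping_min_bufsize(void)
{
	return snprintf(NULL, 0, "rdma-ping-%d: ", INT_MAX) + 1;
}
```

For 32-bit int this yields 23 ("rdma-ping-" is 10 characters, INT_MAX is 10 digits, ": " is 2, plus the NUL), so -S10 and -S4 are indeed too small.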
cb->start_buf is allocated in rping_setup_buffers() as cb->start_buf = malloc(cb->size); Basically the check if ((cb->size < 1) || (cb->size > (RPING_BUFSIZE - 1))) { in the main() should be changed to something like this #define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX)) + sizeof("rdma-ping-%d: ") ---> 'ping' is defined as a signed int, its maximum permissible value is defined in limits.h (INT_MAX = 2147483647) We can even hardcode the RPING_MIN_BUFSIZE to '19' if desired/ if ((cb->size < RPING_MIN_BUFSIZE) || (cb->size > (RPING_BUFSIZE - 1))) { Steve what do you say ?? Thanks, Pradipta Kumar. > Starting program: /usr/local/bin/rping -c -vV -C10 -S10 -a 150.111.111.100 > -p 9999 > Reading symbols from shared object read from target memory...done. > Loaded system supplied DSO at 0xffffe000 > [Thread debugging using libthread_db enabled] > [New Thread -1208465728 (LWP 23960)] > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > driver search path: /usr/local/lib/infiniband > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > driver search path: /usr/local/lib/infiniband > [New Thread -1208468560 (LWP 23963)] > [New Thread -1216861264 (LWP 23964)] > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > cq completion failed status 5 > DISCONNECT EVENT... > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > > Program received signal SIGABRT, Aborted. > [Switching to Thread -1208465728 (LWP 23960)] > 0xffffe410 in __kernel_vsyscall () > (gdb) > > --Sundeep. > >> Thanks >> Pradipta Kumar. >>>> Thanx, >>>> >>>> >>>> Steve. >>>> >>>> >>>> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: >>>>> Hi Steve, >>>>> We are trying the new iwarp branch on ammasso adapters. The installation >>>>> has gone fine. 
However, on running rping there is a error during >>>>> disconnect phase. >>>>> >>>>> $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 >>>>> libibverbs: Warning: no userspace device-specific driver found for uverbs1 >>>>> driver search path: /usr/local/lib/infiniband >>>>> libibverbs: Warning: no userspace device-specific driver found for uverbs0 >>>>> driver search path: /usr/local/lib/infiniband >>>>> ping data: rdm >>>>> ping data: rdm >>>>> ping data: rdm >>>>> ping data: rdm >>>>> cq completion failed status 5 >>>>> DISCONNECT EVENT... >>>>> *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** >>>>> Aborted >>>>> >>>>> There are no apparent errors showing up in dmesg. Is this error >>>>> currently expected? >>>>> >>>>> Thanks, >>>>> --Sundeep. >>>>> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From mshefty at ichips.intel.com Thu Jun 8 11:07:03 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 08 Jun 2006 11:07:03 -0700 Subject: [openib-general] [PATCH 0/4] Add support for UD QPs In-Reply-To: References: Message-ID: <44886747.4040004@ichips.intel.com> > The following patch series adds support for UD QPs to userspace through the RDMA > CM. UD QPs are referenced by an IP address, UDP port number. The RDMA CM > abstracts SIDR for Infiniband clients. Roland, Do you see any issues with this patch series or the related userspace changes? There's a small change to uverbs, and new APIs added to libibverbs. 
- Sean From rdreier at cisco.com Thu Jun 8 11:12:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 11:12:53 -0700 Subject: [openib-general] [PATCH 0/4] Add support for UD QPs In-Reply-To: <44886747.4040004@ichips.intel.com> (Sean Hefty's message of "Thu, 08 Jun 2006 11:07:03 -0700") References: <44886747.4040004@ichips.intel.com> Message-ID: Sean> Do you see any issues with this patch series or the related Sean> userspace changes? There's a small change to uverbs, and new Sean> APIs added to libibverbs. I haven't looked too carefully yet. What's the motivation? It seems strange to put an IB-only transport into the RDMA CM -- iWARP can't handle datagrams, can it? - R. From greg.lindahl at qlogic.com Thu Jun 8 11:18:09 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 8 Jun 2006 11:18:09 -0700 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: References: <1149764555.4510.320252.camel@hal.voltaire.com> Message-ID: <20060608181809.GI1359@greglaptop.internal.keyresearch.com> On Thu, Jun 08, 2006 at 09:49:35AM -0700, Sean Hefty wrote: > What if we started with something like the following compliance statement, and > tried to add this to the spec? > > An SM, upon becoming the master, shall respect all existing communication in the > fabric, where possible. Isn't this a quality of implementation issue? It's hard to imagine a SM author not realizing this is a good thing to do. If it was in the standard, how would you test it for compliance? -- g From mshefty at ichips.intel.com Thu Jun 8 11:33:10 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 08 Jun 2006 11:33:10 -0700 Subject: [openib-general] [PATCH 0/4] Add support for UD QPs In-Reply-To: References: <44886747.4040004@ichips.intel.com> Message-ID: <44886D66.7000703@ichips.intel.com> Roland Dreier wrote: > I haven't looked too carefully yet. > > What's the motivation? 
> It seems strange to put an IB-only transport
> into the RDMA CM -- iWARP can't handle datagrams, can it?

This allows using the address translation to locate the remote service. The RDMA CM also provides an IP-based interface for IB. From a user's perspective, this extends the RDMA CM to include the UDP port space, in addition to TCP.

- Sean

From mshefty at ichips.intel.com Thu Jun 8 11:43:24 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 08 Jun 2006 11:43:24 -0700
Subject: [openib-general] RE: Failed multicast join withnew multicast module
In-Reply-To: <20060608181809.GI1359@greglaptop.internal.keyresearch.com>
References: <1149764555.4510.320252.camel@hal.voltaire.com> <20060608181809.GI1359@greglaptop.internal.keyresearch.com>
Message-ID: <44886FCC.3040108@ichips.intel.com>

Greg Lindahl wrote:
> Isn't this a quality of implementation issue? It's hard to imagine a
> SM author not realizing this is a good thing to do.

I don't know if any SM implementation actually does this today. I think that they all break all multicast groups.

> If it was in the standard, how would you test it for compliance?

Stopping / restarting the SM and seeing if any existing RC, UD, MCast communication breaks could be an easy first test.

- Sean

From mst at mellanox.co.il Thu Jun 8 11:57:52 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 8 Jun 2006 21:57:52 +0300
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: 
References: 
Message-ID: <20060608185752.GA9039@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: RFC: ib_cache_event problems
>
> Michael> But ipoib_ib_dev_flush doesn't?
>
> Ah, that looks like the bug I guess. What's the situation? SM clears
> P_Key table and then later re-adds a P_Key?

Any ideas on how to fix this?

-- MST

From mst at mellanox.co.il Thu Jun 8 12:03:54 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 8 Jun 2006 22:03:54 +0300 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: References: Message-ID: <20060608190354.GB9039@mellanox.co.il> Quoting r. Sean Hefty : > What if we started with something like the following compliance statement, and > tried to add this to the spec? > > An SM, upon becoming the master, shall respect all existing communication in > the fabric, where possible. To me, "where possible" doesn't sound like an appropriate language for a compliance statement. Is there precedent for this in IB spec? -- MST From sean.hefty at intel.com Thu Jun 8 12:06:13 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 Jun 2006 12:06:13 -0700 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: <20060608190354.GB9039@mellanox.co.il> Message-ID: >> An SM, upon becoming the master, shall respect all existing communication in >> the fabric, where possible. > >To me, "where possible" doesn't sound like an appropriate language for a >compliance statement. Is there precedent for this in IB spec? I was trying to express a concept, not formulate exact wording here... From rdreier at cisco.com Thu Jun 8 12:15:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 12:15:44 -0700 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: <20060608185752.GA9039@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 8 Jun 2006 21:57:52 +0300") References: <20060608185752.GA9039@mellanox.co.il> Message-ID: > > Ah, that looks like the bug I guess. What's the situation? SM clears > > P_Key table and then later readds a P_Key? > Any ideas on how to fix this? Does it work to just start the pkey_task if ipoib_ib_dev_flush() wants for a P_Key that's not there? Or is it trickier? - R. From mst at mellanox.co.il Thu Jun 8 12:28:48 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 8 Jun 2006 22:28:48 +0300
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: 
References: 
Message-ID: <20060608192848.GC9039@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: RFC: ib_cache_event problems
>
> > > Ah, that looks like the bug I guess. What's the situation? SM clears
> > > P_Key table and then later re-adds a P_Key?
> > Any ideas on how to fix this?
>
> Does it work to just start the pkey_task if ipoib_ib_dev_flush() wants
> for a P_Key that's not there? Or is it trickier?

If this works, why is dev_up playing with pkey_check_presence at all? Can we kill all of this then?

	ipoib_pkey_dev_check_presence(dev);

	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
		ipoib_dbg(priv, "PKEY is not assigned.\n");
		return 0;
	}

It seems we must avoid joining multicast groups while the key isn't assigned ...

-- MST

From bugzilla-daemon at openib.org Thu Jun 8 12:41:31 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 8 Jun 2006 12:41:31 -0700 (PDT)
Subject: [openib-general] [Bug 122] New: mad layer problem
Message-ID: <20060608194131.B3EEC2283E0@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=122

Summary: mad layer problem
Product: OpenFabrics Linux
Version: gen2
Platform: All
OS/Version: Other
Status: NEW
Severity: blocker
Priority: P2
Component: IB Core
AssignedTo: sean.hefty at intel.com
ReportedBy: eli at mellanox.co.il
CC: bugzilla at openib.org

We were running polygraph http://freshmeat.net/projects/polygraph/ over ipoib and at some time ipoib connectivity was lost. When looking at the state of the machines (two machines connected through a switch) I noticed that on one of the machines, I could not run any program that uses mads. Specifically I tried sminfo and then opensm; both got stuck.
I assume what happened is that at some point the kernel refreshed its arp cache, and since there was already a problem sending mads, the kernel could not resolve the address, so ipoib connectivity was lost.

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From bugzilla-daemon at openib.org Thu Jun 8 12:44:35 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 8 Jun 2006 12:44:35 -0700 (PDT)
Subject: [openib-general] [Bug 122] mad layer problem
Message-ID: <20060608194435.7BDE82283E0@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=122

------- Comment #1 from rolandd at cisco.com 2006-06-08 12:44 -------
To debug this we probably need to know where sminfo and/or opensm were getting stuck. sysrq-T output for the stuck processes would probably be the most helpful.

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From bugzilla-daemon at openib.org Thu Jun 8 12:51:57 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 8 Jun 2006 12:51:57 -0700 (PDT)
Subject: [openib-general] [Bug 122] mad layer problem
Message-ID: <20060608195157.7FF482283E0@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=122

------- Comment #2 from sean.hefty at intel.com 2006-06-08 12:51 -------
I'm not aware of any relationship between ARP and MADs. I'd like to verify that this is indeed a MAD layer issue, and not a problem in the user-to-kernel interface or the lower-level driver. After the hang, were any applications able to run? Did you try running any kernel tests, like grmpp or cmatose? Loading madeye after connectivity is lost could also be helpful. How easily is this reproduced?

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
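Before the mthca race thread that follows, a standalone illustration of the table aliasing it debates: the qp_table lookup masks the QPN with the table size, so two QPNs that differ only in their high bits land in the same slot; the check MST proposes compares the stored QPN against the CQE's QPN to reject a stale hit. NUM_QPS and both helper names below are hypothetical simplifications, not mthca code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical power-of-two table size; mthca uses dev->limits.num_qps. */
#define NUM_QPS (1u << 16)

/* Slot index: low bits of the QPN, as in the mthca_array_get() call
 * discussed in the thread. */
static unsigned qp_slot(uint32_t qpn)
{
	return qpn & (NUM_QPS - 1);
}

/* Stale-entry check along the lines of the proposed fix: a table hit is
 * trustworthy only if the stored QPN matches the CQE's QPN exactly. */
static int qp_matches(uint32_t stored_qpn, uint32_t cqe_qpn)
{
	return stored_qpn == cqe_qpn;
}
```

Two QPNs such as 0x1ABCD and 0x2ABCD share a slot under the mask but fail the exact comparison, which is the whole point of the proposed test.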
From mst at mellanox.co.il Thu Jun 8 13:26:35 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 8 Jun 2006 23:26:35 +0300
Subject: [openib-general] race in mthca_cq.c?
Message-ID: <20060608202635.GA9877@mellanox.co.il>

Roland, I think I see a race in mthca: let's assume that a QP is destroyed. We remove the qpn from qp_table.

Before we have the chance to clean up the CQ, another QP is created and put in the same slot in the table. If the user now polls the CQ he'll see a completion for a wrong QP, since poll CQ does:

	*cur_qp = mthca_array_get(&dev->qp_table.qp,
				  be32_to_cpu(cqe->my_qpn) &
				  (dev->limits.num_qps - 1));

Is this analysis right? If yes, I think we can fix this by testing (*cur_qp)->qpn == be32_to_cpu(cqe->my_qpn), does this make sense?

Same for userspace I guess?

It seems a similar issue exists for CQs, does it not? And I think it can be solved in a similar way, checking the CQN?

-- MST

From rdreier at cisco.com Thu Jun 8 13:43:24 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 Jun 2006 13:43:24 -0700
Subject: [openib-general] Re: race in mthca_cq.c?
In-Reply-To: <20060608202635.GA9877@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 8 Jun 2006 23:26:35 +0300")
References: <20060608202635.GA9877@mellanox.co.il>
Message-ID: 

> Roland, I think I see a race in mthca: let's assume that
> a QP is destroyed. We remove the qpn from qp_table.
>
> Before we have the chance to clean up the CQ, another QP is created
> and put in the same slot in the table. If the user now polls the CQ he'll see a
> completion for a wrong QP, since poll CQ does:
>
> *cur_qp = mthca_array_get(&dev->qp_table.qp,
> 			  be32_to_cpu(cqe->my_qpn) &
> 			  (dev->limits.num_qps - 1));
>
> Is this analysis right?

I don't think so. There's no way for another QP to be assigned the same number, since the mthca_free() to clear out the QPN bitmap doesn't happen until after the CQs are cleaned up.

> It seems a similar issue exists for CQs, does it not?
> And I think it can be solved in a similar way, checking the CQN?

I don't see anything there either. When destroying a CQ, mthca does HW2SW_CQ and synchronize_irq() before a new CQ could be created with the same number.

- R.

From halr at voltaire.com Thu Jun 8 13:39:03 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Jun 2006 16:39:03 -0400
Subject: [openib-general] RE: Failed multicast join withnew multicast module
In-Reply-To: 
References: 
Message-ID: <1149799142.4510.13468.camel@hal.voltaire.com>

On Thu, 2006-06-08 at 12:49, Sean Hefty wrote:
> >If this comment is directed at client reregister mechanism, you should
> >note that when this was brought up there was resistance to it based on
> >the recommendation (probably not a strong enough word for this) that SMs
> >be redundant in the subnet. There was a fair bit of anecdotal evidence
> >that this was not how they were being used at the time but it may have
> >been a chicken and egg problem.
>
> Even with redundant SMs, we wouldn't want them to reassign all of the LIDs in
> the subnet just because of failover. I don't think of MLIDs as being any
> different.

Do you mean without redundant SMs (rather than with)?

There are a couple of things about MLIDs that are different:

1. There are far fewer of them (not necessarily architecturally but in some implementations).

2. Lazy deletion of MC groups is allowed, so reclamation may be difficult.

This is not to say it can't be done, but there are some hurdles to clear.

> Client reregister support is optional, so what if the node(s) that
> need to re-create the group doesn't support it?

The endport SMAs are claiming they do support client reregistration, but it takes more than that for the endport/node to behave properly.

> What if we started with something like the following compliance statement, and
> tried to add this to the spec?
>
> An SM, upon becoming the master, shall respect all existing communication
> in the fabric, where possible.
At the 50K-foot level, I can see where you are coming from and think there is merit in this, but first, I'm not sure I know how to define this, and second, I'm not sure it is achievable, though that could wait until we see whether some definition can be agreed on. I know it is a conceptual rather than actual compliance. One issue would be defining what it means to respect all existing communication. Then we would need to look at whether that was feasible or not and perhaps rescope what it means to a set of achievable things. Another issue would be defining where it is possible or not. If that is totally vendor dependent, then this would have no substance to it. It is largely a matter of being a "better" SM.

-- Hal

> - Sean

From mst at mellanox.co.il Thu Jun 8 13:48:26 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 8 Jun 2006 23:48:26 +0300
Subject: [openib-general] Re: race in mthca_cq.c?
In-Reply-To: 
References: 
Message-ID: <20060608204826.GB9957@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: race in mthca_cq.c?
>
> > Roland, I think I see a race in mthca: let's assume that
> > a QP is destroyed. We remove the qpn from qp_table.
> >
> > Before we have the chance to clean up the CQ, another QP is created
> > and put in the same slot in the table. If the user now polls the CQ he'll see a
> > completion for a wrong QP, since poll CQ does:
> >
> > *cur_qp = mthca_array_get(&dev->qp_table.qp,
> > 			    be32_to_cpu(cqe->my_qpn) &
> > 			    (dev->limits.num_qps - 1));
> >
> > Is this analysis right?
>
> I don't think so. There's no way for another QP to be assigned the
> same number, since the mthca_free() to clear out the QPN bitmap
> doesn't happen until after the CQs are cleaned up.

Not in the driver I have: mthca_array_clear is at line 1351, mthca_cq_clean at line 1372. Isn't mthca_array_clear freeing the slot in the QP table?

> > It seems a similar issue exists for CQs, does it not?
> > I don't see anything there either. When destroying a CQ, mthca does > HW2SW_CQ and synchronize_irq() before a new CQ could be created with > the same number. But there might be more EQEs for this CQN outstanding in the EQ which we have not seen yet. -- MST From sweitzen at cisco.com Thu Jun 8 13:59:33 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 8 Jun 2006 13:59:33 -0700 Subject: [openib-general] Compilation issues on rhel4 u3 ppc64 sysfs.o Message-ID: This is working for us on RHEL4 U3, thanks! Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Vladimir Sokolovsky [mailto:vlad at mellanox.co.il] > Sent: Thursday, May 25, 2006 2:49 AM > To: Scott Weitzenkamp (sweitzen) > Cc: Paul; openib-general at openib.org > Subject: Re: [openib-general] Compilation issues on rhel4 u3 > ppc64 sysfs.o > > In OFED-1.0-rc5 all binaries and libraries will be compiled on *ppc64 > *with *-m64* flag. > This requires sysfsutils and sysfsutils-devel 64-bit RPM to > be installed > (in order to build libibverbs). > Also pciutils and pciutils-devel 64-bit required for tvflash package. > > libsdp will be built both 32 and 64 bit libraries. > > Note: in order to build sysfsutils 64-bit RPM run: > CC="gcc -m64" rpmbuild --rebuild > sysfsutils-1.3.0-1.2.1.src.rpm > (This was tested on Fedora C4 PPC64) > > Regards, > Vladimir > > Scott Weitzenkamp (sweitzen) wrote: > > I know Vlad made some changes for rc5 in this area, at least for > > libsdp, not sure if other libs got changed as well. 
> > > > Scott Weitzenkamp > > SQA and Release Manager > > Server Virtualization Business Unit > > Cisco Systems > > > > > > > -------------------------------------------------------------- > ---------- > > *From:* Paul [mailto:paul.lundin at gmail.com] > > *Sent:* Wednesday, May 24, 2006 11:00 AM > > *To:* Scott Weitzenkamp (sweitzen) > > *Cc:* openib-general at openib.org > > *Subject:* Re: [openib-general] Compilation issues on rhel4 u3 > > ppc64 sysfs.o > > > > Scott, > > Upon further inspection the build.sh and > install.sh scripts > > built 32bit libraries and binaries. If I export CFLAGS (and the > > like) to include -m64 then the build dies while looking for a > > 64bit libsysfs. rhel4 u3 does not include a ppc64 > sysfsutils, nor > > have I been able to find an actual 64bit version of it. > Is there a > > workaround for getting things to build actual ppc64 > > binaries/libraries ? > > > > The actual error is: > > checking for dlsym in -ldl... yes > > checking for pthread_mutex_init in -lpthread... yes > > checking for sysfs_open_class in -lsysfs... no > > configure: error: sysfs_open_class() not found. libibverbs > > requires libsysfs. > > > From mst at mellanox.co.il Thu Jun 8 14:11:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 00:11:33 +0300 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: <20060608202635.GA9877@mellanox.co.il> References: <20060608202635.GA9877@mellanox.co.il> Message-ID: <20060608211133.GA10263@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: race in mthca_cq.c? > > Roland, I think I see a race in mthca: let's assume that > a QP is destroyed. We remove the qpn from qp_table. > > Before we have the chance to cleanup the CQ, another QP is created > and put in the same slot in table. 
> If the user now polls the CQ he'll see a
> completion for a wrong QP, since poll CQ does:
>
> *cur_qp = mthca_array_get(&dev->qp_table.qp,
> 			  be32_to_cpu(cqe->my_qpn) &
> 			  (dev->limits.num_qps - 1));
>
> Is this analysis right?
> If yes, I think we can fix this by testing (*cur_qp)->qpn ==
> be32_to_cpu(cqe->my_qpn), does this make sense?
>
> Same for userspace I guess?
>
> It seems a similar issue exists for CQs, does it not?
> And I think it can be solved in a similar way, checking the CQN?

The following seems to work. How does it look?

---

Make sure a completion/completion event is not for a stale QP/CQ before reporting it to the user.

Signed-off-by: Michael S. Tsirkin

--- openib/drivers/infiniband/hw/mthca/mthca_cq.c	2006-05-09 21:07:28.623383000 +0300
+++ /mswg/work/mst/tmp/infiniband1/hw/mthca/mthca_cq.c	2006-06-08 23:46:52.404499000 +0300
@@ -217,9 +217,9 @@ void mthca_cq_completion(struct mthca_de
 {
 	struct mthca_cq *cq;
 
 	cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1));
 
-	if (!cq) {
+	if (!cq || cq->cqn != cqn) {
 		mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn);
 		return;
 	}
@@ -513,10 +515,10 @@ static inline int mthca_poll_one(struct
 	 * because CQs will be locked while QPs are removed
 	 * from the table.
 	 */
 	*cur_qp = mthca_array_get(&dev->qp_table.qp,
 				  be32_to_cpu(cqe->my_qpn) &
 				  (dev->limits.num_qps - 1));
 
-	if (!*cur_qp) {
+	if (!*cur_qp || (*cur_qp)->qpn != be32_to_cpu(cqe->my_qpn)) {
 		mthca_warn(dev, "CQ entry for unknown QP %06x\n",
 			   be32_to_cpu(cqe->my_qpn) & 0xffffff);
 		err = -EINVAL;

-- MST

From rdreier at cisco.com Thu Jun 8 14:19:46 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 Jun 2006 14:19:46 -0700
Subject: [openib-general] Re: race in mthca_cq.c?
In-Reply-To: <20060608211133.GA10263@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 9 Jun 2006 00:11:33 +0300")
References: <20060608202635.GA9877@mellanox.co.il> <20060608211133.GA10263@mellanox.co.il>
Message-ID: 
How does it look? I don't think it's needed, and anyway I don't see how it fixes things. The problem only happens when the new CQ or QP has the same number as an old CQ/QP, so the test of cq->cqn == cqn might still pass even if the cq has changed (there's no guarantee the upper bits won't repeat -- or someone could be using 24 bits for index) - R. From rdreier at cisco.com Thu Jun 8 14:23:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 14:23:22 -0700 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: <20060608204826.GB9957@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 8 Jun 2006 23:48:26 +0300") References: <20060608204826.GB9957@mellanox.co.il> Message-ID: Michael> Not in the driver I have: mthca_array_clear is at line Michael> 1351, mthca_cq_clean at line 1372. Isn't Michael> mthca_array_clear freeing the slot in QP table? Nope, the bitmap slot isn't freed until mthca_free(). Michael> But there might be more EQEs for this CQN outstanding in Michael> the EQ which we have not seen yet. Now that you mention it, that could be a real problem I guess. synchronize_irq() isn't enough because the interrupt handler might not have even started yet. But on the other hand a CQ can't be destroyed until after all associated QPs have been destroyed. So could we really miss EQEs for that long? - R. From mshefty at ichips.intel.com Thu Jun 8 15:00:56 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 08 Jun 2006 15:00:56 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149799142.4510.13468.camel@hal.voltaire.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> Message-ID: <44889E18.8010507@ichips.intel.com> Hal Rosenstock wrote: > 2. There is lazy deletion of MC groups allowed so the reclamation may be > difficult. I'm not familiar with the switch programming. 
Does the SM set the entire MulticastForwardingTable for a switch every time a new group is created, or a new member joins? If the SM loses track of all multicast groups, how are the stale groups on the switches deleted? > The endport SMAs are claiming they do support client reregistration but > it does take more than that for the endport/node to behave properly. My original plan was to have the ib_multicast module rejoin all groups, but since the MLIDs can change I can't see any way to handle reregistration safely without involving the application. My latest changes are just to report errors on existing multicast groups on an affected port. > I know it is a conceptual rather than actual compliance. One issue would > be defining what it means to respect all existing communication. Then we > would need to look at whether that was feasible or not and perhaps > rescope what it means to a set of things achievable. Another issue would > be defining where it is possible or not. If that is totally vendor > dependent, then this would have no substance to it. It is largely a > matter of being a "better" SM. We could use the phrase, "except where such communication is no longer realizable" instead of "where possible". Where unrealizable means impossible because the communication uses properties that are physically impossible to achieve given the hardware configuration of the subnet. (See bottom of page 910 of the spec.) If an SM could just query switches for their MulticastForwardingTables or the end nodes, would we be able to avoid these issues? - Sean From mst at mellanox.co.il Thu Jun 8 15:06:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 01:06:49 +0300 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: References: Message-ID: <20060608220649.GB10263@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: race in mthca_cq.c?
> > Michael> Not in the driver I have: mthca_array_clear is at line > Michael> 1351, mthca_cq_clean at line 1372. Isn't > Michael> mthca_array_clear freeing the slot in QP table? > > Nope, the bitmap slot isn't freed until mthca_free(). Oh. Right. I see it now. > Michael> But there might be more EQEs for this CQN outstanding in > Michael> the EQ which we have not seen yet. > > Now that you mention it, that could be a real problem I guess. > synchronize_irq() isn't enough because the interrupt handler might not > have even started yet. > > But on the other hand a CQ can't be destroyed until after all > associated QPs have been destroyed. So could we really miss EQEs for > that long? Yes, I think there might be spurious EQEs and they might get delayed in HW for a long time. Destroying QPs does not flush completion events out. So just this bit? -- Check EQE is not for a stale CQ number. Since high bits in CQ number are allocated by round-robin, we can be reasonably sure CQ number is different even for CQs which share a slot in the CQ table. Signed-off-by: Michael S.
Tsirkin --- openib/drivers/infiniband/hw/mthca/mthca_cq.c 2006-05-09 21:07:28.623383000 +0300 +++ /mswg/work/mst/tmp/infiniband1/hw/mthca/mthca_cq.c 2006-06-08 23:46:52.404499000 +0300 @@ -217,9 +217,9 @@ void mthca_cq_completion(struct mthca_de { struct mthca_cq *cq; cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); - if (!cq) { + if (!cq || cq->cqn != cqn) { mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); return; } -- MST From tom at opengridcomputing.com Thu Jun 8 15:27:55 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 08 Jun 2006 17:27:55 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <4488616B.7030701@in.ibm.com> References: <4488616B.7030701@in.ibm.com> Message-ID: <1149805675.12361.5.camel@trinity.ogc.int> Steve is fishing in the Florida Keys right now (or will be by morning), but if he were here, I think he would say -- "...sounds like you've found an rping bug, please post a patch" ;-) I would prefer the #define you proposed, e.g. #define RPING_MSG_FMT "rdma-ping-%d" #define RPING_MIN_BUFSIZ sizeof(itoa(INT_MAX))+sizeof(RPING_MSG_FMT) Then use the RPING_MSG_FMT symbol in the code that prepares the contents of the message. then if someone decides to change the string, the error checking still works. > Tom On Thu, 2006-06-08 at 23:12 +0530, Pradipta Kumar Banerjee wrote: > Sundeep Narravula wrote: > > Hi, > > > >> I don't see this problem at all. I am using kernel 2.6.16.16, SLES 9 glibc > >> version 2.3.3-98, gcc version 3.3.3 and AMSO1100 RNIC. > > > > The versions I used are glibc 2.3.4, kernel 2.6.16 and gcc 3.4.3 and > > AMSO1100 RNIC. > > > >> Will running it under gdb be of some help ? > > > > I am able to reproduce this error with/without gdb. The glibc error > > disappears with higher number of iterations. > > > > (gdb) r -c -vV -C10 -S10 -a 150.111.111.100 -p 9999 > > The problem is due to specifying a less than sufficient size (-S10, -S4) for the > buffer. 
If you look into the following lines from the function rping_test_client > in rping.c > > for (ping = 0; !cb->count || ping < cb->count; ping++) { > cb->state = RDMA_READ_ADV; > > /* Put some ascii text in the buffer. */ > ------> cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > > From the above it's clear that the minimum size for start_buf should be at least > sufficient to hold the string, which in the invocations mentioned here (-S10 or > -S4) is not the case. Hence you notice the glibc errors. > > > cb->start_buf is allocated in rping_setup_buffers() as > cb->start_buf = malloc(cb->size); > > Basically the check > > if ((cb->size < 1) || > (cb->size > (RPING_BUFSIZE - 1))) { > > in the main() should be changed to something like this > > #define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX)) + sizeof("rdma-ping-%d: ") > > ---> 'ping' is defined as a signed int, its maximum permissible value is defined > in limits.h (INT_MAX = 2147483647) > We can even hardcode the RPING_MIN_BUFSIZE to '19' if desired. > > if ((cb->size < RPING_MIN_BUFSIZE) || > (cb->size > (RPING_BUFSIZE - 1))) { > > Steve what do you say ?? > > > Thanks, > Pradipta Kumar. > > > > Starting program: /usr/local/bin/rping -c -vV -C10 -S10 -a 150.111.111.100 > > -p 9999 > > Reading symbols from shared object read from target memory...done.
> > Loaded system supplied DSO at 0xffffe000 > > [Thread debugging using libthread_db enabled] > > [New Thread -1208465728 (LWP 23960)] > > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > > driver search path: /usr/local/lib/infiniband > > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > > driver search path: /usr/local/lib/infiniband > > [New Thread -1208468560 (LWP 23963)] > > [New Thread -1216861264 (LWP 23964)] > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > cq completion failed status 5 > > DISCONNECT EVENT... > > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > > > > Program received signal SIGABRT, Aborted. > > [Switching to Thread -1208465728 (LWP 23960)] > > 0xffffe410 in __kernel_vsyscall () > > (gdb) > > > > --Sundeep. > > > >> Thanks > >> Pradipta Kumar. > >>>> Thanx, > >>>> > >>>> > >>>> Steve. > >>>> > >>>> > >>>> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > >>>>> Hi Steve, > >>>>> We are trying the new iwarp branch on ammasso adapters. The installation > >>>>> has gone fine. However, on running rping there is a error during > >>>>> disconnect phase. > >>>>> > >>>>> $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > >>>>> libibverbs: Warning: no userspace device-specific driver found for uverbs1 > >>>>> driver search path: /usr/local/lib/infiniband > >>>>> libibverbs: Warning: no userspace device-specific driver found for uverbs0 > >>>>> driver search path: /usr/local/lib/infiniband > >>>>> ping data: rdm > >>>>> ping data: rdm > >>>>> ping data: rdm > >>>>> ping data: rdm > >>>>> cq completion failed status 5 > >>>>> DISCONNECT EVENT... 
> >>>>> *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > >>>>> Aborted > >>>>> > >>>>> There are no apparent errors showing up in dmesg. Is this error > >>>>> currently expected? > >>>>> > >>>>> Thanks, > >>>>> --Sundeep. > >>>>> > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Thu Jun 8 15:47:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 01:47:53 +0300 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: References: Message-ID: <20060608224753.GC10263@mellanox.co.il> Quoting r. Roland Dreier : > there's no guarantee the upper bits won't > repeat -- or someone could be using 24 bits for index So we need something like mthca_clean_eq? -- MST From rdreier at cisco.com Thu Jun 8 16:01:47 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 16:01:47 -0700 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: <20060608224753.GC10263@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 9 Jun 2006 01:47:53 +0300") References: <20060608224753.GC10263@mellanox.co.il> Message-ID: Roland> there's no guarantee the upper bits won't repeat -- or Roland> someone could be using 24 bits for index Michael> So we need something like mthca_clean_eq? That's one obvious way to handle it. We could also keep a list of freed CQNs and make sure we don't reuse the CQNs until their associated EQ has been drained once. Or just call the handler for that EQ an extra time after freeing the CQ.
But I guess that would lead to tricky races against the regular interrupt handler. - R. From rdreier at cisco.com Thu Jun 8 16:09:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 16:09:18 -0700 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: (Roland Dreier's message of "Thu, 08 Jun 2006 16:01:47 -0700") References: <20060608224753.GC10263@mellanox.co.il> Message-ID: Michael> So we need something like mthca_clean_eq? Roland> That's one obvious way to handle it. Actually that looks very hard without adding locks to the interrupt handling fast path. - R. From sweitzen at cisco.com Thu Jun 8 16:38:12 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 8 Jun 2006 16:38:12 -0700 Subject: [openib-general] OFED-1.0-rc6 is available Message-ID: The MTU change undoes the changes for bug 81, so I have reopened bug 81 ( http://openib.org/bugzilla/show_bug.cgi?id=81). With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E osu_bibw performance is bad. I've enclosed some performance data, look at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. Are there other benchmarks driving the changes in rc6 (and rc4)? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems OSU MPI: * Added mpi_alltoall fine tuning parameters * Added default configuration/documentation file $MPIHOME/etc/mvapich.conf * Added shell configuration files $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh * Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For InfiniHost card recommended value is: VIADEV_DEFAULT_MTU=MTU1024 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: mpi_perf.xls Type: application/vnd.ms-excel Size: 33280 bytes Desc: mpi_perf.xls URL: From sean.hefty at intel.com Thu Jun 8 21:38:07 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 Jun 2006 21:38:07 -0700 Subject: [openib-general] [PATCH 1/2] multicast: notify users on membership errors Message-ID: Modify ib_multicast module to detect events that require clients to rejoin multicast groups. Add tracking of clients which are members of any groups, and provide notification to those clients when such an event occurs. This patch tracks all active members of a group. When an event occurs that requires clients to rejoin a multicast group, the active members are moved into an error state, and the clients are notified of a network reset error. The group is then reset to force additional join requests to generate requests to the SA. Signed-off-by: Sean Hefty --- Hal, can you apply these patches and see if it fixes the issues that you are experiencing. These should eliminate any races with ipoib leaving, then quickly re-joining a group as a result of an event. 
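The error flow the patch description outlines can be sketched in plain C. This is a simplified userspace illustration with made-up names (`member`, `process_group_error`, the `-102` stand-in for `-ENETRESET`), not the kernel code: each active member is moved into an error state and notified with a reset status, and a nonzero callback return tells the core to free that membership, mirroring the convention the patch relies on.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types: each member
 * of a multicast group has a state and a callback; the callback's
 * return value tells the core whether to free the membership (nonzero)
 * or keep it (zero). */

enum member_state { MCAST_MEMBER = 0, MCAST_ERROR = 1 };

struct member {
    enum member_state state;
    int (*callback)(int status, struct member *m);
    int freed;
};

#define ENETRESET_STATUS (-102)  /* stand-in for -ENETRESET */

/* Analog of process_group_error(): move every active member into the
 * error state, deliver the reset status, and free memberships whose
 * callback asks for it.  Locking is omitted in this sketch. */
void process_group_error(struct member *members, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        members[i].state = MCAST_ERROR;
        if (members[i].callback(ENETRESET_STATUS, &members[i]))
            members[i].freed = 1;
    }
}

/* An ipoib-style client: it traps port events itself, so it ignores
 * the per-group reset and keeps its membership. */
int keep_on_reset(int status, struct member *m)
{
    (void)m;
    return status == ENETRESET_STATUS ? 0 : 1;
}

/* A client that simply gives up when the group errors out. */
int drop_on_reset(int status, struct member *m)
{
    (void)status; (void)m;
    return 1;
}
```

Running `process_group_error` over a group containing one client of each kind leaves both in the error state, but only the second membership is freed.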
Index: multicast.c =================================================================== --- multicast.c (revision 7805) +++ multicast.c (working copy) @@ -61,6 +61,7 @@ static struct ib_client mcast_client = { .remove = mcast_remove_one }; +static struct ib_event_handler event_handler; static struct workqueue_struct *mcast_wq; struct mcast_device; @@ -86,6 +87,7 @@ enum mcast_state { MCAST_JOINING, MCAST_MEMBER, MCAST_BUSY, + MCAST_ERROR }; struct mcast_member; @@ -97,6 +99,7 @@ struct mcast_group { spinlock_t lock; struct work_struct work; struct list_head pending_list; + struct list_head active_list; struct mcast_member *last_join; int members[3]; atomic_t refcount; @@ -338,6 +341,8 @@ static void join_group(struct mcast_grou group->rec.join_state |= join_state; member->multicast.rec = group->rec; member->multicast.rec.join_state = join_state; + list_del(&member->list); + list_add(&member->list, &group->active_list); } static int fail_join(struct mcast_group *group, struct mcast_member *member, @@ -349,6 +354,34 @@ static int fail_join(struct mcast_group return member->multicast.callback(status, &member->multicast); } +static void process_group_error(struct mcast_group *group) +{ + struct mcast_member *member; + int ret; + + spin_lock_irq(&group->lock); + while (!list_empty(&group->active_list)) { + member = list_entry(group->active_list.next, + struct mcast_member, list); + atomic_inc(&member->refcount); + list_del_init(&member->list); + adjust_membership(group, member->multicast.rec.join_state, -1); + member->state = MCAST_ERROR; + spin_unlock_irq(&group->lock); + + ret = member->multicast.callback(-ENETRESET, + &member->multicast); + deref_member(member); + if (ret) + ib_free_multicast(&member->multicast); + spin_lock_irq(&group->lock); + } + + group->rec.join_state = 0; + group->state = MCAST_BUSY; + spin_unlock_irq(&group->lock); +} + static void mcast_work_handler(void *data) { struct mcast_group *group = data; @@ -359,6 +392,12 @@ static void 
mcast_work_handler(void *dat retest: spin_lock_irq(&group->lock); + if (group->state == MCAST_ERROR) { + spin_unlock_irq(&group->lock); + process_group_error(group); + goto retest; + } + while (!list_empty(&group->pending_list)) { member = list_entry(group->pending_list.next, struct mcast_member, list); @@ -371,8 +410,8 @@ retest: multicast->comp_mask); if (!status) join_group(group, member, join_state); - - list_del_init(&member->list); + else + list_del_init(&member->list); spin_unlock_irq(&group->lock); ret = multicast->callback(status, multicast); } else { @@ -467,6 +506,7 @@ static struct mcast_group *acquire_group group->port = port; group->rec.mgid = *mgid; INIT_LIST_HEAD(&group->pending_list); + INIT_LIST_HEAD(&group->active_list); INIT_WORK(&group->work, mcast_work_handler, group); spin_lock_init(&group->lock); @@ -551,16 +591,10 @@ void ib_free_multicast(struct ib_multica group = member->group; spin_lock_irq(&group->lock); - switch (member->state) { - case MCAST_MEMBER: + if (member->state == MCAST_MEMBER) adjust_membership(group, multicast->rec.join_state, -1); - break; - case MCAST_JOINING: - list_del_init(&member->list); - break; - default: - break; - } + + list_del_init(&member->list); if (group->state == MCAST_IDLE) { group->state = MCAST_BUSY; @@ -578,6 +612,48 @@ void ib_free_multicast(struct ib_multica } EXPORT_SYMBOL(ib_free_multicast); +static void mcast_groups_lost(struct mcast_port *port) +{ + struct mcast_group *group; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + for (node = rb_first(&port->table); node; node = rb_next(node)) { + group = rb_entry(node, struct mcast_group, node); + spin_lock(&group->lock); + if (group->state == MCAST_IDLE) { + atomic_inc(&group->refcount); + queue_work(mcast_wq, &group->work); + } + group->state = MCAST_ERROR; + spin_unlock(&group->lock); + } + spin_unlock_irqrestore(&port->lock, flags); +} + +static void mcast_event_handler(struct ib_event_handler *handler, + 
struct ib_event *event) +{ + struct mcast_device *dev; + + dev = ib_get_client_data(event->device, &mcast_client); + if (!dev) + return; + + switch (event->event) { + case IB_EVENT_PORT_ERR: + case IB_EVENT_LID_CHANGE: + case IB_EVENT_SM_CHANGE: + case IB_EVENT_CLIENT_REREGISTER: + mcast_groups_lost(&dev->port[event->element.port_num - + dev->start_port]); + break; + default: + break; + } +} + static void mcast_add_one(struct ib_device *device) { struct mcast_device *dev; @@ -611,6 +687,9 @@ static void mcast_add_one(struct ib_devi dev->device = device; ib_set_client_data(device, &mcast_client, dev); + + INIT_IB_EVENT_HANDLER(&event_handler, device, mcast_event_handler); + ib_register_event_handler(&event_handler); } static void mcast_remove_one(struct ib_device *device) @@ -623,6 +702,7 @@ static void mcast_remove_one(struct ib_d if (!dev) return; + ib_unregister_event_handler(&event_handler); flush_workqueue(mcast_wq); for (i = 0; i < dev->end_port - dev->start_port; i++) { From sean.hefty at intel.com Thu Jun 8 21:54:59 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 Jun 2006 21:54:59 -0700 Subject: [openib-general] [PATCH 2/2] ipoib: handle multicast group reset notification In-Reply-To: Message-ID: Ipoib already checks for events that require rejoining multicast groups. We just need to add code to handle (i.e. ignore) multicast group reset notifications. Signed-off-by: Sean Hefty --- Ignoring the callback is a simple fix. I didn't try to see what it would take to have ipoib use the ib_multicast event to trigger a re-join. My guess is that it would be less efficient, since ipoib would get a callback for every group on the affected port. 
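The client-side convention described above can be condensed into one small function. This is an illustrative sketch (the name `join_complete` and the `-102` stand-in for `-ENETRESET` are assumptions, not the actual ipoib code): a client that already re-joins on port events ignores the reset notification by returning its previous state unchanged, while any other status is treated as the real join result.

```c
/* Returns the new "joined" flag given the previous one and a
 * completion status, mirroring the early return the patch adds:
 * the reset notification is a no-op because the client's own port
 * event handler drives the re-join. */

#define SKETCH_ENETRESET (-102)   /* stand-in for -ENETRESET */

int join_complete(int prev_joined, int status)
{
    if (status == SKETCH_ENETRESET)
        return prev_joined;   /* we trap port events ourselves */
    return status == 0;       /* 0 = join finished, else join failed */
}
```

So a successful join sets the flag, a reset leaves it alone, and a genuine error clears it.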
Index: ipoib_multicast.c =================================================================== --- ipoib_multicast.c (revision 7758) +++ ipoib_multicast.c (working copy) @@ -306,6 +306,10 @@ ipoib_mcast_sendonly_join_complete(int s struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); + /* We trap for port events ourselves. */ + if (status == -ENETRESET) + return 0; + if (!status) status = ipoib_mcast_join_finish(mcast, &multicast->rec); @@ -390,6 +394,10 @@ static int ipoib_mcast_join_complete(int " (status %d)\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); + /* We trap for port events ourselves. */ + if (status == -ENETRESET) + return 0; + if (!status) status = ipoib_mcast_join_finish(mcast, &multicast->rec); From mst at mellanox.co.il Thu Jun 8 22:09:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 08:09:10 +0300 Subject: [openib-general] Re: [PATCH 1/2] multicast: notify users on membership errors In-Reply-To: References: Message-ID: <20060609050910.GD10263@mellanox.co.il> Quoting r. Sean Hefty : > These should eliminate any races with ipoib leaving, > then quickly re-joining a group as a result of an event. Is there a chance this will fix the crashes Or and I were seeing?
-- MST From bpradip at in.ibm.com Thu Jun 8 23:01:39 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 9 Jun 2006 11:31:39 +0530 Subject: [openib-general] [PATCH] rping: Erroneous check for minimum ping buffer size Message-ID: <20060609060138.GB13602@harry-potter.in.ibm.com> rping didn't check correctly for the minimum size of the ping buffer, resulting in the following error from glibc "*** glibc detected *** free(): invalid next size (fast)" Signed-off-by: Pradipta Kumar Banerjee --- Index: rping.c ============================================================= --- rping.org 2006-06-09 10:57:43.000000000 +0530 +++ rping.c 2006-06-09 11:00:28.000000000 +0530 @@ -96,6 +96,12 @@ struct rping_rdma_info { #define RPING_BUFSIZE 64*1024 #define RPING_SQ_DEPTH 16 +/* Default string for print data and + * minimum buffer size + */ +#define RPING_MSG_FMT "rdma-ping-%d: " +#define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX))+sizeof(RPING_MSG_FMT) + /* * Control block struct. */ @@ -774,7 +780,7 @@ static void rping_test_client(struct rpi cb->state = RDMA_READ_ADV; /* Put some ascii text in the buffer.
*/ - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); for (i = cc, c = start; i < cb->size; i++) { cb->start_buf[i] = c; c++; @@ -977,11 +983,11 @@ int main(int argc, char *argv[]) break; case 'S': cb->size = atoi(optarg); - if ((cb->size < 1) || + if ((cb->size < RPING_MIN_BUFSIZE) || (cb->size > (RPING_BUFSIZE - 1))) { fprintf(stderr, "Invalid size %d " - "(valid range is 1 to %d)\n", - cb->size, RPING_BUFSIZE); + "(valid range is %d to %d)\n", + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); ret = EINVAL; } else DEBUG_LOG("size %d\n", (int) atoi(optarg)); From bpradip at in.ibm.com Fri Jun 9 00:14:28 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 09 Jun 2006 12:44:28 +0530 Subject: [openib-general] [PATCH] rping: Erroneous check for minimum ping buffer size In-Reply-To: <20060609060138.GB13602@harry-potter.in.ibm.com> References: <20060609060138.GB13602@harry-potter.in.ibm.com> Message-ID: <44891FD4.30104@in.ibm.com> Pradipta Kumar Banerjee wrote: > rping didn't check correctly for the minimum size of the ping > buffer, resulting in the following error from glibc > > "*** glibc detected *** free(): invalid next size (fast)" > > Signed-off-by: Pradipta Kumar Banerjee > --- > > Index: rping.c > ============================================================= > --- rping.org 2006-06-09 10:57:43.000000000 +0530 > +++ rping.c 2006-06-09 11:00:28.000000000 +0530 > @@ -96,6 +96,12 @@ struct rping_rdma_info { > #define RPING_BUFSIZE 64*1024 > #define RPING_SQ_DEPTH 16 > > +/* Default string for print data and > + * minimum buffer size > + */ > +#define RPING_MSG_FMT "rdma-ping-%d: " > +#define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX))+sizeof(RPING_MSG_FMT) > + Tom, Just found that 'itoa' is not a built-in library function. The sizeof is returning '4' which is not what we really want. Do we hard-code the value to 10 ( like #define RPING_MIN_BUFSIZE 10 + sizeof(RPING_MSG_FMT) )?
INT_MAX is 2147483647 (10 chars). Other options might include writing our own 'itoa'. Thanks, Pradipta Kumar. > /* > * Control block struct. > */ > @@ -774,7 +780,7 @@ static void rping_test_client(struct rpi > cb->state = RDMA_READ_ADV; > > /* Put some ascii text in the buffer. */ > - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); > for (i = cc, c = start; i < cb->size; i++) { > cb->start_buf[i] = c; > c++; > @@ -977,11 +983,11 @@ int main(int argc, char *argv[]) > break; > case 'S': > cb->size = atoi(optarg); > - if ((cb->size < 1) || > + if ((cb->size < RPING_MIN_BUFSIZE) || > (cb->size > (RPING_BUFSIZE - 1))) { > fprintf(stderr, "Invalid size %d " > - "(valid range is 1 to %d)\n", > - cb->size, RPING_BUFSIZE); > + "(valid range is %d to %d)\n", > + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); > ret = EINVAL; > } else > DEBUG_LOG("size %d\n", (int) atoi(optarg)); > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From mst at mellanox.co.il Fri Jun 9 00:59:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 10:59:12 +0300 Subject: [openib-general] [PATCH] mthca: send opcode in error CQE for debug Message-ID: <20060609075912.GB25811@mellanox.co.il> I find the following helpful for debug. Pls consider for 2.6.18 -- While the IB spec does not require opcode to be valid in error CQEs, Mellanox HCAs differentiate between send/receive errors, which is useful for debugging purposes. Signed-off-by: Michael S.
Tsirkin Index: last_stable/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2006-06-09 10:14:53.000000000 +0300 +++ last_stable/drivers/infiniband/hw/mthca/mthca_cq.c 2006-06-09 10:15:08.000000000 +0300 @@ -562,6 +562,7 @@ static inline int mthca_poll_one(struct handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, (struct mthca_err_cqe *) cqe, entry, &free_cqe); + entry->opcode = is_send ? IB_WC_SEND : IB_WC_RECV; goto out; } -- MST From halr at voltaire.com Fri Jun 9 03:43:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 06:43:13 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <44889E18.8010507@ichips.intel.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> Message-ID: <1149849791.4510.41634.camel@hal.voltaire.com> On Thu, 2006-06-08 at 18:00, Sean Hefty wrote: > Hal Rosenstock wrote: > > 2. There is lazy deletion of MC groups allowed so the reclamation may be > > difficult. > > I'm not familiar with the switch programming. Note the MGRPs are MGIDs and switches are programmed with MLIDs and these can be 1:1 or many:1 depending on the implementation. Most do not do the many:1 but this is allowed by the spec. Also, note that switches know nothing about the groups themselves (only MLIDs and which ports) so most of the information is in the SM. > Does the SM set the entire > MulticastForwardingTable for a switch every time a new group is created, or a > new member joins? No. It only needs to program the affected block(s) of the MFT based on the MLID and the portmask (ports for replication). > If the SM loses track of all multicast groups, how are the > stale groups on the switches deleted? There are different strategies for dealing with this. It could clear out all the MFTs in all the switches but that is expensive. 
It could also wait for multicast registrations and then program the needed MFT blocks in the affected switches only caring about those. In this case, packets on those MLIDs would still be forwarded until the MLID is reclaimed. > > The endport SMAs are claiming they do support client reregistration but > > it does take more than that for the endport/node to behave properly. > > My original plan was to have the ib_multicast module rejoin all groups, but > since the MLIDs can change I can't see any way to handle reregistration safely > without involving the application. Because the application needs to modify the QP for this ? As I said, I'm not sure IPoIB was handling this before. I'm sure Roland knows for sure. > My latest changes are just to report errors > on existing multicast groups on an affected port. How ? > > I know it is a conceptual rather than actual compliance. One issue would > > be defining what it means to repect all existing communication. Then we > > would need to look at whether that was feasible or not and perhaps > > rescope what it means to a set of things achievable. Another issue would > > be defining where it is possible or not. If that is totally vendor > > dependent, then this would have no substance to it. It is largely a > > matter of being a "better" SM. > > We could use the phrase, "except where such communication is no longer > realizable" instead of "where possible". Where unrealizable means impossible > because the communication uses properties that are physically impossible to > achieve given the hardware configuration of the subnet. (See bottom of page 910 > of the spec.) That specific text is defined there for the case of unrealizable joins which is very different from the case being discussed. The specific property mismatches are listed. Still not sure what determines this in the case we are discussing. > If an SM could just query switches for their MulticastForwardingTables or the > end nodes, It can. 
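The block arithmetic behind programming (or querying) only the affected MFT entries can be sketched as follows. This is a simplified illustration assuming the standard attribute layout — multicast LIDs based at 0xC000, each MulticastForwardingTable block holding 32 16-bit PortMask entries — so the SM only has to touch the one block containing a changed MLID; the function names are made up for the sketch.

```c
#include <stdint.h>

#define MLID_BASE      0xC000u   /* first multicast LID */
#define MFT_BLOCK_SIZE 32u       /* PortMask entries per MFT block */

/* Which 32-entry MFT block an MLID falls into, and its offset
 * within that block. */
unsigned mft_block(uint16_t mlid)  { return (mlid - MLID_BASE) / MFT_BLOCK_SIZE; }
unsigned mft_offset(uint16_t mlid) { return (mlid - MLID_BASE) % MFT_BLOCK_SIZE; }

/* Is the given port (0..15 within the selected port group) set in
 * the PortMask for this MLID, i.e. should the switch replicate
 * packets for the MLID out that port? */
int mft_port_set(uint16_t portmask, unsigned port)
{
    return (portmask >> port) & 1;
}
```

For example, MLID 0xC021 lands at offset 1 of block 1, so only that block needs to be rewritten when its membership changes.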
> would we be able to avoid these issues? How ? Not all the group information is in the switches. -- Hal > - Sean From eli at mellanox.co.il Fri Jun 9 03:53:54 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Fri, 9 Jun 2006 13:53:54 +0300 Subject: [openib-general] [Bug 122] New: mad layer problem Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30249F8EE@mtlexch01.mtl.com> Hi, Here is some info: 1. Attached are the SysRq messages. 2. The relation of MADs to ARP is that after ARP resolves a hardware address it is required to use an SM query to resolve the path to the host bearing the hardware address. 3. How to invoke the tests: Attached one readme file and one configuration file. Eli -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages.bz2 Type: application/octet-stream Size: 10202 bytes Desc: messages.bz2 URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 114-115-800conn.conf Type: application/octet-stream Size: 1571 bytes Desc: 114-115-800conn.conf URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: polyReadme.txt URL: From halr at voltaire.com Fri Jun 9 04:26:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 07:26:18 -0400 Subject: [openib-general] [PATCH] ibroute: When multiple paths, indicate port GUID on alternate paths Message-ID: <1149852273.4510.42972.camel@hal.voltaire.com> ibroute: When multiple paths, indicate port GUID on alternate paths Signed-off-by: Hal Rosenstock Index: diags/src/ibroute.c =================================================================== --- diags/src/ibroute.c (revision 7646) +++ diags/src/ibroute.c (working copy) @@ -272,10 +272,22 @@ dump_lid(char *str, int strlen, int lid, if (!valid) return snprintf(str, strlen, ": (path #%d - illegal port)", lid - base_port_lid); - else - return snprintf(str, strlen, ": (path #%d out of %d)", - lid - base_port_lid + 1, - last_port_lid - base_port_lid + 1); + else { + lidport.lid = lid; + if (!smp_query(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100)) + return snprintf(str, strlen, + ": (path #%d out of %d)", + lid - base_port_lid + 1, + last_port_lid - base_port_lid + 1); + else { + mad_decode_field(ni, IB_NODE_PORT_GUID_F, &portguid); + return snprintf(str, strlen, + ": (path #%d out of %d: portguid %s)", + lid - base_port_lid + 1, + last_port_lid - base_port_lid + 1, + mad_dump_val(IB_NODE_PORT_GUID_F, sguid, sizeof sguid, &portguid)); + } + } } if (!valid) From halr at voltaire.com Fri Jun 9 06:07:24 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 09:07:24 -0400 Subject: [openib-general] Re: [PATCH 1/2] multicast: notify users on membership errors In-Reply-To: References: Message-ID: <1149858444.4744.73.camel@hal.voltaire.com> On Fri, 2006-06-09 at 00:38, Sean Hefty wrote: > Modify ib_multicast module to detect events that require clients to rejoin > multicast groups. Add tracking of clients which are members of any groups, > and provide notification to those clients when such an event occurs. > > This patch tracks all active members of a group. 
When an event occurs that > requires clients to rejoin a multicast group, the active members are moved > into an error state, and the clients are notified of a network reset error. > The group is then reset to force additional join requests to generate requests > to the SA. > > Signed-off-by: Sean Hefty > --- > Hal, can you apply these patches and see if it fixes the issues that you > are experiencing. These should eliminate any races with ipoib leaving, > then quickly re-joining a group as a result of an event. This is working better now. Thanks! -- Hal From halr at voltaire.com Fri Jun 9 06:11:56 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 09:11:56 -0400 Subject: [openib-general] [PATCH] osmtest: Support LMC > 0 Message-ID: <1149858716.4744.224.camel@hal.voltaire.com> osmtest: Support LMC > 0 Signed-off-by: Hal Rosenstock Index: osmtest/osmtest.c =================================================================== --- osmtest/osmtest.c (revision 7839) +++ osmtest/osmtest.c (working copy) @@ -1609,6 +1609,74 @@ osmtest_stress_port_recs_small( IN osmte } /********************************************************************** + **********************************************************************/ +ib_api_status_t +osmtest_get_local_port_lmc( IN osmtest_t * const p_osmt, + OUT uint8_t * const p_lmc ) +{ + osmtest_req_context_t context; + ib_portinfo_record_t *p_rec; + uint32_t i; + cl_status_t status; + uint32_t num_recs = 0; + + OSM_LOG_ENTER( &p_osmt->log, osmtest_get_local_port_lmc ); + + memset( &context, 0, sizeof( context ) ); + + /* + * Do a blocking query for our own PortRecord in the subnet. 
+ */ + status = osmtest_get_port_rec( p_osmt, + cl_ntoh16(p_osmt->local_port.lid), + &context ); + + if( status != IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_local_port_lmc: ERR 001A: " + "osmtest_get_port_rec failed (%s)\n", + ib_get_err_str( status ) ); + goto Exit; + } + + num_recs = context.result.result_cnt; + + if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) + { + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_get_local_port_lmc: " + "Received %u records\n", num_recs ); + } + + for( i = 0; i < num_recs; i++ ) + { + p_rec = osmv_get_query_portinfo_rec( context.result.p_result_madw, i ); + osm_dump_portinfo_record( &p_osmt->log, p_rec, OSM_LOG_VERBOSE ); + if ( p_lmc) + { + *p_lmc = ib_port_info_get_lmc( &p_rec->port_info ); + osm_log( &p_osmt->log, OSM_LOG_DEBUG, + "osmtest_get_local_port_lmc: " + "LMC %d\n", *p_lmc ); + } + } + + Exit: + /* + * Return the IB query MAD to the pool as necessary. + */ + if( context.result.p_result_madw != NULL ) + { + osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); + context.result.p_result_madw = NULL; + } + + OSM_LOG_EXIT( &p_osmt->log ); + return ( status ); +} + +/********************************************************************** * Use a wrong SM_Key in a simple port query and report success if * failed. **********************************************************************/ @@ -3100,6 +3168,7 @@ osmtest_validate_path_data( IN osmtest_t IN const ib_path_rec_t * const p_rec ) { cl_status_t status = IB_SUCCESS; + uint8_t lmc = 0; OSM_LOG_ENTER( &p_osmt->log, osmtest_validate_path_data ); @@ -3111,17 +3180,38 @@ osmtest_validate_path_data( IN osmtest_t cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ) ); } - /* - * Has this record already been returned? 
- */ - if( p_path->count != 0 ) + status = osmtest_get_local_port_lmc( p_osmt, &lmc ); + + /* HACK: Assume uniform LMC across endports in the subnet */ + /* In absence of this assumption, validation of this is much more complicated */ + if ( lmc == 0 ) { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmtest_validate_path_data: ERR 0056: " - "Already received path SLID 0x%X to DLID 0x%X\n", - cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ) ); - status = IB_ERROR; - goto Exit; + /* + * Has this record already been returned? + */ + if( p_path->count != 0 ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_validate_path_data: ERR 0056: " + "Already received path SLID 0x%X to DLID 0x%X\n", + cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ) ); + status = IB_ERROR; + goto Exit; + } + } + else + { + /* Also, this doesn't detect fewer than the correct number of paths being returned */ + if ( p_path->count >= ( 1 << lmc ) * ( 1 << lmc ) ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_validate_path_data: ERR 0052: " + "Already received path SLID 0x%X to DLID 0x%X count %d LMC %d\n", + cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ), + p_path->count, lmc ); + status = IB_ERROR; + goto Exit; + } } ++p_path->count; From halr at voltaire.com Fri Jun 9 06:52:02 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 09:52:02 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149849791.4510.41634.camel@hal.voltaire.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> Message-ID: <1149861120.4744.1558.camel@hal.voltaire.com> On Fri, 2006-06-09 at 06:43, Hal Rosenstock wrote: > On Thu, 2006-06-08 at 18:00, Sean Hefty wrote: > > Hal Rosenstock wrote: > > > 2. There is lazy deletion of MC groups allowed so the reclamation may be > > > difficult. > > > > I'm not familiar with the switch programming. 
> > Note the MGRPs are MGIDs and switches are programmed with MLIDs and > these can be 1:1 or many:1 depending on the implementation. Most do not > do the many:1 but this is allowed by the spec. Also, note that switches > know nothing about the groups themselves (only MLIDs and which ports) so > most of the information is in the SM. > > > Does the SM set the entire > > MulticastForwardingTable for a switch every time a new group is created, or a > new member joins? > > No. It only needs to program the affected block(s) of the MFT based on > the MLID and the portmask (ports for replication). > > > If the SM loses track of all multicast groups, how are the > > stale groups on the switches deleted? > > There are different strategies for dealing with this. It could clear out > all the MFTs in all the switches but that is expensive. It could also > wait for multicast registrations and then program the needed MFT blocks > in the affected switches only caring about those. In this case, packets > on those MLIDs would still be forwarded until the MLID is reclaimed. > > > The endport SMAs are claiming they do support client reregistration but > > it does take more than that for the endport/node to behave properly. > > > > My original plan was to have the ib_multicast module rejoin all groups, but > since the MLIDs can change I can't see any way to handle reregistration safely > without involving the application. > > Because the application needs to modify the QP for this ? As I said, I'm > not sure IPoIB was handling this before. I'm sure Roland knows for sure. It does look to me like the pre multicast module IPoIB does leave and then rejoin on receipt of a client reregister from the SM. -- Hal > > My latest changes are just to report errors > > on existing multicast groups on an affected port. > > How ? > > > I know it is a conceptual rather than actual compliance. One issue would > > > be defining what it means to respect all existing communication.
Then we > > > would need to look at whether that was feasible or not and perhaps > > > rescope what it means to a set of things achievable. Another issue would > > > be defining where it is possible or not. If that is totally vendor > > > dependent, then this would have no substance to it. It is largely a > > > matter of being a "better" SM. > > > > We could use the phrase, "except where such communication is no longer > > realizable" instead of "where possible". Where unrealizable means impossible > > because the communication uses properties that are physically impossible to > > achieve given the hardware configuration of the subnet. (See bottom of page 910 > > of the spec.) > > That specific text is defined there for the case of unrealizable joins > which is very different from the case being discussed. The specific > property mismatches are listed. Still not sure what determines this in > the case we are discussing. > > > If an SM could just query switches for their MulticastForwardingTables or the > > end nodes, > > It can. > > > would we be able to avoid these issues? > > How ? Not all the group information is in the switches. 
> > -- Hal > > > - Sean > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From krause at cup.hp.com Fri Jun 9 06:59:30 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 09 Jun 2006 06:59:30 -0700 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> References: <7.0.1.0.2.20060605081948.044849d0@netapp.com> <20060606074314.GC2432@mellanox.co.il> <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> Message-ID: <6.2.0.14.2.20060609065546.02cd8b38@esmail.cup.hp.com> Whether iWARP or IB, there is a fixed number of RDMA Requests allowed to be outstanding at any given time. If one posts more RDMA Read requests than the fixed number, the transmit queue is stalled. This is documented in both technology specifications. It is something that all ULP should be aware of and some go so far as to communicate that as part of the Hello / login exchange. This allows the ULP implementation to determine whether it wants to stall or wants to wait until Read Responses complete before sending another request. This isn't something silent; this isn't something new; this is something for the ULP implementation to decide how to deal with the issue. BTW, this is part of the hardware and associated specifications so it is up to software to deal with the limited hardware resources and the associated consequences. Please keep in mind that there are a limited number of RDMA Request / Atomic resource "slots" at the receiving HCA / RNIC. These are kept in hardware thus one must know the exact limit to avoid creating protocol problems. A ULP transmitter may post to the transmit queue more than the allotted slots but the transmitting (source) HCA / RNIC must not issue them to the remote. 
These requests do cause the source to stall. This is a well understood problem and if people give the iSCSI / iSER and DA specs (or SDP) a good read they can see that this issue is comprehended. I agree with people that ULP designers / implementers must pay close attention to this constraint as it is in the iWARP / IB specifications for a very good reason and these semantics must be preserved to maintain the ordering requirements that are used by the overall RDMA protocols themselves. Mike At 05:24 AM 6/6/2006, Talpey, Thomas wrote: >At 03:43 AM 6/6/2006, Michael S. Tsirkin wrote: > >Quoting r. Talpey, Thomas : > >> Semantically, the provider is not required to provide any such flow > control > >> behavior by the way. The Mellanox one apparently does, but it is not > >> a requirement of the verbs, it's a requirement on the upper layer. If more > >> RDMA Reads are posted than the remote peer supports, the connection > >> may break. > > > >This does not sound right. Isn't this the meaning of this field: > >"Initiator Depth: Number of RDMA Reads & atomic operations > >outstanding at any time"? Shouldn't any provider enforce this limit? > >The core spec does not require it. An implementation *may* enforce it, >but is not *required* to do so. And as pointed out in the other message, >there are repercussions of doing so. > >I believe the silent queue stalling is a bit of a time bomb for upper layers, >whose implementers are quite likely unaware of the danger. I greatly >prefer an implementation which simply sends the RDMA Read request, >resulting in a failed (but unblocked!) connection. Silence is a very >dangerous thing, no matter how helpful the intent. >Tom.
> > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Fri Jun 9 07:23:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 10:23:57 -0400 Subject: [openib-general] [PATCH] ibnetdiscover: Add LMC display to switch port 0 Message-ID: <1149863036.5093.773.camel@hal.voltaire.com> ibnetdiscover: Add LMC display to switch port 0 Signed-off-by: Hal Rosenstock Index: src/ibnetdiscover.c =================================================================== --- src/ibnetdiscover.c (revision 7841) +++ src/ibnetdiscover.c (working copy) @@ -158,6 +158,7 @@ get_node(Node *node, Port *port, ib_port return 0; node->smalid = port->lid; + node->smalmc = port->lmc; DEBUG("portid %s: got switch node %Lx '%s'", portid2str(portid), node->nodeguid, nd); @@ -530,9 +531,9 @@ out_switch(Node *node, int group) } } - fprintf(f, "\nSwitch\t%d %s\t\t# %s port 0 lid %d\n", + fprintf(f, "\nSwitch\t%d %s\t\t# %s port 0 lid %d lmc %d\n", node->numports, node_name(node), - clean_nodedesc(node->nodedesc), node->smalid); + clean_nodedesc(node->nodedesc), node->smalid, node->smalmc); } void Index: include/ibnetdiscover.h =================================================================== --- include/ibnetdiscover.h (revision 7841) +++ include/ibnetdiscover.h (working copy) @@ -82,6 +82,7 @@ struct Node { int numports; int localport; int smalid; + int smalmc; uint32_t devid; uint32_t vendid; uint64_t sysimgguid; From halr at voltaire.com Fri Jun 9 08:01:44 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 11:01:44 -0400 Subject: [openib-general] [PATCH] ibnetdiscover: Indicate SP0 type Message-ID: <1149865304.5093.2035.camel@hal.voltaire.com> ibnetdiscover: Indicate SP0 type 
Signed-off-by: Hal Rosenstock Index: diags/src/ibnetdiscover.c =================================================================== --- diags/src/ibnetdiscover.c (revision 7842) +++ diags/src/ibnetdiscover.c (working copy) @@ -126,7 +126,9 @@ int get_node(Node *node, Port *port, ib_portid_t *portid) { char portinfo[64]; + char switchinfo[32]; void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc; + void *si = switchinfo; if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout)) return -1; @@ -160,6 +162,12 @@ get_node(Node *node, Port *port, ib_port node->smalid = port->lid; node->smalmc = port->lmc; + if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) + node->smaenhsp0 = 0; /* assume base SP0 */ + else { + mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); + } + DEBUG("portid %s: got switch node %Lx '%s'", portid2str(portid), node->nodeguid, nd); return 1; @@ -531,9 +539,11 @@ out_switch(Node *node, int group) } } - fprintf(f, "\nSwitch\t%d %s\t\t# %s port 0 lid %d lmc %d\n", + fprintf(f, "\nSwitch\t%d %s\t\t# %s %s port 0 lid %d lmc %d\n", node->numports, node_name(node), - clean_nodedesc(node->nodedesc), node->smalid, node->smalmc); + clean_nodedesc(node->nodedesc), + node->smaenhsp0 ? 
"enhanced" : "base", + node->smalid, node->smalmc); } void Index: diags/include/ibnetdiscover.h =================================================================== --- diags/include/ibnetdiscover.h (revision 7842) +++ diags/include/ibnetdiscover.h (working copy) @@ -83,6 +83,7 @@ struct Node { int localport; int smalid; int smalmc; + int smaenhsp0; uint32_t devid; uint32_t vendid; uint64_t sysimgguid; From mshefty at ichips.intel.com Fri Jun 9 09:46:37 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 09:46:37 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149849791.4510.41634.camel@hal.voltaire.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> Message-ID: <4489A5ED.8060200@ichips.intel.com> Hal Rosenstock wrote: > Note the MGRPs are MGIDs and switches are programmed with MLIDs and > these can be 1:1 or many:1 depending on the implementation. Most do not > do the many:1 but this is allowed by the spec. Also, note that switches > know nothing about the groups themselves (only MLIDs and which ports) so > most of the information is in the SM. Is there any chance that someone using an "old" join can receive data on a group that was created after an SM restart? My guess is that the QP would discard the message unless both the MLID and MGIDs matched, so there's probably not a real issue here. > How ? Not all the group information is in the switches. It's likely that the end nodes have the mcmember records from previous joins. Isn't that along with the switch information enough to reconstruct the group information? - Sean From sweitzen at cisco.com Fri Jun 9 10:44:56 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 9 Jun 2006 10:44:56 -0700 Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI? 
Message-ID: While we're talking about MTUs, is the IB MTU tunable in uDAPL and/or Intel MPI via env var or config file? Looks like Intel MPI 2.0.1 uses 2K for IB MTU like MVAPICH does in OFED 1.0 rc4 and rc6, I'd like to try 1K with Intel MPI. Scott ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Scott Weitzenkamp (sweitzen) Sent: Thursday, June 08, 2006 4:38 PM To: Tziporet Koren; openfabrics-ewg at openib.org Cc: openib-general Subject: RE: [openib-general] OFED-1.0-rc6 is available The MTU change undoes the changes for bug 81, so I have reopened bug 81 (http://openib.org/bugzilla/show_bug.cgi?id=81). With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E osu_bibw performance is bad. I've enclosed some performance data, look at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. Are there other benchmarks driving the changes in rc6 (and rc4)? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems OSU MPI: * Added mpi_alltoall fine tuning parameters * Added default configuration/documentation file $MPIHOME/etc/mvapich.conf * Added shell configuration files $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh * Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For InfiniHost card recommended value is: VIADEV_DEFAULT_MTU=MTU1024 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Fri Jun 9 11:14:16 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 11:14:16 -0700 Subject: [openib-general] Re: [PATCH 1/2] multicast: notify users on membership errors In-Reply-To: <20060609050910.GD10263@mellanox.co.il> References: <20060609050910.GD10263@mellanox.co.il> Message-ID: <4489BA78.4050700@ichips.intel.com> Michael S.
Tsirkin wrote: >>These should eliminate any races with ipoib leaving, >>then quickly re-joining a group as a result of an event. > > Is there a chance this will fix the crashes me and Or were seeing? It shouldn't. The race that I was referring to only involved whether or not a MAD is sent when a new join request is received by the ib_multicast module. I literally spent hours reviewing the ib_multicast code trying to identify how the crash you and Or saw could occur. My conclusion is that it's impossible. :) So, at this point, I'm still not sure how it happens. I'll continue to spend some time looking at this, in the hope that I can at least add in some assertions that might indicate when the problem starts. - Sean From mshefty at ichips.intel.com Fri Jun 9 11:34:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 11:34:48 -0700 Subject: [openib-general] [PATCH 2/2] ipoib: handle multicast group reset notification In-Reply-To: References: Message-ID: <4489BF48.8010804@ichips.intel.com> Sean Hefty wrote: > Ipoib already checks for events that require rejoining multicast groups. > We just need to add code to handle (i.e. ignore) multicast group reset > notifications. Roland, Any issue committing this? - Sean From ardavis at ichips.intel.com Fri Jun 9 11:38:07 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 09 Jun 2006 11:38:07 -0700 Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI? In-Reply-To: References: Message-ID: <4489C00F.2000902@ichips.intel.com> Scott Weitzenkamp (sweitzen) wrote: > While we're talking about MTUs, is the IB MTU tunable in uDAPL and/or > Intel MPI via env var or config file? > > Looks like Intel MPI 2.0.1 uses 2K for IB MTU like MVAPICH does in > OFED 1.0 rc4 and rc6, I'd like to try 1K with Intel MPI. > > Scott There is no mechanism for me to modify the MTU using rdma_cm so whatever is returned in the path record is what you get with the OpenIB-cma provider. 
However, you could use the OpenIB-scm provider, which is hard-coded for 1K MTU, as a comparison. Can you run with "-genv I_MPI_DAPL_PROVIDER OpenIB-scm" on your cluster? -arlin > > ------------------------------------------------------------------------ > *From:* openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] *On Behalf Of *Scott > Weitzenkamp (sweitzen) > *Sent:* Thursday, June 08, 2006 4:38 PM > *To:* Tziporet Koren; openfabrics-ewg at openib.org > *Cc:* openib-general > *Subject:* RE: [openib-general] OFED-1.0-rc6 is available > > The MTU change undoes the changes for bug 81, so I have reopened > bug 81 (http://openib.org/bugzilla/show_bug.cgi?id=81). > > With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E > osu_bibw performance is bad. I've enclosed some performance data, > look at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. > > Are there other benchmarks driving the changes in rc6 (and rc4)? > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > > > *OSU MPI:* > > · Added mpi_alltoall fine tuning parameters > > · Added default configuration/documentation file > $MPIHOME/etc/mvapich.conf > > · Added shell configuration files > $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh > > · Default MTU was changed back to 2K for InfiniHost III > Ex and InfiniHost III Lx HCAs.
For InfiniHost card recommended > value is: > VIADEV_DEFAULT_MTU=MTU1024 > >------------------------------------------------------------------------ > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tom at opengridcomputing.com Fri Jun 9 13:10:01 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 09 Jun 2006 15:10:01 -0500 Subject: [openib-general] [PATCH] rping: Erroneous check for minimum ping buffer size In-Reply-To: <44891FD4.30104@in.ibm.com> References: <20060609060138.GB13602@harry-potter.in.ibm.com> <44891FD4.30104@in.ibm.com> Message-ID: <1149883801.29808.52.camel@trinity.ogc.int> Well, it's almost a puzzle at this point. Just hard-coding 10 with a comment is probably easier to read. But ... for the curious, this will do what you want ... but may cause you to lose your breakfast. #define _stringify( _x ) # _x #define stringify( _x ) _stringify( _x ) Then printf("%s %d\n", stringify(INT_MAX), (int)(sizeof(stringify(INT_MAX)) - 1)) will get you... 2147483647 10 just like you "expected".
The double nested macro call is necessary to get cpp to substitute 2147483647 for INT_MAX, otherwise you get INT_MAX 7 Later, On Fri, 2006-06-09 at 12:44 +0530, Pradipta Kumar Banerjee wrote: > Pradipta Kumar Banerjee wrote: > > rping didn't check correctly for the minimum size of the ping > > buffer resulting in the following error from glibc > > > > "*** glibc detected *** free(): invalid next size (fast)" > > > > Signed-off-by: Pradipta Kumar Banerjee > > --- > > > > Index: rping.c > > ============================================================= > > --- rping.org 2006-06-09 10:57:43.000000000 +0530 > > +++ rping.c 2006-06-09 11:00:28.000000000 +0530 > > @@ -96,6 +96,12 @@ struct rping_rdma_info { > > #define RPING_BUFSIZE 64*1024 > > #define RPING_SQ_DEPTH 16 > > > > +/* Default string for print data and > > + * minimum buffer size > > + */ > > +#define RPING_MSG_FMT "rdma-ping-%d: " > > +#define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX))+sizeof(RPING_MSG_FMT) > > + > Tom, > Just found that 'itoa' is not a built-in library function. The sizeof is > returning '4' which is not what we really want. Do we hard-code the value to 10 > ( like #define RPING_MIN_BUFSIZE 10 + sizeof(RPING_MSG_FMT) )? > INT_MAX is 2147483647 (10 - chars). Other options might include writing our own > 'itoa'. > > Thanks, > Pradipta Kumar. > > > /* > > * Control block struct. > > */ > > @@ -774,7 +780,7 @@ static void rping_test_client(struct rpi > > cb->state = RDMA_READ_ADV; > > > > /* Put some ascii text in the buffer.
*/ > > - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > > + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); > > for (i = cc, c = start; i < cb->size; i++) { > > cb->start_buf[i] = c; > > c++; > > @@ -977,11 +983,11 @@ int main(int argc, char *argv[]) > > break; > > case 'S': > > cb->size = atoi(optarg); > > - if ((cb->size < 1) || > > + if ((cb->size < RPING_MIN_BUFSIZE) || > > (cb->size > (RPING_BUFSIZE - 1))) { > > fprintf(stderr, "Invalid size %d " > > - "(valid range is 1 to %d)\n", > > - cb->size, RPING_BUFSIZE); > > + "(valid range is %d to %d)\n", > > + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); > > ret = EINVAL; > > } else > > DEBUG_LOG("size %d\n", (int) atoi(optarg)); > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From halr at voltaire.com Fri Jun 9 13:12:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 16:12:28 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <4489A5ED.8060200@ichips.intel.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> Message-ID: <1149883948.5093.12572.camel@hal.voltaire.com> On Fri, 2006-06-09 at 12:46, Sean Hefty wrote: > Hal Rosenstock wrote: > > Note the MGRPs are MGIDs and switches are programmed with MLIDs and > > these can be 1:1 or many:1 depending on the implementation. Most do not > > do the many:1 but this is allowed by the spec. Also, note that switches > > know nothing about the groups themselves (only MLIDs and which ports) so > > most of the information is in the SM. > > Is there any chance that someone using an "old" join can receive data on a group > that was created after an SM restart? 
I think so. One can also view this as another aspect of lazy deletion. Actually the deletion can be so slow as to never occur. > My guess is that the QP would discard the > message unless both the MLID and MGIDs matched, That would be my guess too but I'm not sure. > so there's probably not a real issue here. > > How ? Not all the group information is in the switches. > > It's likely that the end nodes have the mcmember records from previous joins. > Isn't that along with the switch information enough to reconstruct the group > information? No. The MCMemberRecord joins don't match the MGIDs to the MLIDs. You would need more info than that although it is available. The other issue is whether you trust the state of the network or not when the SM comes up. That's sometimes a dangerous proposition. -- Hal > - Sean From robert.j.woodruff at intel.com Fri Jun 9 13:21:23 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 9 Jun 2006 13:21:23 -0700 Subject: [openib-general] RE: [openfabrics-ewg] OFED-1.0-rc6 is available Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007ED438C@orsmsx408> Is there any plan to release an RC6 package (or an RC7) that has a Pathscale driver that compiles on RHEL4 - U3 that we can test before the release ? woody ________________________________ From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Tziporet Koren Sent: Wednesday, June 07, 2006 7:59 AM To: Tziporet Koren; openfabrics-ewg at openib.org Cc: openib-general Subject: [openfabrics-ewg] OFED-1.0-rc6 is available Hi All, We have prepared OFED 1.0 RC6. Release location: https://openib.org/svn/gen2/branches/1.0/ofed/releases File: OFED-1.0-rc6.tgz Note: This release is the code freeze release for OFED 1.0. Only showstopper bugs will be fixed. 
BUILD_ID: OFED-1.0-rc6 openib-1.0 (REV=7772) # User space https://openib.org/svn/gen2/branches/1.0/src/userspace # Kernel space https://openib.org/svn/gen2/branches/1.0/ofed/tags/rc6/linux-kernel Git: ref: refs/heads/for-2.6.17 commit d9ec5ad24ce80b7ef69a0717363db661d13aada5 # MPI mpi_osu-0.9.7-mlx2.1.0.tgz openmpi-1.1b1-1.src.rpm mpitests-1.0-0.src.rpm OSes: * RH EL4 up2: 2.6.9-22.ELsmp * RH EL4 up3: 2.6.9-34.ELsmp * Fedora C4: 2.6.11-1.1369_FC4 * SLES10 RC2: 2.6.16.16-1.6-smp * SUSE 10 Pro: 2.6.13-15-smp * kernel.org: 2.6.16.x Systems: * x86_64 * x86 * ia64 * ppc64 Main changes from RC5: 1. SDP - libsdp implementation of RFC proposed by Eitan Zahavi; bug fixes in kernel module. See details below. 2. SRP - bug fixes 3. Open MPI - new package based on 1.1b1-1 4. OSU-MPI - See details below. 5. iSER: Enhanced to support SLES 10 RC1. 6. IPoIB default configuration changed: a. IPoIB configuration at install time is now optional. b. The default configuration of IPoIB interfaces (if performed at install time) is DHCP; it can be changed during interactive installation. c. For unattended installation one can give a new configuration file. See the example below. 7. Bug Fixes. Package limitations: 1. The ipath driver does not compile/load on most systems. To be fixed in final release. Meanwhile, one must work with custom build and not choose ipath driver, or change in the conf file: ib_ipath=n. I attached a reference ofed-no_ipath.conf file. Once Qlogic fixes the backport patches I will publish them on the release page so any one interested can use them with this release. 2. 
iSER is working on SuSE SLES 10 RC1 only IPoIB configuration file example: If you are going to install OFED on a 32 node cluster and want to use static IPoIB configuration based on Ethernet device configuration follow instructions below: Assume that the Ethernet IP addresses (eth0 interfaces) of the cluster are: 10.0.0.1 - 10.0.0.32 and you want to assign to ib0 IP addresses in the range: 192.168.0.1 - 192.168.0.32 and to ib1 IP addresses in the range: 172.16.0.1 - 172.16.0.32 Then create the file ofed_net.conf with the following lines: LAN_INTERFACE_ib0=eth0 IPADDR_ib0=192.168.'*'.'*' NETMASK_ib0=255.255.0.0 NETWORK_ib0=192.168.0.0 BROADCAST_ib0=192.168.255.255 ONBOOT_ib0=1 LAN_INTERFACE_ib1=eth0 IPADDR_ib1=172.16.'*'.'*' NETMASK_ib1=255.255.0.0 NETWORK_ib1=172.16.0.0 BROADCAST_ib1=172.16.255.255 ONBOOT_ib1=1 Note: '*' will be replaced by the corresponding octet from the eth0 IP address. Assuming that you already have an OFED configuration file (ofed.conf) with selected packages (created by running OFED-1.0/install.sh) Run: ./install.sh -c ofed.conf -net ofed_net.conf OSU MPI: * Added mpi_alltoall fine tuning parameters * Added default configuration/documentation file $MPIHOME/etc/mvapich.conf * Added shell configuration files $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh * Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For InfiniHost card recommended value is: VIADEV_DEFAULT_MTU=MTU1024 SDP Details: libsdp enhancements according to the RFC: 1. New config syntax (please see libsdp.conf) 2. With no config or empty config use SIMPLE_LIBSDP mode 3. Support listening on both tcp and sdp 4. Support trying both connections (first SDP then TCP) 5. Support IPv4 embedded in IPv6 (also convert back address) 6. Comprehensive verbosity logging 7. BNF based config parser Current SDP limitations: * SDP currently does not support sending/receiving out of band data (MSG_OOB). * Generally, SDP supports only SOL_SOCKET socket options.
* The following options can be set but actual support is missing: o SO_KEEPALIVE - no keepalives are sent o SO_OOBINLINE - out of band data is not supported o SDP currently supports setting the following SOL_TCP socket options: o TCP_NODELAY, TCP_CORK - but actual support for these options is still missing * SDP currently does not handle Zcopy mode messages correctly and does not set MaxAdverts properly in HH/HAH messages. OFED components tested by Mellanox: * Verbs over mthca * IPoIB * OpenSM * OSU-MPI * SRP * SDP * IB administration utils (ibutils) Please send us any issues you encounter and/or test results. Thanks Tziporet & Vlad Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Fri Jun 9 13:35:55 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 13:35:55 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149883948.5093.12572.camel@hal.voltaire.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> <1149883948.5093.12572.camel@hal.voltaire.com> Message-ID: <4489DBAB.3080707@ichips.intel.com> Hal Rosenstock wrote: > The other issue is whether you trust the state of the network or not > when the SM comes up. That's sometimes a dangerous proposition. I considered this, but I think there's a difference between trusting one of the systems on the network, versus the network as a whole. For example, as long as the MCMember records from the end nodes mesh with MulticastForwarding tables on the switches, then we may be okay. Also, the MCMember records carry both the MGID and MLID, so what more would you need? 
- Sean

From bugzilla-daemon at openib.org Fri Jun 9 13:45:20 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Fri, 9 Jun 2006 13:45:20 -0700 (PDT)
Subject: [openib-general] [Bug 126] New: RDMA_CM and UCM not loaded on boot
Message-ID: <20060609204520.6FD342283FD@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=126

Summary: RDMA_CM and UCM not loaded on boot
Product: OpenFabrics Linux
Version: 1.0rc6
Platform: Other
OS/Version: Other
Status: NEW
Severity: normal
Priority: P2
Component: RDMA CM
AssignedTo: bugzilla at openib.org
ReportedBy: robert.j.woodruff at intel.com

The RDMA_CM and UCM are not being loaded automatically when the system boots (RHEL4-U3). This causes uDAPL and Intel MPI to fail.

woody

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at openib.org Fri Jun 9 13:55:14 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Fri, 9 Jun 2006 13:55:14 -0700 (PDT)
Subject: [openib-general] [Bug 122] mad layer problem
Message-ID: <20060609205514.AD40A228766@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=122

sean.hefty at intel.com changed:

           What    |Removed    |Added
----------------------------------------------------------------------------
           Status  |NEW        |ASSIGNED

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
From halr at voltaire.com Fri Jun 9 13:53:46 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jun 2006 16:53:46 -0400
Subject: [openib-general] Failed multicast join with new multicast module
In-Reply-To: <4489DBAB.3080707@ichips.intel.com>
References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> <1149883948.5093.12572.camel@hal.voltaire.com> <4489DBAB.3080707@ichips.intel.com>
Message-ID: <1149886425.5093.13966.camel@hal.voltaire.com>

On Fri, 2006-06-09 at 16:35, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > The other issue is whether you trust the state of the network or not
> > when the SM comes up. That's sometimes a dangerous proposition.
>
> I considered this, but I think there's a difference between trusting one of the
> systems on the network, versus the network as a whole. For example, as long as
> the MCMember records from the end nodes mesh with MulticastForwarding tables on
> the switches, then we may be okay.

What does "mesh" mean in this instance? How do you know the multicast routing tables are indeed valid and that the SM didn't corrupt them? (Why did the SM need restarting?)

> Also, the MCMember records carry both the MGID and MLID, so what more would you
> need?

The MLID is supplied by the SA in response to a group request from the end node, not the other way around. The end node doesn't tell the SA what MLID to use for a group.
-- Hal

> - Sean

From mshefty at ichips.intel.com Fri Jun 9 14:18:58 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 09 Jun 2006 14:18:58 -0700
Subject: [openib-general] Failed multicast join with new multicast module
In-Reply-To: <1149886425.5093.13966.camel@hal.voltaire.com>
References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> <1149883948.5093.12572.camel@hal.voltaire.com> <4489DBAB.3080707@ichips.intel.com> <1149886425.5093.13966.camel@hal.voltaire.com>
Message-ID: <4489E5C2.3090905@ichips.intel.com>

Hal Rosenstock wrote:
> What does mesh mean in this instance ? How do you know the multicast
> routing tables are indeed valid and that the SM didn't corrupt them ?
> (Why did the SM need restarting ?)

I meant that the values agree with each other, and there are no conflicts.

> The MLID is supplied by the SA in response to a group request from the
> end node, not the other way around. The end node doesn't tell the SA
> what MLID to use for a group.

One of the ideas is for the end nodes to provide this data, even if that means extending the architecture. The problem is that the SA lost its state, but the network is working fine. The end nodes know which groups they have joined and the mapping of MGIDs to MLIDs. And the switches are already programmed correctly.

Even if we have the ability for an SM to transparently fail over to another SM, because of the architecture the end nodes are being forced to assume that all multicast group information has been lost.

How about this? What if the end nodes only re-joined their groups on LID_CHANGE or CLIENT_REREGISTER events? That is, an SM_CHANGE would not result in clients needing to rejoin any groups. This puts the burden on the SM to generate a CLIENT_REREGISTER event only if it's needed. SMs that can fail over and maintain multicast state in the process would be able to do so.
- Sean

From sean.hefty at intel.com Fri Jun 9 14:19:24 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 9 Jun 2006 14:19:24 -0700
Subject: [openib-general] [PATCH 0/5] multicast abstraction
Message-ID: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com>

This patch series enhances support for joining and leaving multicast groups, providing the following functionality:

1. Users identify a multicast group by a multicast IP address.
2. A user binds to a local RDMA device based on resolving the IP address.
3. A new multicast group is created. The parameters for the multicast group are obtained based on the ipoib broadcast group, and the MGID is derived using the same algorithm as ipoib, except with a different signature.
4. Any QP associated with the join is attached to the group once the join operation completes.
5. A QP may join multiple groups.

Signed-off-by: Sean Hefty

From sean.hefty at intel.com Fri Jun 9 14:40:45 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 9 Jun 2006 14:40:45 -0700
Subject: [openib-general] [PATCH 1/5] ib_addr: retrieve MGID from device address
In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com>
Message-ID: <000e01c68c0d$5d31b500$ff0da8c0@amr.corp.intel.com>

Extract the MGID used by ipoib for broadcast traffic from the device address.

Signed-off-by: Sean Hefty
---
This will be used to get the MCMemberRecord for the ipoib broadcast group.
--- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_addr.h 2006-05-25 11:18:47.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_addr.h 2006-06-06 16:14:11.000000000 -0700 @@ -89,6 +89,11 @@ static inline void ib_addr_set_pkey(stru dev_addr->broadcast[9] = (unsigned char) pkey; } +static inline union ib_gid *ib_addr_get_mgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->broadcast + 4); +} + static inline union ib_gid *ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) { return (union ib_gid *) (dev_addr->src_dev_addr + 4); From sean.hefty at intel.com Fri Jun 9 14:46:56 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 9 Jun 2006 14:46:56 -0700 Subject: [openib-general] [PATCH 2/5] multicast: allow retrieving an MCMemberRecord based on MGID In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <000f01c68c0e$3acafc50$ff0da8c0@amr.corp.intel.com> Add an API to allow retrieving an MCMemberRecord from the local cache based on an MGID. Signed-off-by: Sean Hefty --- This allows an existing MCMemberRecord to be used as a template for creating other multicast groups. --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_multicast.h 2006-05-25 11:18:47.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_multicast.h 2006-05-23 14:58:06.000000000 -0700 @@ -82,4 +82,16 @@ struct ib_multicast *ib_join_multicast(s */ void ib_free_multicast(struct ib_multicast *multicast); +/** + * ib_get_mcmember_rec - Looks up a multicast member record by its MGID and + * returns it if found. + * @device: Device associated with the multicast group. + * @port_num: Port on the specified device to associate with the multicast + * group. + * @mgid: MGID of multicast group. + * @rec: Location to copy SA multicast member record. 
+ */ +int ib_get_mcmember_rec(struct ib_device *device, u8 port_num, + union ib_gid *mgid, struct ib_sa_mcmember_rec *rec); + #endif /* IB_MULTICAST_H */ --- svn3/gen2/trunk/src/linux-kernel/infiniband/core/multicast.c 2006-06-08 21:53:21.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/core/multicast.c 2006-06-08 17:14:01.000000000 -0700 @@ -612,6 +612,29 @@ void ib_free_multicast(struct ib_multica } EXPORT_SYMBOL(ib_free_multicast); +int ib_get_mcmember_rec(struct ib_device *device, u8 port_num, + union ib_gid *mgid, struct ib_sa_mcmember_rec *rec) +{ + struct mcast_device *dev; + struct mcast_port *port; + struct mcast_group *group; + unsigned long flags; + + dev = ib_get_client_data(device, &mcast_client); + if (!dev) + return -ENODEV; + + port = &dev->port[port_num - dev->start_port]; + spin_lock_irqsave(&port->lock, flags); + group = mcast_find(port, mgid); + if (group) + *rec = group->rec; + spin_unlock_irqrestore(&port->lock, flags); + + return group ? 0 : -EADDRNOTAVAIL; +} +EXPORT_SYMBOL(ib_get_mcmember_rec); + static void mcast_groups_lost(struct mcast_port *port) { struct mcast_group *group; From sean.hefty at intel.com Fri Jun 9 14:49:28 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 9 Jun 2006 14:49:28 -0700 Subject: [openib-general] [PATCH 3/5] sa_query: add call to initialize ah_attr from an mcmember record In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <001001c68c0e$950f3370$ff0da8c0@amr.corp.intel.com> Export a call to initialize an ib_ah_attr structure based on an MCMemberRecord returned from a multicast join request. 
Signed-off-by: Sean Hefty --- --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_sa.h 2006-06-06 15:21:05.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_sa.h 2006-06-06 15:56:37.000000000 -0700 @@ -380,6 +380,14 @@ int ib_init_ah_from_path(struct ib_devic struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); + /** + * ib_init_ah_from_mcmember - Initialize address handle attributes based on an + * SA mcmember record. + */ +int ib_init_ah_from_mcmember(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + struct ib_ah_attr *ah_attr); + /** * ib_sa_pack_attr - Copy an SA attribute from a host defined structure to * a network packed structure. --- svn3/gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c 2006-06-06 15:21:05.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c 2006-06-06 15:57:21.000000000 -0700 @@ -471,6 +471,36 @@ int ib_init_ah_from_path(struct ib_devic } EXPORT_SYMBOL(ib_init_ah_from_path); +int ib_init_ah_from_mcmember(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + struct ib_ah_attr *ah_attr) +{ + int ret; + u16 gid_index; + u8 p; + + ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index); + if (ret) + return ret; + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = be16_to_cpu(rec->mlid); + ah_attr->sl = rec->sl; + ah_attr->port_num = port_num; + ah_attr->static_rate = rec->rate; + + ah_attr->ah_flags = IB_AH_GRH; + ah_attr->grh.dgid = rec->mgid; + + ah_attr->grh.sgid_index = (u8) gid_index; + ah_attr->grh.flow_label = be32_to_cpu(rec->flow_label); + ah_attr->grh.hop_limit = rec->hop_limit; + ah_attr->grh.traffic_class = rec->traffic_class; + + return 0; +} +EXPORT_SYMBOL(ib_init_ah_from_mcmember); + int ib_sa_pack_attr(void *dst, void *src, int attr_id) { switch (attr_id) { From ardavis at ichips.intel.com Fri Jun 9 14:54:36 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 09 Jun 2006 14:54:36 
-0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: <200606071639.03787.jackm@mellanox.co.il> <200606080942.48767.jackm@mellanox.co.il> Message-ID: <4489EE1C.1090200@ichips.intel.com> James Lentini wrote: >On Thu, 8 Jun 2006, Jack Morgenstein wrote: > > > >>On Wednesday 07 June 2006 18:26, James Lentini wrote: >> >> >>>On Wed, 7 Jun 2006, Jack Morgenstein wrote: >>> >>> >>>>This (bug fix) can still be included in next-week's release, if you >>>>think it is important (I have extracted it from the changes checked >>>>in at svn 7755) >>>> >>>> >>>If you are going to make another release anyway, then I would included >>>it. >>> >>> >>Do you mean -- include the fix in next week's release -- or -- wait >>with the fix for the following release? >> >> > >I'd include the fix in the next release, but I wouldn't create a >special release just for this fix. > > So are we getting this in next weeks release or not? I think we need it. 
>_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From halr at voltaire.com Fri Jun 9 15:01:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 18:01:15 -0400 Subject: [openib-general] Failed multicast join withnew multicast module In-Reply-To: <4489E5C2.3090905@ichips.intel.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> <1149883948.5093.12572.camel@hal.voltaire.com> <4489DBAB.3080707@ichips.intel.com> <1149886425.5093.13966.camel@hal.voltaire.com> <4489E5C2.3090905@ichips.intel.com> Message-ID: <1149890474.5093.16250.camel@hal.voltaire.com> On Fri, 2006-06-09 at 17:18, Sean Hefty wrote: > Hal Rosenstock wrote: > > What does mesh mean in this instance ? How do you know the multicast > > routing tables are indeed valid and that the SM didn't corrupt them ? > > (Why did the SM need restarting ?) > > I meant that the values agree with each other, and there are no conflicts. How are conflicts determined ? The SA has no way of querying the end nodes for their multicast information; it currently is the other way around. > > The MLID is supplied by the SA in response to a group request from the > > end node, not the other way around. The end node doesn't tell the SA > > what MLID to use for a group. > > One of the ideas is for the end nodes to provide this data, even if that means > extending the architecture. OK. What if the SM already put the MLID to use for something else ? > The problem is that the SA lost its state, but the network is working fine. How does the SM know that the network is working fine ? > The end nodes know which groups they have joined and the mapping of MGIDs to MLIDs. 
> And the switches are already programmed correctly. I'm not sure what constitutes a correctness criterion here. > Even if we have the ability for an SM to transparently fail over to another SM, > because of the architecture, the end nodes are being forced to assume that all > multicast group information has been lost. In the case of an SM which replicated its database, it would replicate the registrations which include multicast so this reregistration shouldn't be necessary. But I don't know of a way that the end node knows whether the SM is doing this database replication. > How about this? What if the end nodes only re-joined their groups on LID_CHANGE > or CLIENT_REREGISTER events? That is, an SM_CHANGE would not result in clients > needing to rejoin any groups. This puts the burden on the SM to generate a > CLIENT_REREGISTER event only if it's needed. SMs that can fail over and > maintain multicast state in the process would be able to do so. I think more than this is needed. -- Hal > - Sean From jlentini at netapp.com Fri Jun 9 15:11:34 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 9 Jun 2006 18:11:34 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: <4489EE1C.1090200@ichips.intel.com> References: <200606071639.03787.jackm@mellanox.co.il> <200606080942.48767.jackm@mellanox.co.il> <4489EE1C.1090200@ichips.intel.com> Message-ID: On Fri, 9 Jun 2006, Arlin Davis wrote: > James Lentini wrote: > > > On Thu, 8 Jun 2006, Jack Morgenstein wrote: > > > > > > > On Wednesday 07 June 2006 18:26, James Lentini wrote: > > > > > > > On Wed, 7 Jun 2006, Jack Morgenstein wrote: > > > > > > > > > This (bug fix) can still be included in next-week's release, if you > > > > > think it is important (I have extracted it from the changes checked > > > > > in at svn 7755) > > > > > > > > > If you are going to make another release anyway, then I would included > > > > it. 
> > > > > > > Do you mean -- include the fix in next week's release -- or -- wait with > > > the fix for the following release? > > > > > > > I'd include the fix in the next release, but I wouldn't create a special > > release just for this fix. > > > So are we getting this in next weeks release or not? I think we need it. Tziporet, Will this be in this fix be in the next OFED release? From sean.hefty at intel.com Fri Jun 9 15:15:18 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 9 Jun 2006 15:15:18 -0700 Subject: [openib-general] [PATCH 4/5] rdma cm: add support to join / leave multicast groups In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <001101c68c12$31021d80$ff0da8c0@amr.corp.intel.com> Add IB multicast abstraction to the CMA. Signed-off-by: Sean Hefty --- --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/rdma_cm.h 2006-06-06 16:53:56.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/rdma_cm.h 2006-06-02 10:22:29.000000000 -0700 @@ -52,6 +52,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_ESTABLISHED, RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, + RDMA_CM_EVENT_MULTICAST_JOIN, + RDMA_CM_EVENT_MULTICAST_ERROR }; enum rdma_port_space { @@ -77,6 +79,13 @@ struct rdma_route { int num_paths; }; +struct rdma_multicast_data { + void *context; + struct sockaddr addr; + u8 pad[sizeof(struct sockaddr_in6) - + sizeof(struct sockaddr)]; +}; + struct rdma_cm_event { enum rdma_cm_event_type event; int status; @@ -258,5 +267,20 @@ int rdma_reject(struct rdma_cm_id *id, c */ int rdma_disconnect(struct rdma_cm_id *id); -#endif /* RDMA_CM_H */ +/** + * rdma_join_multicast - Join the multicast group specified by the given + * address. + * @id: Communication identifier associated with the request. + * @addr: Multicast address identifying the group to join. + * @context: User-defined context associated with the join request. 
+ */ +int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, + void *context); +/** + * rdma_leave_multicast - Leave the multicast group specified by the given + * address. + */ +void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr); + +#endif /* RDMA_CM_H */ --- svn3/gen2/trunk/src/linux-kernel/infiniband/core/cma.c 2006-06-06 19:30:12.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/core/cma.c 2006-06-06 16:12:42.000000000 -0700 @@ -43,6 +43,7 @@ #include #include #include +#include MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); @@ -111,6 +112,7 @@ struct rdma_id_private { struct list_head list; struct list_head listen_list; struct cma_device *cma_dev; + struct list_head mc_list; enum cma_state state; spinlock_t lock; @@ -137,6 +139,15 @@ struct rdma_id_private { u8 srq; }; +struct cma_multicast { + struct rdma_id_private *id_priv; + union { + struct ib_multicast *ib; + } multicast; + struct list_head list; + struct rdma_multicast_data data; +}; + struct cma_work { struct work_struct work; struct rdma_id_private *id; @@ -328,6 +339,7 @@ struct rdma_cm_id* rdma_create_id(rdma_c init_waitqueue_head(&id_priv->wait_remove); atomic_set(&id_priv->dev_remove, 0); INIT_LIST_HEAD(&id_priv->listen_list); + INIT_LIST_HEAD(&id_priv->mc_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); return &id_priv->id; @@ -474,6 +486,32 @@ int rdma_init_qp_attr(struct rdma_cm_id } EXPORT_SYMBOL(rdma_init_qp_attr); +static int cma_get_ib_mc_attr(struct rdma_id_private *id_priv, + struct sockaddr *addr, + struct ib_ah_attr *ah_attr, uint32_t *remote_qpn, + uint32_t *remote_qkey) +{ + struct cma_multicast *mc; + unsigned long flags; + int ret = -EADDRNOTAVAIL; + + spin_lock_irqsave(&id_priv->lock, flags); + list_for_each_entry(mc, &id_priv->mc_list, list) { + if (!memcmp(&mc->data.addr, addr, ip_addr_size(addr))) { + ib_init_ah_from_mcmember(id_priv->id.device, + id_priv->id.port_num, + 
&mc->multicast.ib->rec, + ah_attr); + *remote_qpn = 0xFFFFFF; + *remote_qkey = be32_to_cpu(mc->multicast.ib->rec.qkey); + ret = 0; + break; + } + } + spin_unlock_irqrestore(&id_priv->lock, flags); + return ret; +} + int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, struct ib_ah_attr *ah_attr, u32 *remote_qpn, u32 *remote_qkey) @@ -484,7 +522,10 @@ int rdma_get_dst_attr(struct rdma_cm_id id_priv = container_of(id, struct rdma_id_private, id); switch (rdma_node_get_transport(id_priv->id.device->node_type)) { case RDMA_TRANSPORT_IB: - if (!memcmp(&id->route.addr.dst_addr, addr, ip_addr_size(addr))) + ret = cma_get_ib_mc_attr(id_priv, addr, ah_attr, + remote_qpn, remote_qkey); + if (ret && id_priv->cm_id.ib && + !memcmp(&id->route.addr.dst_addr, addr, ip_addr_size(addr))) ret = ib_cm_get_dst_attr(id_priv->cm_id.ib, ah_attr, remote_qpn, remote_qkey); break; @@ -718,6 +759,19 @@ static void cma_release_port(struct rdma mutex_unlock(&lock); } +static void cma_leave_mc_groups(struct rdma_id_private *id_priv) +{ + struct cma_multicast *mc; + + while (!list_empty(&id_priv->mc_list)) { + mc = container_of(id_priv->mc_list.next, + struct cma_multicast, list); + list_del(&mc->list); + ib_free_multicast(mc->multicast.ib); + kfree(mc); + } +} + void rdma_destroy_id(struct rdma_cm_id *id) { struct rdma_id_private *id_priv; @@ -736,6 +790,7 @@ void rdma_destroy_id(struct rdma_cm_id * default: break; } + cma_leave_mc_groups(id_priv); mutex_lock(&lock); cma_detach_from_dev(id_priv); mutex_unlock(&lock); @@ -2053,6 +2108,150 @@ out: } EXPORT_SYMBOL(rdma_disconnect); +static int cma_ib_join_handler(int status, struct ib_multicast *multicast) +{ + struct rdma_id_private *id_priv; + struct cma_multicast *mc = multicast->context; + enum rdma_cm_event_type event; + int ret; + + id_priv = mc->id_priv; + atomic_inc(&id_priv->dev_remove); + if (!cma_comp(id_priv, CMA_ADDR_BOUND) && + !cma_comp(id_priv, CMA_ADDR_RESOLVED)) + goto out; + + if (!status && id_priv->id.qp) { + 
status = ib_attach_mcast(id_priv->id.qp, &multicast->rec.mgid, + multicast->rec.mlid); + } + + event = status ? RDMA_CM_EVENT_MULTICAST_ERROR : + RDMA_CM_EVENT_MULTICAST_JOIN; + + ret = cma_notify_user(id_priv, event, status, &mc->data, + sizeof mc->data); + if (ret) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return 0; + } +out: + cma_release_remove(id_priv); + return 0; +} + +static int cma_join_ib_multicast(struct rdma_id_private *id_priv, + struct cma_multicast *mc) +{ + struct ib_sa_mcmember_rec rec; + unsigned char mc_map[MAX_ADDR_LEN]; + struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; + struct sockaddr_in *sin = (struct sockaddr_in *) &mc->data.addr; + ib_sa_comp_mask comp_mask; + int ret; + + ret = ib_get_mcmember_rec(id_priv->id.device, id_priv->id.port_num, + ib_addr_get_mgid(dev_addr), &rec); + if (ret) + return ret; + + ip_ib_mc_map(sin->sin_addr.s_addr, mc_map); + mc_map[7] = 0x01; /* Use RDMA CM signature */ + mc_map[8] = ib_addr_get_pkey(dev_addr) >> 8; + mc_map[9] = (unsigned char) ib_addr_get_pkey(dev_addr); + + rec.mgid = *(union ib_gid *) (mc_map + 4); + rec.port_gid = *ib_addr_get_sgid(dev_addr); + rec.pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); + rec.join_state = 1; + rec.qkey = sin->sin_addr.s_addr; + + comp_mask = IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE | + IB_SA_MCMEMBER_REC_QKEY | IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + mc->multicast.ib = ib_join_multicast(id_priv->id.device, + id_priv->id.port_num, &rec, + comp_mask, GFP_KERNEL, + cma_ib_join_handler, mc); + if (IS_ERR(mc->multicast.ib)) + return PTR_ERR(mc->multicast.ib); + + return 0; +} + +int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, + void *context) +{ + struct rdma_id_private *id_priv; + struct cma_multicast *mc; + int ret; + + id_priv = 
container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_ADDR_BOUND) && + !cma_comp(id_priv, CMA_ADDR_RESOLVED)) + return -EINVAL; + + mc = kmalloc(sizeof *mc, GFP_KERNEL); + if (!mc) + return -ENOMEM; + + memcpy(&mc->data.addr, addr, ip_addr_size(addr)); + mc->data.context = context; + mc->id_priv = id_priv; + + spin_lock(&id_priv->lock); + list_add(&mc->list, &id_priv->mc_list); + spin_unlock(&id_priv->lock); + + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: + ret = cma_join_ib_multicast(id_priv, mc); + break; + default: + ret = -ENOSYS; + break; + } + + if (ret) { + spin_lock_irq(&id_priv->lock); + list_del(&mc->list); + spin_unlock_irq(&id_priv->lock); + kfree(mc); + } + return ret; +} +EXPORT_SYMBOL(rdma_join_multicast); + +void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) +{ + struct rdma_id_private *id_priv; + struct cma_multicast *mc; + + id_priv = container_of(id, struct rdma_id_private, id); + spin_lock_irq(&id_priv->lock); + list_for_each_entry(mc, &id_priv->mc_list, list) { + if (!memcmp(&mc->data.addr, addr, ip_addr_size(addr))) { + list_del(&mc->list); + spin_unlock_irq(&id_priv->lock); + + if (id->qp) + ib_detach_mcast(id->qp, + &mc->multicast.ib->rec.mgid, + mc->multicast.ib->rec.mlid); + ib_free_multicast(mc->multicast.ib); + kfree(mc); + return; + } + } + spin_unlock_irq(&id_priv->lock); +} +EXPORT_SYMBOL(rdma_leave_multicast); + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; From sean.hefty at intel.com Fri Jun 9 15:16:28 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 9 Jun 2006 15:16:28 -0700 Subject: [openib-general] [PATCH 5/5] ucma: export multicast suport to userspace In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <001201c68c12$5adc3a00$ff0da8c0@amr.corp.intel.com> Expose multicast abstraction through the CMA to userspace. 
Signed-off-by: Sean Hefty --- --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/rdma_user_cm.h 2006-06-06 16:53:46.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/rdma_user_cm.h 2006-06-06 12:22:57.000000000 -0700 @@ -58,6 +58,8 @@ enum { RDMA_USER_CM_CMD_GET_EVENT, RDMA_USER_CM_CMD_GET_OPTION, RDMA_USER_CM_CMD_SET_OPTION, + RDMA_USER_CM_CMD_JOIN_MCAST, + RDMA_USER_CM_CMD_LEAVE_MCAST, RDMA_USER_CM_CMD_GET_DST_ATTR }; @@ -174,6 +176,17 @@ struct rdma_ucm_init_qp_attr { __u32 qp_state; }; +struct rdma_ucm_join_mcast { + __u32 id; + struct sockaddr_in6 addr; + __u64 uid; +}; + +struct rdma_ucm_leave_mcast { + __u32 id; + struct sockaddr_in6 addr; +}; + struct rdma_ucm_dst_attr_resp { __u32 remote_qpn; __u32 remote_qkey; --- svn3/gen2/trunk/src/linux-kernel/infiniband/core/ucma.c 2006-06-06 16:56:53.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/core/ucma.c 2006-06-01 17:48:42.000000000 -0700 @@ -167,6 +167,21 @@ error: return NULL; } +static void ucma_copy_multicast_data(struct ucma_context *ctx, + struct ucma_event *uevent, + struct rdma_cm_event *event) +{ + struct rdma_multicast_data *mc_data = event->private_data; + struct rdma_ucm_join_mcast *umc_data; + + umc_data = (struct rdma_ucm_join_mcast *) uevent->resp.private_data; + + uevent->resp.private_data_len = sizeof *umc_data; + umc_data->id = ctx->id; + memcpy(&umc_data->addr, &mc_data->addr, ip_addr_size(&mc_data->addr)); + umc_data->uid = (unsigned long) mc_data->context; +} + static int ucma_event_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event *event) { @@ -184,9 +199,17 @@ static int ucma_event_handler(struct rdm uevent->resp.id = ctx->id; uevent->resp.event = event->event; uevent->resp.status = event->status; - if ((uevent->resp.private_data_len = event->private_data_len)) - memcpy(uevent->resp.private_data, event->private_data, - event->private_data_len); + switch (event->event) { + case RDMA_CM_EVENT_MULTICAST_JOIN: + case 
RDMA_CM_EVENT_MULTICAST_ERROR: + ucma_copy_multicast_data(ctx, uevent, event); + break; + default: + if ((uevent->resp.private_data_len = event->private_data_len)) + memcpy(uevent->resp.private_data, event->private_data, + event->private_data_len); + break; + } mutex_lock(&ctx->file->file_mutex); if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) { @@ -737,6 +760,45 @@ static ssize_t ucma_set_option(struct uc return ret; } +static ssize_t ucma_join_mcast(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_join_mcast cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_join_multicast(ctx->cm_id, (struct sockaddr *) &cmd.addr, + (void *) (unsigned long) cmd.uid); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_leave_mcast(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_leave_mcast cmd; + struct ucma_context *ctx; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + rdma_leave_multicast(ctx->cm_id, (struct sockaddr *) &cmd.addr); + ucma_put_ctx(ctx); + return 0; +} + static ssize_t ucma_get_dst_attr(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) @@ -789,6 +851,8 @@ static ssize_t (*ucma_cmd_table[])(struc [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, [RDMA_USER_CM_CMD_GET_OPTION] = ucma_get_option, [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option, + [RDMA_USER_CM_CMD_JOIN_MCAST] = ucma_join_mcast, + [RDMA_USER_CM_CMD_LEAVE_MCAST] = ucma_leave_mcast, [RDMA_USER_CM_CMD_GET_DST_ATTR] = ucma_get_dst_attr }; From mshefty at ichips.intel.com Fri Jun 9 15:20:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 15:20:45 -0700 Subject: [openib-general] [PATCH 0/5] 
multicast abstraction In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> References: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <4489F43D.60502@ichips.intel.com> Sean Hefty wrote: > This patch series enhances support for joining and leaving multicast groups, > providing the following functionality: > > 1. Users identify a multicast group by a multicast IP address. > 2. A user binds to a local RDMA device based on resolving the IP address. > 3. A new multicast group is created. The parameters for the multicast group are > obtained based on the ipoib broadcast group, and the MGID is derived using the > same algorithm as ipoib, except with a different signature. > 4. Any QP associated with the join is attached to the group once the join > operation completes. > 5. A QP may join multiple groups. I forgot to mention that this patch series is dependent on adding UD QP support to the RDMA CM. - Sean From arlin.r.davis at intel.com Fri Jun 9 15:37:34 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Fri, 9 Jun 2006 15:37:34 -0700 Subject: [openib-general] [PATCH] uDAPL openib_cma, cleanup reported CM error events, add TIMEOUT Message-ID: James, I cleaned up the connection error events to report the proper events during address resolution errors and timeouts. It was returning incorrect DAT event codes. 
-arlin Signed-off by: Arlin Davis Index: dapl_ib_cm.c =================================================================== --- dapl_ib_cm.c (revision 7839) +++ dapl_ib_cm.c (working copy) @@ -330,6 +330,8 @@ static void dapli_cm_active_cb(struct da switch (event->event) { case RDMA_CM_EVENT_UNREACHABLE: case RDMA_CM_EVENT_CONNECT_ERROR: + { + ib_cm_events_t cm_event; dapl_dbg_log( DAPL_DBG_TYPE_WARN, " dapli_cm_active_handler: CONN_ERR " @@ -337,10 +339,15 @@ static void dapli_cm_active_cb(struct da event->event, event->status, (event->status == -110)?"TIMEOUT":"" ); - dapl_evd_connection_callback(conn, - IB_CME_DESTINATION_UNREACHABLE, - NULL, conn->ep); + /* no device type specified so assume IB for now */ + if (event->status == -110) /* IB timeout */ + cm_event = IB_CME_TIMEOUT; + else + cm_event = IB_CME_DESTINATION_UNREACHABLE; + + dapl_evd_connection_callback(conn, cm_event, NULL, conn->ep); break; + } case RDMA_CM_EVENT_REJECTED: { ib_cm_events_t cm_event; @@ -357,7 +364,6 @@ static void dapli_cm_active_cb(struct da event->status); dapl_evd_connection_callback(conn, cm_event, NULL, conn->ep); - break; } case RDMA_CM_EVENT_ESTABLISHED: @@ -1028,7 +1034,7 @@ int dapls_ib_private_data_size(IN DAPL_P /* * Map all socket CM event codes to the DAT equivelent. 
*/ -#define DAPL_IB_EVENT_CNT 12 +#define DAPL_IB_EVENT_CNT 13 static struct ib_cm_event_map { @@ -1058,7 +1064,9 @@ static struct ib_cm_event_map /* 10 */ { IB_CME_LOCAL_FAILURE, DAT_CONNECTION_EVENT_BROKEN}, /* 11 */ { IB_CME_BROKEN, - DAT_CONNECTION_EVENT_BROKEN} + DAT_CONNECTION_EVENT_BROKEN}, + /* 12 */ { IB_CME_TIMEOUT, + DAT_CONNECTION_EVENT_TIMED_OUT}, }; /* @@ -1164,7 +1172,7 @@ void dapli_cma_event_cb(void) case RDMA_CM_EVENT_ADDR_ERROR: case RDMA_CM_EVENT_ROUTE_ERROR: dapl_evd_connection_callback(conn, - IB_CME_LOCAL_FAILURE, + IB_CME_DESTINATION_UNREACHABLE, NULL, conn->ep); break; case RDMA_CM_EVENT_DEVICE_REMOVAL: Index: dapl_ib_util.h =================================================================== --- dapl_ib_util.h (revision 7839) +++ dapl_ib_util.h (working copy) @@ -86,7 +86,8 @@ typedef enum { IB_CME_DESTINATION_UNREACHABLE, IB_CME_TOO_MANY_CONNECTION_REQUESTS, IB_CME_LOCAL_FAILURE, - IB_CME_BROKEN + IB_CME_BROKEN, + IB_CME_TIMEOUT } ib_cm_events_t; /* CQ notifications */ From betsy at pathscale.com Fri Jun 9 15:50:02 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Fri, 09 Jun 2006 15:50:02 -0700 Subject: [openib-general] [openfabrics-ewg] OFED-1.0-rc6 is available In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007ED438C@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007ED438C@orsmsx408> Message-ID: <1149893403.3034.55.camel@sarium.pathscale.com> Woody - The short answer is yes - Bryan has created patches in the subversion tree, which will install on top of what Tziporet pulled from Roland's tree. These will be in the 1.0 release (and, we will be testing an early version of that on Monday). We've tested the ipath driver code pretty thoroughly in-house. Bryan will send you a pointer to a tarball with these changes, so you can try them out today. He's planning to have those to you before 4:30. 
- Betsy On Fri, 2006-06-09 at 13:21 -0700, Woodruff, Robert J wrote: > Is there any plan to release an RC6 package (or an RC7) that has a > Pathscale driver that > compiles on RHEL4 - U3 that we can test before the release ? > > woody > > > > ______________________________________________________________________ > From: openfabrics-ewg-bounces at openib.org > [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Tziporet > Koren > Sent: Wednesday, June 07, 2006 7:59 AM > To: Tziporet Koren; openfabrics-ewg at openib.org > Cc: openib-general > Subject: [openfabrics-ewg] OFED-1.0-rc6 is available > > > > Hi All, > > > > We have prepared OFED 1.0 RC6. > > Release location: > https://openib.org/svn/gen2/branches/1.0/ofed/releases > > File: OFED-1.0-rc6.tgz > > > > Note: This release is the code freeze release for OFED 1.0. Only > showstopper bugs will be fixed. > > > > BUILD_ID: > > OFED-1.0-rc6 > > > > openib-1.0 (REV=7772) > > # User space > > https://openib.org/svn/gen2/branches/1.0/src/userspace > > # Kernel space > > https://openib.org/svn/gen2/branches/1.0/ofed/tags/rc6/linux-kernel > > Git: > > ref: refs/heads/for-2.6.17 > > commit d9ec5ad24ce80b7ef69a0717363db661d13aada5 > > > > # MPI > > mpi_osu-0.9.7-mlx2.1.0.tgz > > openmpi-1.1b1-1.src.rpm > > mpitests-1.0-0.src.rpm > > > > OSes: > > * RH EL4 up2: 2.6.9-22.ELsmp > > * RH EL4 up3: 2.6.9-34.ELsmp > > * Fedora C4: 2.6.11-1.1369_FC4 > > * SLES10 RC2: 2.6.16.16-1.6-smp > > * SUSE 10 Pro: 2.6.13-15-smp > > * kernel.org: 2.6.16.x > > > > Systems: > > * x86_64 > > * x86 > > * ia64 > > * ppc64 > > > > Main changes from RC5: > > 1. SDP – libsdp implementation of RFC proposed by Eitan Zahavi; > bug fixes in kernel module. See details below. > > 2. SRP – bug fixes > > 3. Open MPI – new package based on 1.1b1-1 > > 4. OSU-MPI – See details below. > > 5. iSER: Enhanced to support SLES 10 RC1. > > 6. IPoIB default configuration changed: > > a. IPoIB configuration at install time is now optional. > > b. 
The default configuration of IPoIB interfaces (if performed > at install time) is DHCP; it can be changed during interactive > installation. > > c. For unattended installation one can give a new configuration > file. See the example below. > > 7. Bug Fixes. > > > > > > Package limitations: > > 1. The ipath driver does not compile/load on most systems. To be > fixed in final release. > Meanwhile, one must work with custom build and not choose ipath > driver, or change in the conf file: ib_ipath=n. > I attached a reference ofed-no_ipath.conf file. > Once Qlogic fixes the backport patches I will publish them on the > release page so any one interested can use them with this release. > > 2. iSER is working on SuSE SLES 10 RC1 only > > > > > > IPoIB configuration file example: > > If you are going to install OFED on a 32 node cluster and want to use > static IPoIB configuration based on Ethernet device configuration > follow instructions below: > > > > Assume that the Ethernet IP addresses (eth0 interfaces) of the cluster > are: 10.0.0.1 - 10.0.0.32 > > and you want to assign to ib0 IP addresses in the range: 192.168.0.1 - > 192.168.0.32 > > and to ib1 IP addresses in the range: 172.16.0.1 - 172.16.0.32 > > > > Then create the file ofed_net.conf with the following lines: > > > > LAN_INTERFACE_ib0=eth0 > IPADDR_ib0=192.168.'*'.'*' > NETMASK_ib0=255.255.0.0 > NETWORK_ib0=192.168.0.0 > BROADCAST_ib0=192.168.255.255 > ONBOOT_ib0=1 > LAN_INTERFACE_ib1=eth0 > IPADDR_ib1=172.16.'*'.'*' > NETMASK_ib1=255.255.0.0 > NETWORK_ib1=172.16.0.0 > BROADCAST_ib1=172.16.255.255 > ONBOOT_ib1=1 > > > > Note: ‘*’ will be replaced by the corresponding octet from the eth0 IP > address. 
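The wildcard substitution the note describes can be illustrated with a short sketch. This models the stated rule (each `'*'` in `IPADDR_ib0` takes the matching octet of the eth0 address); `make_ib_addr` is a hypothetical helper, not the installer's actual code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fill the wildcard octets of a template like IPADDR_ib0=192.168.'*'.'*'
 * from the corresponding octets of the eth0 address.  Illustrative only. */
static void make_ib_addr(const char *eth_ip, const char *prefix,
			 char *out, size_t len)
{
	unsigned a, b, c, d;

	if (sscanf(eth_ip, "%u.%u.%u.%u", &a, &b, &c, &d) == 4)
		snprintf(out, len, "%s.%u.%u", prefix, c, d);
	else
		out[0] = '\0';	/* not a dotted-quad address */
}
```

With the example ranges above, a node whose eth0 is 10.0.0.7 would get ib0 address 192.168.0.7.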
> > > > Assuming that you already have OFED configuration file (ofed.conf) > with selected packages (created by running OFED-1.0/install.sh) > > Run: ./install.sh -c ofed.conf -net ofed_net.conf > > > > > > > > OSU MPI: > > · Added mpi_alltoall fine tuning parameters > > · Added default configuration/documentation file > $MPIHOME/etc/mvapich.conf > > · Added shell configuration files $MPIHOME/etc/mvapich.csh , > $MPIHOME/etc/mvapich.csh > > · Default MTU was changed back to 2K for InfiniHost III Ex and > InfiniHost III Lx HCAs. For InfiniHost card recommended value is: > VIADEV_DEFAULT_MTU=MTU1024 > > > > > > SDP Details: > > libsdp enhancements according to the RFC: > > 1. New config syntax (please see libsdp.conf) > 2. With no config or empty config use SIMPLE_LIBSDP mode > 3. Support listening on both tcp and sdp > 4. Support trying both connections (first SDP then TCP) > 5. Support IPv4 embedded in IPv6 (also convert back address) > 6. Comprehensive verbosity logging > 7. BNF based config parser > > > > Current SDP limitations: > > · SDP currently does not support sending/receiving out of band > data (MSG_OOB). > > · Generally, SDP supports only SOL_SOCKET socket options. > > · The following options can be set but actual support is > missing: > > o SO_KEEPALIVE - no keepalives are sent > > o SO_OOBINLINE - out of band data is not supported > > o SDP currently supports setting the following SOL_TCP socket > options: > > o TCP_NODELAY, TCP_CORK - but actual support for these options > is still missing > > · SDP currently does not handle Zcopy mode messages correctly > and does not set MaxAdverts properly in HH/HAH messages. > > > > > > OFED components tested by Mellanox: > > * Verbs over mthca > * IPoIB > * OpenSM > * OSU-MPI > * SRP > * SDP > * IB administration utils (ibutils) > > > > > > Please send us any issues you encounter and/or test results. 
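The SDP limitations above note that `TCP_NODELAY` and `TCP_CORK` can be *set* even though SDP does not yet honor them. Setting the option is the ordinary `setsockopt()` sequence, sketched here on a plain TCP socket (`IPPROTO_TCP` is the portable spelling of the Linux-specific `SOL_TCP` level; `set_nodelay` is an illustrative helper):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle coalescing on a stream socket.  On an SDP socket the
 * call succeeds per the notes above, but the transport ignores it. */
static int set_nodelay(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

An application written this way keeps working unchanged when `libsdp` redirects its sockets to SDP, which is exactly why SDP accepts-but-ignores these options rather than failing the call.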
> > > > Thanks > > Tziporet & Vlad > > > > > > Tziporet Koren > > Software Director > > Mellanox Technologies > > mailto: tziporet at mellanox.co.il > Tel +972-4-9097200, ext 380 > > > > From robert.j.woodruff at intel.com Fri Jun 9 16:13:32 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 9 Jun 2006 16:13:32 -0700 Subject: [openib-general] [openfabrics-ewg] OFED-1.0-rc6 is available Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007ED46E6@orsmsx408> Betsy wrote, >Woody - The short answer is yes - Bryan has created patches in >the subversion tree, which will install on top of what Tziporet >pulled from Roland's tree. These will be in the 1.0 release (and, >we will be testing an early version of that on Monday). We've >tested the ipath driver code pretty thoroughly in-house. Thanks, I probably cannot get to it until Monday, as I needed to pull the pathscale cards from my OFED test systems for now so I can test RC6 with the Mellanox DDR cards over the weekend, but I should be able to get back to this on Monday. Will there be an RC candidate tarball that has the patches included? woody From bos at pathscale.com Fri Jun 9 16:20:36 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 09 Jun 2006 16:20:36 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: <1149895236.27921.2.camel@pelerin.serpentine.com> Due to unfortunate timing, the ipath driver in OFED 1.0-rc6 does not work correctly. You can download an updated tarball from here, for which the ipath driver works fine: http://openib.red-bean.com/OFED-1.0-rc6+ipath.tar.bz2 Alternatively, pull the necessary patches from SVN. http://openib.org/bugzilla/show_bug.cgi?id=9 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-09 22:41 ------- Close old bug. 
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Fri Jun 9 22:41:23 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:41:23 -0700 (PDT) Subject: [openib-general] [Bug 10] [CHECKER] NULL deref in drivers/infiniband/core/ucm.c:ib_ucm_event_process Message-ID: <20060610054123.AEBDF2287AD@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=10 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-09 22:41 ------- Close INVALID bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Fri Jun 9 22:40:41 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:40:41 -0700 (PDT) Subject: [openib-general] [Bug 8] [CHECKER] Leak in drivers/infiniband/core/sysfs.c:alloc_group_attrs Message-ID: <20060610054041.13F752287AB@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=8 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #3 from sweitzen at cisco.com 2006-06-09 22:40 ------- Close old bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at openib.org Fri Jun 9 22:41:42 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:41:42 -0700 (PDT) Subject: [openib-general] [Bug 11] [CHECKER] Return value of idr_find not checked for NULL Message-ID: <20060610054142.B0B932287AE@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=11 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #3 from sweitzen at cisco.com 2006-06-09 22:41 ------- Close INVALID bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Fri Jun 9 22:42:05 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:42:05 -0700 (PDT) Subject: [openib-general] [Bug 12] [CHECKER] drivers/infiniband/ulp/ipoib/ipoib_main.c: confusion over NULL pointer Message-ID: <20060610054205.4FD842287AF@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=12 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-09 22:42 ------- Close INVALID bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at openib.org Fri Jun 9 22:44:46 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:44:46 -0700 (PDT) Subject: [openib-general] [Bug 17] [CHECKER] NULL deref in drivers/infiniband/ulp/srp/ib_srp.c Message-ID: <20060610054446.C0B3D2287AB@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=17 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-09 22:44 ------- Close INVALID bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eitan at mellanox.co.il Sat Jun 10 10:12:45 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 10 Jun 2006 20:12:45 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <1149771197.4510.323092.camel@hal.voltaire.com> References: <86fyiflwks.fsf@mtl066.yok.mtl.com> <1149771197.4510.323092.camel@hal.voltaire.com> Message-ID: <448AFD8D.3030809@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > >>Hi Hal >> >>I'm working on passing osmtest check. Found a bug in the new >>GUIDInfoRecord query: If you had a physical port with zero guid_cap >>the code would loop on blocks 0..255 instead of trying the next port. > > > OK; that's definitely a problem. > > >>I am still looking for why we might have a guid_cap == 0 on some >>ports. > > > PortInfo:GuidCap is not used for switch external ports. > > >>This patch resolves this new problem. osmtest passes on some arbitrary >>networks. 
>> >>Eitan >> >>Signed-off-by: Eitan Zahavi >> >>Index: opensm/osm_sa_guidinfo_record.c >>=================================================================== >>--- opensm/osm_sa_guidinfo_record.c (revision 7703) >>+++ opensm/osm_sa_guidinfo_record.c (working copy) >>@@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( >> continue; >> >> p_pi = osm_physp_get_port_info_ptr( p_physp ); >>+ >>+ if ( p_pi->guid_cap == 0 ) >>+ continue; >>+ > > > I think the right fix is to detect switch external ports and use the > VLCap from port 0 rather than from the switch external port (unless that > concept is broken in which case it should return 0 records). I think switch external ports do not have any PortGUID assigned to them since they are not "end port" (i.e. addressable). So I think this patch is good enough. What if a port reports guid_cap == 0? (I understand it is illegal for addressable port but for the SM it is probably better not to assume all ports are legal...) EZ > > -- Hal > > >> num_blocks = p_pi->guid_cap / 8; >> if ( p_pi->guid_cap % 8 ) >> num_blocks++; >> From bpradip at in.ibm.com Sat Jun 10 10:34:27 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Sat, 10 Jun 2006 23:04:27 +0530 Subject: [openib-general] [PATCH] rping: Erroneous check for minumum ping buffer size Message-ID: <20060610173417.GA14280@harry-potter.ibm.com> This includes the changes suggested by Tom. 
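Pradipta's patch below sizes the minimum buffer with a two-level stringize macro. A standalone sketch of that idiom — `ANSWER` is an illustrative macro added here, while `_stringify`/`stringify` and the format string mirror the patch:

```c
#include <limits.h>
#include <string.h>

/* Two-level stringize, as in the rping patch: the outer macro expands
 * its argument first, so the inner `#` sees the expansion (e.g. the
 * numeric text of INT_MAX) rather than the macro name itself. */
#define _stringify(_x) #_x
#define stringify(_x)  _stringify(_x)

#define ANSWER 42			/* illustrative, not from the patch */
#define MSG_FMT "rdma-ping-%d: "

/* Minimum buffer: the format string plus the longest decimal an int
 * can print, both measured with sizeof on string literals. */
#define MIN_BUFSIZE (sizeof(stringify(INT_MAX)) + sizeof(MSG_FMT))
```

Without the indirection, `_stringify(INT_MAX)` would yield the 8-byte literal `"INT_MAX"` instead of the text of the limit, which is why the patch needs both levels.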
Signed-off-by: Pradipta Kumar Banerjee --- Index: rping.c ================================================================= --- rping.org 2006-06-09 10:57:43.000000000 +0530 +++ rping.c.new 2006-06-10 22:48:53.000000000 +0530 @@ -96,6 +96,15 @@ struct rping_rdma_info { #define RPING_BUFSIZE 64*1024 #define RPING_SQ_DEPTH 16 +/* Default string for print data and + * minimum buffer size + */ +#define _stringify( _x ) # _x +#define stringify( _x ) _stringify( _x ) + +#define RPING_MSG_FMT "rdma-ping-%d: " +#define RPING_MIN_BUFSIZE sizeof(stringify(INT_MAX)) + sizeof(RPING_MSG_FMT) + /* * Control block struct. */ @@ -774,7 +783,7 @@ static void rping_test_client(struct rpi cb->state = RDMA_READ_ADV; /* Put some ascii text in the buffer. */ - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); for (i = cc, c = start; i < cb->size; i++) { cb->start_buf[i] = c; c++; @@ -977,11 +986,11 @@ int main(int argc, char *argv[]) break; case 'S': cb->size = atoi(optarg); - if ((cb->size < 1) || + if ((cb->size < RPING_MIN_BUFSIZE) || (cb->size > (RPING_BUFSIZE - 1))) { fprintf(stderr, "Invalid size %d " - "(valid range is 1 to %d)\n", - cb->size, RPING_BUFSIZE); + "(valid range is %d to %d)\n", + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); ret = EINVAL; } else DEBUG_LOG("size %d\n", (int) atoi(optarg)); From halr at voltaire.com Sat Jun 10 11:11:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jun 2006 14:11:21 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <448AFD8D.3030809@mellanox.co.il> References: <86fyiflwks.fsf@mtl066.yok.mtl.com> <1149771197.4510.323092.camel@hal.voltaire.com> <448AFD8D.3030809@mellanox.co.il> Message-ID: <1149963035.5093.58165.camel@hal.voltaire.com> Hi Eitan, On Sat, 2006-06-10 at 13:12, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > Hi Eitan, > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > >>Hi Hal > >> > >>I'm 
working on passing osmtest check. Found a bug in the new > >>GUIDInfoRecord query: If you had a physical port with zero guid_cap > >>the code would loop on blocks 0..255 instead of trying the next port. > > > > > > OK; that's definitely a problem. > > > > > >>I am still looking for why we might have a guid_cap == 0 on some > >>ports. > > > > > > PortInfo:GuidCap is not used for switch external ports. > > > > > >>This patch resolves this new problem. osmtest passes on some arbitrary > >>networks. > >> > >>Eitan > >> > >>Signed-off-by: Eitan Zahavi > >> > >>Index: opensm/osm_sa_guidinfo_record.c > >>=================================================================== > >>--- opensm/osm_sa_guidinfo_record.c (revision 7703) > >>+++ opensm/osm_sa_guidinfo_record.c (working copy) > >>@@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( > >> continue; > >> > >> p_pi = osm_physp_get_port_info_ptr( p_physp ); > >>+ > >>+ if ( p_pi->guid_cap == 0 ) > >>+ continue; > >>+ > > > > > > I think the right fix is to detect switch external ports and use the > > VLCap from port 0 rather than from the switch external port (unless that > > concept is broken in which case it should return 0 records). > I think switch external ports do not have any PortGUID assigned to them since > they are not "end port" (i.e. addressable). Right; that's what I said earlier in a different way (PortGUID is not used for switch external ports). > So I think this patch is good enough. I think its better (an improvement) but not a complete fix for this issue. > What if a port reports guid_cap == 0? Is that legal ? Shouldn't any port where GUIDCap is valid have a non zero GUIDCap ? On any port where GUIDCap is not used (e.g. invalid), it should be ignored. > (I understand it is illegal for addressable port > but for the SM it is probably better not to assume all ports are legal...) That's my point on what a complete fix for this would include. 
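The computation being discussed is the rounding of `guid_cap` up to whole 8-GUID GUIDInfo blocks, with Eitan's zero-cap guard in front of it. A minimal sketch (function name is illustrative; the arithmetic matches the quoted `num_blocks` lines):

```c
#include <stdint.h>

/* Blocks needed for guid_cap GUIDs at 8 GUIDs per GUIDInfo block.
 * A port reporting guid_cap == 0 yields no blocks at all -- the bug
 * was that such a port was iterated over blocks 0..255 instead. */
static unsigned guidinfo_num_blocks(uint8_t guid_cap)
{
	if (guid_cap == 0)
		return 0;	/* skip the port, as the patch does */
	return guid_cap / 8 + (guid_cap % 8 ? 1 : 0);
}
```

The guard is the "simple fix"; the "complete fix" Hal describes would additionally recognize switch external ports (where GUIDCap is not meaningful) rather than trusting whatever value they report.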
-- Hal > EZ > > > > -- Hal > > > > > >> num_blocks = p_pi->guid_cap / 8; > >> if ( p_pi->guid_cap % 8 ) > >> num_blocks++; > >> > From eitan at mellanox.co.il Sat Jun 10 14:02:47 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 11 Jun 2006 00:02:47 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881B@mtlexch01.mtl.com> Hi Hal, When is a complete fix expected? Meanwhile, osmtest on a large enough cluster is not passing due to the huge number of GUID blocks... If this full fix is not anticipated soon, can we have the simple fix applied first? Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Saturday, June 10, 2006 9:11 PM > To: Eitan Zahavi > Cc: OPENIB > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > Hi Eitan, > > On Sat, 2006-06-10 at 13:12, Eitan Zahavi wrote: > > Hal Rosenstock wrote: > > > Hi Eitan, > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > >>Hi Hal > > >> > > >>I'm working on passing osmtest check. Found a bug in the new > > >>GUIDInfoRecord query: If you had a physical port with zero guid_cap > > >>the code would loop on blocks 0..255 instead of trying the next port. > > > > > > > > > OK; that's definitely a problem. > > > > > > > > >>I am still looking for why we might have a guid_cap == 0 on some > > >>ports. 
> > >> > > >>Eitan > > >> > > >>Signed-off-by: Eitan Zahavi > > >> > > >>Index: opensm/osm_sa_guidinfo_record.c > > >>=================================================================== > > >>--- opensm/osm_sa_guidinfo_record.c (revision 7703) > > >>+++ opensm/osm_sa_guidinfo_record.c (working copy) > > >>@@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( > > >> continue; > > >> > > >> p_pi = osm_physp_get_port_info_ptr( p_physp ); > > >>+ > > >>+ if ( p_pi->guid_cap == 0 ) > > >>+ continue; > > >>+ > > > > > > > > > I think the right fix is to detect switch external ports and use the > > > VLCap from port 0 rather than from the switch external port (unless that > > > concept is broken in which case it should return 0 records). > > I think switch external ports do not have any PortGUID assigned to them since > > they are not "end port" (i.e. addressable). > > Right; that's what I said earlier in a different way (PortGUID is not > used for switch external ports). > > > So I think this patch is good enough. > > I think its better (an improvement) but not a complete fix for this > issue. > > > What if a port reports guid_cap == 0? > > Is that legal ? Shouldn't any port where GUIDCap is valid have a non > zero GUIDCap ? On any port where GUIDCap is not used (e.g. invalid), it > should be ignored. > > > (I understand it is illegal for addressable port > > but for the SM it is probably better not to assume all ports are legal...) > > That's my point on what a complete fix for this would include. 
> > -- Hal > > > EZ > > > > > > -- Hal > > > > > > > > >> num_blocks = p_pi->guid_cap / 8; > > >> if ( p_pi->guid_cap % 8 ) > > >> num_blocks++; > > >> > > From halr at voltaire.com Sat Jun 10 14:07:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jun 2006 17:07:33 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881B@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881B@mtlexch01.mtl.com> Message-ID: <1149973652.5093.64803.camel@hal.voltaire.com> On Sat, 2006-06-10 at 17:02, Eitan Zahavi wrote: > Hi Hal, > > When is a complete fix expected? > Meanwhile osmtest on large enough cluster is not passing due to the huge > number of GUID blocks... > > If this full fix not anticipated soon can we have the simple fix applied > first? Sure. Let me know if this is also needed on the 1.0 branch. -- Hal > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Saturday, June 10, 2006 9:11 PM > > To: Eitan Zahavi > > Cc: OPENIB > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > > > Hi Eitan, > > > > On Sat, 2006-06-10 at 13:12, Eitan Zahavi wrote: > > > Hal Rosenstock wrote: > > > > Hi Eitan, > > > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > > > >>Hi Hal > > > >> > > > >>I'm working on passing osmtest check. Found a bug in the new > > > >>GUIDInfoRecord query: If you had a physical port with zero > guid_cap > > > >>the code would loop on blocks 0..255 instead of trying the next > port. > > > > > > > > > > > > OK; that's definitely a problem. > > > > > > > > > > > >>I am still looking for why we might have a guid_cap == 0 on some > > > >>ports. 
> > > > > > > > > > > > PortInfo:GuidCap is not used for switch external ports. > > > > > > > > > > > >>This patch resolves this new problem. osmtest passes on some > arbitrary > > > >>networks. > > > >> > > > >>Eitan > > > >> > > > >>Signed-off-by: Eitan Zahavi > > > >> > > > >>Index: opensm/osm_sa_guidinfo_record.c > > > > >>=================================================================== > > > >>--- opensm/osm_sa_guidinfo_record.c (revision 7703) > > > >>+++ opensm/osm_sa_guidinfo_record.c (working copy) > > > >>@@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( > > > >> continue; > > > >> > > > >> p_pi = osm_physp_get_port_info_ptr( p_physp ); > > > >>+ > > > >>+ if ( p_pi->guid_cap == 0 ) > > > >>+ continue; > > > >>+ > > > > > > > > > > > > I think the right fix is to detect switch external ports and use > the > > > > VLCap from port 0 rather than from the switch external port > (unless that > > > > concept is broken in which case it should return 0 records). > > > I think switch external ports do not have any PortGUID assigned to > them since > > > they are not "end port" (i.e. addressable). > > > > Right; that's what I said earlier in a different way (PortGUID is not > > used for switch external ports). > > > > > So I think this patch is good enough. > > > > I think its better (an improvement) but not a complete fix for this > > issue. > > > > > What if a port reports guid_cap == 0? > > > > Is that legal ? Shouldn't any port where GUIDCap is valid have a non > > zero GUIDCap ? On any port where GUIDCap is not used (e.g. invalid), > it > > should be ignored. > > > > > (I understand it is illegal for addressable port > > > but for the SM it is probably better not to assume all ports are > legal...) > > > > That's my point on what a complete fix for this would include. 
> > > > -- Hal > > > > > EZ > > > > > > > > -- Hal > > > > > > > > > > > >> num_blocks = p_pi->guid_cap / 8; > > > >> if ( p_pi->guid_cap % 8 ) > > > >> num_blocks++; > > > >> > > > From halr at voltaire.com Sat Jun 10 14:21:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jun 2006 17:21:36 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <86fyiflwks.fsf@mtl066.yok.mtl.com> References: <86fyiflwks.fsf@mtl066.yok.mtl.com> Message-ID: <1149974496.5093.65332.camel@hal.voltaire.com> Eitan, On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > Hi Hal > > I'm working on passing osmtest check. Found a bug in the new > GUIDInfoRecord query: If you had a physical port with zero guid_cap > the code would loop on blocks 0..255 instead of trying the next port. > > I am still looking for why we might have a guid_cap == 0 on some > ports. > > This patch resolves this new problem. osmtest passes on some arbitrary > networks. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. Let me know if it also should be applied to 1.0. -- Hal From tom at opengridcomputing.com Sat Jun 10 15:22:59 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Sat, 10 Jun 2006 17:22:59 -0500 Subject: [openib-general] [PATCH] rping: Erroneous check for minumum ping buffer size In-Reply-To: <20060610173417.GA14280@harry-potter.ibm.com> References: <20060610173417.GA14280@harry-potter.ibm.com> Message-ID: <1149978179.7311.29.camel@trinity.ogc.int> Thanks Pradipta, I'll apply test, and check these in. Tom. On Sat, 2006-06-10 at 23:04 +0530, Pradipta Kumar Banerjee wrote: > This includes the changes suggested by Tom. 
> > Signed-off-by: Pradipta Kumar Banerjee > --- > > Index: rping.c > ================================================================= > --- rping.org 2006-06-09 10:57:43.000000000 +0530 > +++ rping.c.new 2006-06-10 22:48:53.000000000 +0530 > @@ -96,6 +96,15 @@ struct rping_rdma_info { > #define RPING_BUFSIZE 64*1024 > #define RPING_SQ_DEPTH 16 > > +/* Default string for print data and > + * minimum buffer size > + */ > +#define _stringify( _x ) # _x > +#define stringify( _x ) _stringify( _x ) > + > +#define RPING_MSG_FMT "rdma-ping-%d: " > +#define RPING_MIN_BUFSIZE sizeof(stringify(INT_MAX)) + sizeof(RPING_MSG_FMT) > + > /* > * Control block struct. > */ > @@ -774,7 +783,7 @@ static void rping_test_client(struct rpi > cb->state = RDMA_READ_ADV; > > /* Put some ascii text in the buffer. */ > - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); > for (i = cc, c = start; i < cb->size; i++) { > cb->start_buf[i] = c; > c++; > @@ -977,11 +986,11 @@ int main(int argc, char *argv[]) > break; > case 'S': > cb->size = atoi(optarg); > - if ((cb->size < 1) || > + if ((cb->size < RPING_MIN_BUFSIZE) || > (cb->size > (RPING_BUFSIZE - 1))) { > fprintf(stderr, "Invalid size %d " > - "(valid range is 1 to %d)\n", > - cb->size, RPING_BUFSIZE); > + "(valid range is %d to %d)\n", > + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); > ret = EINVAL; > } else > DEBUG_LOG("size %d\n", (int) atoi(optarg)); From sashak at voltaire.com Sat Jun 10 17:27:58 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:27:58 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file Message-ID: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Hi, There are couple of unicast routing related patches for OpenSM. Basically it implements routing module which provides possibility to load switch forwarding tables from pre-created dump file. 
Currently only unicast table loading is supported; multicast may be added in the future. Short patch descriptions (more details may be found in the emails with the patches): 1. Ucast dump file simplification. 2. Modular routing - a preliminary, generic model for plugging new routing engines into OpenSM. 3. A new simple unicast routing engine that loads LFTs from a pre-created dump file. 4. Example of a ucast dump generation script. Please comment and test. Thanks. Sasha From sashak at voltaire.com Sat Jun 10 17:32:45 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:32:45 +0300 Subject: [openib-general] [PATCH 4/4] diags: ucast routing dump file generator example - dump_lfts.sh In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060611003245.22430.93904.stgit@sashak.voltaire.com> A new simple script, dump_lfts.sh, which may be used for ucast dump file generation. Signed-off-by: Sasha Khapyorsky --- diags/Makefile.am | 2 +- diags/scripts/dump_lfts.sh | 41 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 42 insertions(+), 1 deletions(-) diff --git a/diags/Makefile.am b/diags/Makefile.am index bf0c077..9654675 100644 --- a/diags/Makefile.am +++ b/diags/Makefile.am @@ -24,7 +24,7 @@ bin_SCRIPTS = scripts/ibcheckerrs script scripts/ibcheckstate scripts/ibcheckportstate \ scripts/ibcheckerrors scripts/ibclearerrors \ scripts/ibclearcounters scripts/discover.pl \ - scripts/set_mthca_nodedesc.sh + scripts/set_mthca_nodedesc.sh scripts/dump_lfts.sh src_ibaddr_SOURCES = src/ibaddr.c src_ibaddr_CFLAGS = -Wall $(DBGFLAGS) diff --git a/diags/scripts/dump_lfts.sh b/diags/scripts/dump_lfts.sh new file mode 100755 index 0000000..bed4778 --- /dev/null +++ b/diags/scripts/dump_lfts.sh @@ -0,0 +1,41 @@ +#!/bin/sh +# +# This simple script will collect outputs of ibroute for all switches +# on the subnet and drop it on stdout. 
May be used for LFTs dump +# generation. +# + +usage () +{ + echo "usage: $0 [-D]" + exit 2 +} + +dump_by_lid () +{ +for sw_lid in `ibswitches \ + | sed -ne 's/^.* lid \([1-9a-f]*\) .*$/\1/p'` ; do + ibroute $sw_lid +done +} + +dump_by_dr_path () +{ +for sw_dr in `ibnetdiscover -v \ + | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ + | sed -e 's/\]\[/,/g' \ + | sort -u` ; do + ibroute -D ${sw_dr} +done +} + + +if [ "$1" = "-D" ] ; then + dump_by_dr_path +elif [ -z "$1" ] ; then + dump_by_lid +else + usage +fi + +exit From sashak at voltaire.com Sat Jun 10 17:32:38 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:32:38 +0300 Subject: [openib-general] [PATCH 1/4] Simplification of the ucast fdb dumps. In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060611003238.22430.62423.stgit@sashak.voltaire.com> This separates the dump procedure from the rest of the flow and prevents multiple fopen()/fclose() calls (one pair per switch) - one fopen() and one fclose() are used instead.
Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_ucast_mgr.c | 187 +++++++++++++++++++++++--------------------- 1 files changed, 96 insertions(+), 91 deletions(-) diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index 40422e5..cac7f9b 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -128,7 +128,7 @@ osm_ucast_mgr_init( /********************************************************************** **********************************************************************/ -void +static void osm_ucast_mgr_dump_path_distribution( IN const osm_ucast_mgr_t* const p_mgr, IN const osm_switch_t* const p_sw ) @@ -143,70 +143,65 @@ osm_ucast_mgr_dump_path_distribution( OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_dump_path_distribution ); - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) - { - p_node = osm_switch_get_node_ptr( p_sw ); + p_node = osm_switch_get_node_ptr( p_sw ); - num_ports = osm_switch_get_num_ports( p_sw ); - sprintf( p_mgr->p_report_buf, "osm_ucast_mgr_dump_path_distribution: " - "Switch 0x%" PRIx64 "\n" - "Port : Path Count Through Port", - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); + num_ports = osm_switch_get_num_ports( p_sw ); + sprintf( p_mgr->p_report_buf, "osm_ucast_mgr_dump_path_distribution: " + "Switch 0x%" PRIx64 "\n" + "Port : Path Count Through Port", + cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); - for( i = 0; i < num_ports; i++ ) + for( i = 0; i < num_ports; i++ ) + { + num_paths = osm_switch_path_count_get( p_sw , i ); + sprintf( line, "\n %03u : %u", i, num_paths ); + strcat( p_mgr->p_report_buf, line ); + if( i == 0 ) { - num_paths = osm_switch_path_count_get( p_sw , i ); - sprintf( line, "\n %03u : %u", i, num_paths ); - strcat( p_mgr->p_report_buf, line ); - if( i == 0 ) - { - strcat( p_mgr->p_report_buf, " (switch management port)" ); - continue; - } - - p_remote_node = osm_node_get_remote_node( - p_node, i, NULL ); - - if( p_remote_node == NULL ) - continue; + strcat( 
p_mgr->p_report_buf, " (switch management port)" ); + continue; + } - remote_guid_ho = cl_ntoh64( - osm_node_get_node_guid( p_remote_node ) ); + p_remote_node = osm_node_get_remote_node( p_node, i, NULL ); + if( p_remote_node == NULL ) + continue; - switch( osm_node_get_remote_type( p_node, i ) ) - { - case IB_NODE_TYPE_SWITCH: - strcat( p_mgr->p_report_buf, " (link to switch" ); - break; - case IB_NODE_TYPE_ROUTER: - strcat( p_mgr->p_report_buf, " (link to router" ); - break; - case IB_NODE_TYPE_CA: - strcat( p_mgr->p_report_buf, " (link to CA" ); - break; - default: - strcat( p_mgr->p_report_buf, " (link to unknown type, node" ); - break; - } + remote_guid_ho = cl_ntoh64( osm_node_get_node_guid( p_remote_node ) ); - sprintf( line, " 0x%" PRIx64 ")", remote_guid_ho ); - strcat( p_mgr->p_report_buf, line ); + switch( osm_node_get_remote_type( p_node, i ) ) + { + case IB_NODE_TYPE_SWITCH: + strcat( p_mgr->p_report_buf, " (link to switch" ); + break; + case IB_NODE_TYPE_ROUTER: + strcat( p_mgr->p_report_buf, " (link to router" ); + break; + case IB_NODE_TYPE_CA: + strcat( p_mgr->p_report_buf, " (link to CA" ); + break; + default: + strcat( p_mgr->p_report_buf, " (link to unknown type, node" ); + break; } - strcat( p_mgr->p_report_buf, "\n" ); - - osm_log_raw( p_mgr->p_log, OSM_LOG_ROUTING, p_mgr->p_report_buf ); + sprintf( line, " 0x%" PRIx64 ")", remote_guid_ho ); + strcat( p_mgr->p_report_buf, line ); } + strcat( p_mgr->p_report_buf, "\n" ); + + osm_log_raw( p_mgr->p_log, OSM_LOG_ROUTING, p_mgr->p_report_buf ); + OSM_LOG_EXIT( p_mgr->p_log ); } /********************************************************************** **********************************************************************/ -void +static void osm_ucast_mgr_dump_ucast_routes( IN const osm_ucast_mgr_t* const p_mgr, - IN const osm_switch_t* const p_sw ) + IN const osm_switch_t* const p_sw, + IN FILE *p_fdbFile) { const osm_node_t* p_node; uint8_t port_num; @@ -217,34 +212,10 @@ 
osm_ucast_mgr_dump_ucast_routes( uint16_t lid_ho; char line[OSM_REPORT_LINE_SIZE]; uint32_t line_num = 0; - FILE * p_fdbFile; boolean_t ui_ucast_fdb_assign_func_defined; - char *file_name = NULL; OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_dump_ucast_routes ); - if( !osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) - goto Exit; - - file_name = - (char*)malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 10); - - CL_ASSERT(file_name); - - strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir); - strcat(file_name,"/osm.fdbs"); - - /* Open the file or error */ - p_fdbFile = fopen(file_name, "a"); - if (! p_fdbFile) - { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "osm_ucast_mgr_dump_ucast_routes: ERR 3A12: " - "Failed to open fdb file (%s)\n", - file_name ); - goto Exit; - } - p_node = osm_switch_get_node_ptr( p_sw ); max_lid_ho = osm_switch_get_max_lid_ho( p_sw ); @@ -324,15 +295,59 @@ osm_ucast_mgr_dump_ucast_routes( if( line_num != 0 ) fprintf(p_fdbFile,"%s\n",p_mgr->p_report_buf ); - fclose(p_fdbFile); - - Exit: - if (file_name) - free(file_name); OSM_LOG_EXIT( p_mgr->p_log ); } /********************************************************************** + **********************************************************************/ +struct ucast_mgr_dump_context { + osm_ucast_mgr_t *p_mgr; + FILE *file; +}; + +static void +__osm_ucast_mgr_dump_table( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; + struct ucast_mgr_dump_context *cxt = context; + + if( osm_log_is_active( cxt->p_mgr->p_log, OSM_LOG_DEBUG ) ) + osm_ucast_mgr_dump_path_distribution( cxt->p_mgr, p_sw ); + osm_ucast_mgr_dump_ucast_routes( cxt->p_mgr, p_sw, cxt->file ); +} + +static void osm_ucast_mgr_dump_tables( + IN osm_ucast_mgr_t *p_mgr) +{ + char file_name[1024]; + struct ucast_mgr_dump_context dump_context; + FILE *file; + + strncpy(file_name, p_mgr->p_subn->opt.dump_files_dir, sizeof(file_name) - 1); + strncat(file_name, "/osm.fdbs", 
sizeof(file_name) - strlen(file_name) - 1); + + file = fopen(file_name, "w"); + if (!file) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "osm_ucast_mgr_dump_ucast_routes: ERR 3A12: " + "Failed to open fdb file (%s)\n", + file_name ); + return; + } + + dump_context.p_mgr = p_mgr; + dump_context.file = file; + + cl_qmap_apply_func( &p_mgr->p_subn->sw_guid_tbl, + __osm_ucast_mgr_dump_table, &dump_context ); + + fclose(file); +} + +/********************************************************************** Add each switch's own LID to its LID matrix. **********************************************************************/ static void @@ -952,8 +967,6 @@ __osm_ucast_mgr_process_tbl( __osm_ucast_mgr_set_table( p_mgr, p_sw ); - osm_ucast_mgr_dump_path_distribution( p_mgr, p_sw ); - osm_ucast_mgr_dump_ucast_routes( p_mgr, p_sw ); OSM_LOG_EXIT( p_mgr->p_log ); } @@ -1047,7 +1060,6 @@ osm_ucast_mgr_process( uint32_t iteration_max; osm_signal_t signal; cl_qmap_t *p_sw_guid_tbl; - char *file_name = NULL; OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process ); @@ -1148,26 +1160,19 @@ osm_ucast_mgr_process( build and download the switch forwarding tables. */ - /* remove the old fdb dump file: */ - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) && (file_name = - (char*)malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 10)) ) - { - strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir); - strcat(file_name, "/osm.fdbs"); - unlink(file_name); - free(file_name); - } - cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr ); + /* dump fdb into file: */ + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) + osm_ucast_mgr_dump_tables( p_mgr ); + /* For now don't bother checking if the switch forwarding tables actually needed updating. The current code will always update them, and thus leave transactions pending on the wire. Therefore, return OSM_SIGNAL_DONE_PENDING. 
*/ - signal = OSM_SIGNAL_DONE_PENDING; } else From sashak at voltaire.com Sat Jun 10 17:32:43 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:32:43 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060611003243.22430.56582.stgit@sashak.voltaire.com> This patch implements a trivial routing module which is able to load LFT tables from a dump file. Main features: - support for unicast LFTs only; support for multicast can be added later - this will run after the min hop matrix calculation - this will load switch LFTs according to the path entries found in the dump file - no additional checks will be performed (such as whether the port is connected, etc.) - in case the fabric LIDs were changed, this will try to reconstruct LFTs correctly if endport GUIDs are present in the dump file (to disable this, the GUIDs may be removed from the dump file or zeroed) The dump file format is compatible with the output of the 'ibroute' utility, and for the whole fabric it may be generated with a script like this: for sw_lid in `ibswitches | awk '{print $NF}'` ; do ibroute $sw_lid done > /path/to/dump_file , or using DR paths: for sw_dr in `ibnetdiscover -v \ | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ | sed -e 's/\]\[/,/g' \ | sort -u` ; do ibroute -D ${sw_dr} done > /path/to/dump_file In order to activate the new module, use: opensm -R file -U /path/to/dump_file Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_subnet.h | 5 + osm/opensm/Makefile.am | 2 osm/opensm/main.c | 16 ++ osm/opensm/osm_opensm.c | 2 osm/opensm/osm_subnet.c | 10 ++ osm/opensm/osm_ucast_file.c | 258 +++++++++++++++++++++++++++++++++++++++ 6 files changed, 289 insertions(+), 4 deletions(-) diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index a637367..ec1d056
100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -277,6 +277,7 @@ typedef struct _osm_subn_opt boolean_t sweep_on_trap; osm_testability_modes_t testability_mode; char * routing_engine_name; + char * ucast_dump_file; char * updn_guid_file; boolean_t exit_on_fatal; boolean_t honor_guid2lid_file; @@ -423,6 +424,10 @@ typedef struct _osm_subn_opt * routing_engine_name * Name of used routing engine (other than default Min Hop Algorithm) * +* ucast_dump_file +* Name of the unicast routing dump file from where switch +* forwearding tables will be loaded +* * updn_guid_file * Pointer to name of the UPDN guid file given by User * diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am index 7b1060a..5da88a4 100644 --- a/osm/opensm/Makefile.am +++ b/osm/opensm/Makefile.am @@ -83,7 +83,7 @@ opensm_SOURCES = main.c osm_console.c os osm_sw_info_rcv_ctrl.c osm_switch.c \ osm_prtn.c osm_prtn_config.c osm_qos.c \ osm_trap_rcv.c osm_trap_rcv_ctrl.c \ - osm_ucast_mgr.c osm_ucast_updn.c \ + osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c \ osm_vl15intf.c osm_vl_arb_rcv.c \ osm_vl_arb_rcv_ctrl.c st.c opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 diff --git a/osm/opensm/main.c b/osm/opensm/main.c index c888ed4..dfb2aec 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -175,8 +175,12 @@ show_usage(void) " LID assignments resolving multiple use of same LID.\n\n"); printf( "-R\n" "--routing_engine \n" - " This option choose routing engine instead of Min Hop\n" - " algorithm (default). Supported engines: updn\n"); + " This option chooses routing engine instead of Min Hop\n" + " algorithm (default). 
Supported engines: updn, file\n"); + printf( "-U\n" + "--ucast_file \n" + " This option specifies name of the unicast dump file\n" + " from where switch forwarding tables will be loaded.\nn"); printf ("-a\n" "--add_guid_file \n" " Set the root nodes for the Up/Down routing algorithm\n" @@ -523,7 +527,7 @@ #endif boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:R:P:NQvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:s:t:a:R:U:P:NQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -556,6 +560,7 @@ #endif { "priority", 1, NULL, 'p'}, { "smkey", 1, NULL, 'k'}, { "routing_engine",1, NULL, 'R'}, + { "ucast_file" ,1, NULL, 'U'}, { "add_guid_file", 1, NULL, 'a'}, { "cache-options", 0, NULL, 'c'}, { "stay_on_fatal", 0, NULL, 'y'}, @@ -780,6 +785,11 @@ #endif printf(" Activate \'%s\' routing engine\n", optarg); break; + case 'U': + opt.ucast_dump_file = optarg; + printf(" Ucast dump file is \'%s\'\n", optarg); + break; + case 'a': /* Specifies port guids file diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 52f06da..a189591 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -74,10 +74,12 @@ struct routing_engine_module { }; extern int osm_ucast_updn_setup(osm_opensm_t *p_osm); +extern int osm_ucast_file_setup(osm_opensm_t *p_osm); const static struct routing_engine_module routing_modules[] = { {"null", NULL}, {"updn", osm_ucast_updn_setup }, + {"file", osm_ucast_file_setup }, {} }; diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 27f97ab..0d46f85 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -489,6 +489,7 @@ osm_subn_set_default_opt( p_opt->sweep_on_trap = TRUE; p_opt->testability_mode = OSM_TEST_MODE_NONE; p_opt->routing_engine_name = NULL; + p_opt->ucast_dump_file = NULL; p_opt->updn_guid_file = NULL; p_opt->exit_on_fatal = TRUE; 
subn_set_default_qos_options(&p_opt->qos_options); @@ -937,6 +938,10 @@ osm_subn_parse_conf_file( p_key, p_val, &p_opts->dump_files_dir); __osm_subn_opts_unpack_charp( + "ucast_dump_file" , + p_key, p_val, &p_opts->ucast_dump_file); + + __osm_subn_opts_unpack_charp( "updn_guid_file" , p_key, p_val, &p_opts->updn_guid_file); @@ -1094,6 +1099,11 @@ osm_subn_write_conf_file( "# Routing engine\n" "routing_engine %s\n\n", p_opts->routing_engine_name); + if (p_opts->ucast_dump_file) + fprintf( opts_file, + "# Ucast dump file name\n" + "ucast_dump_file %s\n\n", + p_opts->ucast_dump_file); if (p_opts->updn_guid_file) fprintf( opts_file, "# The file holding the Up/Down root node guids\n" diff --git a/osm/opensm/osm_ucast_file.c b/osm/opensm/osm_ucast_file.c new file mode 100644 index 0000000..a68d9ec --- /dev/null +++ b/osm/opensm/osm_ucast_file.c @@ -0,0 +1,258 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +/* + * Abstract: + * Implementation of OpenSM unicast routing module which loads + * routes from the dump file + * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include +#include +#include +#include +#include + +#define PARSEERR(log, file_name, lineno, fmt, arg...) \ + osm_log(log, OSM_LOG_ERROR, "PARSE ERROR: %s:%u: " fmt , \ + file_name, lineno, ##arg ) + +#define PARSEWARN(log, file_name, lineno, fmt, arg...) \ + osm_log(log, OSM_LOG_VERBOSE, "PARSE WARN: %s:%u: " fmt , \ + file_name, lineno, ##arg ) + +static uint16_t remap_lid(osm_opensm_t *p_osm, uint16_t lid, ib_net64_t guid) +{ + osm_port_t *p_port; + uint16_t min_lid, max_lid; + uint8_t lmc; + + p_port = (osm_port_t *)cl_qmap_get(&p_osm->subn.port_guid_tbl, guid); + if (!p_port || + p_port == (osm_port_t *)cl_qmap_end(&p_osm->subn.port_guid_tbl)) { + osm_log(&p_osm->log, OSM_LOG_VERBOSE, + "remap_lid: cannot find port guid 0x%016" PRIx64 + " , will use the same lid.\n", cl_ntoh64(guid)); + return lid; + } + + osm_port_get_lid_range_ho(p_port, &min_lid, &max_lid); + if (min_lid <= lid && lid <= max_lid) + return lid; + + lmc = osm_port_get_lmc(p_port); + return min_lid + (lid & ((1 << lmc) - 1)); +} + +static void add_path(osm_opensm_t * p_osm, + osm_switch_t * p_sw, uint16_t lid, uint8_t port_num, + ib_net64_t port_guid) +{ + uint16_t new_lid; + uint8_t old_port; + + new_lid = port_guid ? 
remap_lid(p_osm, lid, port_guid) : lid; + old_port = osm_fwd_tbl_get(osm_switch_get_fwd_tbl_ptr(p_sw), new_lid); + if (old_port != OSM_NO_PATH && old_port != port_num) { + osm_log(&p_osm->log, OSM_LOG_VERBOSE, + "add_path: LID collision is detected on switch " + "0x016%" PRIx64 ", will overwrite LID 0x%x entry.\n", + cl_ntoh64(osm_node_get_node_guid + (osm_switch_get_node_ptr(p_sw))), new_lid); + } + + osm_switch_set_path(p_sw, new_lid, port_num, TRUE); + + osm_log(&p_osm->log, OSM_LOG_DEBUG, + "add_path: route 0x%04x(was 0x%04x) %u 0x%016" PRIx64 + " is added to switch 0x%016" PRIx64 "\n", + new_lid, lid, port_num, cl_ntoh64(port_guid), + cl_ntoh64(osm_node_get_node_guid + (osm_switch_get_node_ptr(p_sw)))); +} + +static void clean_sw_fwd_table(void *arg, void *context) +{ + osm_switch_t *p_sw = arg; + uint16_t lid, max_lid; + + max_lid = osm_switch_get_max_lid_ho(p_sw); + for (lid = 1 ; lid <= max_lid ; lid++) + osm_switch_set_path(p_sw, lid, OSM_NO_PATH, TRUE); +} + +static int do_ucast_file_load(void *context) +{ + char line[1024]; + char *file_name; + FILE *file; + ib_net64_t sw_guid, port_guid; + osm_opensm_t *p_osm = context; + osm_switch_t *p_sw; + uint16_t lid; + uint8_t port_num; + unsigned lineno; + + file_name = p_osm->subn.opt.ucast_dump_file; + + if (!file_name) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "do_ucast_file_load: " + "ucast dump file name is not defined.\n"); + return -1; + } + + file = fopen(file_name, "r"); + if (!file) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "do_ucast_file_load: " + "cannot open ucast dump file \'%s\'\n", file_name); + return -1; + } + + cl_qmap_apply_func(&p_osm->subn.sw_guid_tbl, clean_sw_fwd_table, NULL); + + lineno = 0; + p_sw = NULL; + + while (fgets(line, sizeof(line) - 1, file) != NULL) { + char *p, *q; + lineno++; + + p = line; + while (isspace(*p)) + p++; + + if (*p == '#') + continue; + + if (!strncmp(p, "Multicast mlids", 15)) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "do_ucast_file_load: " + "Multicast dump 
file is detected. " + "Skip parsing.\n"); + } + else if (!strncmp(p, "Unicast lids", 12)) { + q = strstr(p, " guid 0x"); + if (!q) { + PARSEERR(&p_osm->log, file_name, lineno, + "cannot parse switch definition\n"); + return -1; + } + p = q + 6; + sw_guid = strtoll(p, &q, 16); + if (q && !isspace(*q)) { + PARSEERR(&p_osm->log, file_name, lineno, + "cannot parse switch guid: \'%s\'\n", + p); + return -1; + } + sw_guid = cl_hton64(sw_guid); + + p_sw = (osm_switch_t *)cl_qmap_get(&p_osm->subn.sw_guid_tbl, + sw_guid); + if (!p_sw || + p_sw == (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl)) { + p_sw = NULL; + osm_log(&p_osm->log, OSM_LOG_VERBOSE, + "do_ucast_file_load: " + "cannot find switch %016" PRIx64 ".\n", + cl_ntoh64(sw_guid)); + continue; + } + } + else if (p_sw && !strncmp(p, "0x", 2)) { + lid = strtoul(p, &q, 16); + if (q && !isspace(*q)) { + PARSEERR(&p_osm->log, file_name, lineno, + "cannot parse lid: \'%s\'\n", p); + return -1; + } + p = q; + while (isspace(*p)) + p++; + port_num = strtoul(p, &q, 10); + if (q && !isspace(*q)) { + PARSEERR(&p_osm->log, file_name, lineno, + "cannot parse port: \'%s\'\n", p); + return -1; + } + p = q; + /* additionally try to exract guid */ + q = strstr(p, " portguid 0x"); + if (!q) { + PARSEWARN(&p_osm->log, file_name, lineno, + "cannot find port guid " + "(maybe broken dump): \'%s\'\n", p); + port_guid = 0; + } + else { + p = q + 10; + port_guid = strtoll(p, &q, 16); + if (!q && !isspace(*q) && *q != ':') { + PARSEWARN(&p_osm->log, file_name, + lineno, + "cannot parse port guid " + "(maybe broken dump): " + "\'%s\'\n", p); + port_guid = 0; + } + } + port_guid = cl_hton64(port_guid); + add_path(p_osm, p_sw, lid, port_num, port_guid); + } + } + + fclose(file); + return 0; +} + +int osm_ucast_file_setup(osm_opensm_t * p_osm) +{ + p_osm->routing_engine.context = (void *)p_osm; + p_osm->routing_engine.ucast_build_fwd_tables = do_ucast_file_load; + return 0; +} From sashak at voltaire.com Sat Jun 10 17:32:41 2006 From: 
sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:32:41 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060611003240.22430.88414.stgit@sashak.voltaire.com> This patch introduces a routing_engine structure which may be used for "plugging" in a new routing module. Currently only unicast callbacks are supported (multicast can be added later). The existing up-down routing module, 'updn', may be activated with the '-R updn' option (instead of the old '-u'). General usage is: $ opensm -R 'module-name' Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_opensm.h | 17 ++++++++- osm/include/opensm/osm_subnet.h | 16 ++------ osm/include/opensm/osm_ucast_updn.h | 26 ------------- osm/opensm/main.c | 26 +++++-------- osm/opensm/osm_opensm.c | 41 ++++++++++++++++++--- osm/opensm/osm_subnet.c | 23 ++++++------ osm/opensm/osm_ucast_mgr.c | 69 ++++++++++++++++++++++++----------- osm/opensm/osm_ucast_updn.c | 69 ++++++++++++++++++----------------- 8 files changed, 156 insertions(+), 131 deletions(-) diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h index 3235ad4..3e6e120 100644 --- a/osm/include/opensm/osm_opensm.h +++ b/osm/include/opensm/osm_opensm.h @@ -92,6 +92,18 @@ BEGIN_C_DECLS * *********/ +/* + * routing engine structure - yet limited by ucast_fdb_assign and + * ucast_build_fwd_tables (multicast callbacks may be added later) + */ +struct osm_routing_engine { + const char *name; + void *context; + int (*ucast_build_fwd_tables)(void *context); + int (*ucast_fdb_assign)(void *context); + void (*delete)(void *context); +}; + /****s* OpenSM: OpenSM/osm_opensm_t * NAME * osm_opensm_t @@ -116,7 +128,7 @@ typedef struct _osm_opensm_t osm_log_t log; cl_dispatcher_t disp; cl_plock_t lock; - updn_t *p_updn_ucast_routing; + struct osm_routing_engine
routing_engine; osm_stats_t stats; } osm_opensm_t; /* @@ -153,6 +165,9 @@ typedef struct _osm_opensm_t * lock * Shared lock guarding most OpenSM structures. * +* routing_engine +* Routing engine, will be initialized then used +* * stats * Open SM statistics block * diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index 4db449d..a637367 100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -272,13 +272,11 @@ typedef struct _osm_subn_opt uint32_t max_port_profile; osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; void * ui_pre_lid_assign_ctx; - osm_pfn_ui_extension_t pfn_ui_ucast_fdb_assign; - void * ui_ucast_fdb_assign_ctx; osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; void * ui_mcast_fdb_assign_ctx; boolean_t sweep_on_trap; osm_testability_modes_t testability_mode; - boolean_t updn_activate; + char * routing_engine_name; char * updn_guid_file; boolean_t exit_on_fatal; boolean_t honor_guid2lid_file; @@ -407,13 +405,6 @@ typedef struct _osm_subn_opt * ui_pre_lid_assign_ctx * A UI context (void *) to be provided to the pfn_ui_pre_lid_assign * -* pfn_ui_ucast_fdb_assign -* A UI function to be called instead of the ucast manager FDB -* configuration. -* -* ui_ucast_fdb_assign_ctx -* A UI context (void *) to be provided to the pfn_ui_ucast_fdb_assign -* * pfn_ui_mcast_fdb_assign * A UI function to be called inside the mcast manager instead of the * call for the build spanning tree. This will be called on every @@ -429,9 +420,8 @@ typedef struct _osm_subn_opt * testability_mode * Object that indicates if we are running in a special testability mode. 
* -* updn_activate -* Object that indicates if we are running the UPDN algorithm (TRUE) or -* Min Hop Algorithm (FALSE) +* routing_engine_name +* Name of used routing engine (other than default Min Hop Algorithm) * * updn_guid_file * Pointer to name of the UPDN guid file given by User diff --git a/osm/include/opensm/osm_ucast_updn.h b/osm/include/opensm/osm_ucast_updn.h index 027056c..fbf8782 100644 --- a/osm/include/opensm/osm_ucast_updn.h +++ b/osm/include/opensm/osm_ucast_updn.h @@ -421,32 +421,6 @@ osm_subn_calc_up_down_min_hop_table( * This function returns 0 when rankning has succeded , otherwise 1. ******/ -/****f* OpenSM: OpenSM/osm_updn_reg_calc_min_hop_table -* NAME -* osm_updn_reg_calc_min_hop_table -* -* DESCRIPTION -* Registration function to ucast routing manager (instead of -* Min Hop Algorithm) -* -* SYNOPSIS -*/ -int -osm_updn_reg_calc_min_hop_table( - IN updn_t * p_updn, - IN osm_subn_opt_t* p_opt ); -/* -* PARAMETERS -* -* RETURN VALUES -* 0 - on success , 1 - on failure -* -* NOTES -* -* SEE ALSO -* osm_subn_calc_up_down_min_hop_table -*********/ - /****** Osmsh: UpDown/osm_updn_find_root_nodes_by_min_hop * NAME * osm_updn_find_root_nodes_by_min_hop diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 22591eb..c888ed4 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -60,7 +60,6 @@ #include #include #include #include -#include #include /******************************************************************** @@ -174,10 +173,10 @@ show_usage(void) " may disrupt subnet traffic.\n" " Without -r, OpenSM attempts to preserve existing\n" " LID assignments resolving multiple use of same LID.\n\n"); - printf( "-u\n" - "--updn\n" - " This option activate UPDN algorithm instead of Min Hop\n" - " algorithm (default).\n"); + printf( "-R\n" + "--routing_engine \n" + " This option choose routing engine instead of Min Hop\n" + " algorithm (default). 
Supported engines: updn\n"); printf ("-a\n" "--add_guid_file \n" " Set the root nodes for the Up/Down routing algorithm\n" @@ -524,7 +523,7 @@ #endif boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:P:NQuvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:s:t:a:R:P:NQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -556,7 +555,7 @@ #endif { "reassign_lids", 0, NULL, 'r'}, { "priority", 1, NULL, 'p'}, { "smkey", 1, NULL, 'k'}, - { "updn", 0, NULL, 'u'}, + { "routing_engine",1, NULL, 'R'}, { "add_guid_file", 1, NULL, 'a'}, { "cache-options", 0, NULL, 'c'}, { "stay_on_fatal", 0, NULL, 'y'}, @@ -776,9 +775,9 @@ #endif opt.sm_key = sm_key; break; - case 'u': - opt.updn_activate = TRUE; - printf(" Activate UPDN algorithm\n"); + case 'R': + opt.routing_engine_name = optarg; + printf(" Activate \'%s\' routing engine\n", optarg); break; case 'a': @@ -885,13 +884,6 @@ #endif setup_signals(); osm_opensm_sweep( &osm ); - /* since osm_opensm_init get opt as RO we'll set the opt value with UI pfn here */ - /* Now do the registration */ - if (opt.updn_activate) - if (osm_updn_reg_calc_min_hop_table(osm.p_updn_ucast_routing, &(osm.subn.opt))) { - status = IB_ERROR; - goto Exit; - } if( run_once_flag == TRUE ) { diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 8c422b5..52f06da 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -68,6 +68,37 @@ #include #include #include +struct routing_engine_module { + const char *name; + int (*setup)(osm_opensm_t *p_osm); +}; + +extern int osm_ucast_updn_setup(osm_opensm_t *p_osm); + +const static struct routing_engine_module routing_modules[] = { + {"null", NULL}, + {"updn", osm_ucast_updn_setup }, + {} +}; + +static int setup_routing_engine(osm_opensm_t *p_osm, const char *name) +{ + const struct routing_engine_module *r; + for (r = routing_modules ; r->name && *r->name ; r++) { + 
if(!strcmp(r->name, name)) { + p_osm->routing_engine.name = r->name; + if (r->setup(p_osm)) + break; + osm_log (&p_osm->log, OSM_LOG_DEBUG, + "opensm: setup_routing_engine: " + "\'%s\' routing engine set up.\n", + p_osm->routing_engine.name); + return 0; + } + } + return -1; +} + /********************************************************************** **********************************************************************/ void @@ -118,7 +149,8 @@ osm_opensm_destroy( cl_disp_shutdown( &p_osm->disp ); /* do the destruction in reverse order as init */ - updn_destroy( p_osm->p_updn_ucast_routing ); + if (p_osm->routing_engine.delete) + p_osm->routing_engine.delete(p_osm->routing_engine.context); osm_sa_destroy( &p_osm->sa ); osm_sm_destroy( &p_osm->sm ); osm_db_destroy( &p_osm->db ); @@ -252,11 +284,8 @@ #endif if( status != IB_SUCCESS ) goto Exit; - /* HACK - the UpDown manager should have been a part of the osm_sm_t */ - /* Init updn struct */ - p_osm->p_updn_ucast_routing = updn_construct( ); - status = updn_init( p_osm->p_updn_ucast_routing ); - if( status != IB_SUCCESS ) + if( p_opt->routing_engine_name && + setup_routing_engine(p_osm, p_opt->routing_engine_name)) goto Exit; Exit: diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 7c08556..27f97ab 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -484,13 +484,11 @@ osm_subn_set_default_opt( p_opt->max_port_profile = 0xffffffff; p_opt->pfn_ui_pre_lid_assign = NULL; p_opt->ui_pre_lid_assign_ctx = NULL; - p_opt->pfn_ui_ucast_fdb_assign = NULL; - p_opt->ui_ucast_fdb_assign_ctx = NULL; p_opt->pfn_ui_mcast_fdb_assign = NULL; p_opt->ui_mcast_fdb_assign_ctx = NULL; p_opt->sweep_on_trap = TRUE; p_opt->testability_mode = OSM_TEST_MODE_NONE; - p_opt->updn_activate = FALSE; + p_opt->routing_engine_name = NULL; p_opt->updn_guid_file = NULL; p_opt->exit_on_fatal = TRUE; subn_set_default_qos_options(&p_opt->qos_options); @@ -911,9 +909,9 @@ osm_subn_parse_conf_file( "sweep_on_trap", p_key, 
p_val, &p_opts->sweep_on_trap); - __osm_subn_opts_unpack_boolean( - "updn_activate", - p_key, p_val, &p_opts->updn_activate); + __osm_subn_opts_unpack_charp( + "routing_engine", + p_key, p_val, &p_opts->routing_engine_name); __osm_subn_opts_unpack_charp( "log_file", p_key, p_val, &p_opts->log_file); @@ -1089,12 +1087,13 @@ osm_subn_write_conf_file( opts_file, "#\n# ROUTING OPTIONS\n#\n" "# If true do not count switches as link subscriptions\n" - "port_profile_switch_nodes %s\n\n" - "# Activate the Up/Down routing algorithm\n" - "updn_activate %s\n\n", - p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE", - p_opts->updn_activate ? "TRUE" : "FALSE" - ); + "port_profile_switch_nodes %s\n\n", + p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE"); + if (p_opts->routing_engine_name) + fprintf( opts_file, + "# Routing engine\n" + "routing_engine %s\n\n", + p_opts->routing_engine_name); if (p_opts->updn_guid_file) fprintf( opts_file, "# The file holding the Up/Down root node guids\n" diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index cac7f9b..0c0d635 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -62,6 +62,7 @@ #include #include #include #include +#include #define LINE_LENGTH 256 @@ -269,7 +270,7 @@ osm_ucast_mgr_dump_ucast_routes( strcat( p_mgr->p_report_buf, "yes" ); else { - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) { + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { ui_ucast_fdb_assign_func_defined = TRUE; } else { ui_ucast_fdb_assign_func_defined = FALSE; @@ -708,7 +709,7 @@ __osm_ucast_mgr_process_port( node_guid = osm_node_get_node_guid(osm_switch_get_node_ptr( p_sw ) ); /* Flag to mark whether or not a ui ucast fdb assign function was given */ - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) ui_ucast_fdb_assign_func_defined = TRUE; else ui_ucast_fdb_assign_func_defined = FALSE; @@ -753,7 +754,7 @@ __osm_ucast_mgr_process_port( /* 
Up/Down routing can cause unreachable routes between some switches so we do not report that as an error in that case */ - if (!p_mgr->p_subn->opt.updn_activate) + if (!p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_ucast_mgr_process_port: ERR 3A08: " @@ -973,6 +974,18 @@ __osm_ucast_mgr_process_tbl( /********************************************************************** **********************************************************************/ static void +__osm_ucast_mgr_set_table_cb( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; + osm_ucast_mgr_t* const p_mgr = (osm_ucast_mgr_t*)context; + __osm_ucast_mgr_set_table( p_mgr, p_sw ); +} + +/********************************************************************** + **********************************************************************/ +static void __osm_ucast_mgr_process_neighbors( IN cl_map_item_t* const p_map_item, IN void* context ) @@ -1058,12 +1071,14 @@ osm_ucast_mgr_process( { uint32_t i; uint32_t iteration_max; + struct osm_routing_engine *p_routing_eng; osm_signal_t signal; cl_qmap_t *p_sw_guid_tbl; OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process ); p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl; + p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); @@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( i ); + if (p_routing_eng->ucast_build_fwd_tables && + p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) + { + cl_qmap_apply_func( p_sw_guid_tbl, + __osm_ucast_mgr_set_table_cb, p_mgr ); + } /* fallback on the regular path in case of failures */ + else + { /* This is the place where we can load pre-defined routes into the switches fwd_tbl structures. @@ -1136,32 +1159,34 @@ osm_ucast_mgr_process( Later code will use these values if not configured for re-assignment. 
*/ - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) - { - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + if (p_routing_eng->ucast_fdb_assign) { - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "osm_ucast_mgr_process: " - "Invoking UI function pfn_ui_ucast_fdb_assign\n"); - } - p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign(p_mgr->p_subn->opt.ui_ucast_fdb_assign_ctx); - } else { + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "osm_ucast_mgr_process: " + "Invoking \'%s\' function ucast_fdb_assign\n", + p_routing_eng->name); + } + p_routing_eng->ucast_fdb_assign(p_routing_eng->context); + } else { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "osm_ucast_mgr_process: " "UI pfn was not invoked\n"); - } + } - osm_log(p_mgr->p_log, OSM_LOG_INFO, - "osm_ucast_mgr_process: " - "Min Hop Tables configured on all switches\n"); + osm_log(p_mgr->p_log, OSM_LOG_INFO, + "osm_ucast_mgr_process: " + "Min Hop Tables configured on all switches\n"); - /* - Now that the lid matrixes have been built, we can - build and download the switch forwarding tables. - */ + /* + Now that the lid matrixes have been built, we can + build and download the switch forwarding tables. 
+ */ - cl_qmap_apply_func( p_sw_guid_tbl, - __osm_ucast_mgr_process_tbl, p_mgr ); + cl_qmap_apply_func( p_sw_guid_tbl, + __osm_ucast_mgr_process_tbl, p_mgr ); + } /* dump fdb into file: */ if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c index d80f7eb..8e36854 100644 --- a/osm/opensm/osm_ucast_updn.c +++ b/osm/opensm/osm_ucast_updn.c @@ -76,8 +76,9 @@ __updn_get_dir(IN uint8_t cur_rank, IN uint64_t cur_guid, IN uint64_t rem_guid) { - uint32_t i = 0, max_num_guids = osm.p_updn_ucast_routing->updn_ucast_reg_inputs.num_guids; - uint64_t *p_guid = osm.p_updn_ucast_routing->updn_ucast_reg_inputs.guid_list; + updn_t *p_updn = osm.routing_engine.context; + uint32_t i = 0, max_num_guids = p_updn->updn_ucast_reg_inputs.num_guids; + uint64_t *p_guid = p_updn->updn_ucast_reg_inputs.guid_list; boolean_t cur_is_root = FALSE , rem_is_root = FALSE; /* HACK: comes to solve root nodes connection, in a classic subnet root nodes does not connect @@ -540,7 +541,7 @@ updn_init( p_updn->updn_ucast_reg_inputs.guid_list = NULL; p_updn->auto_detect_root_nodes = FALSE; /* Check if updn is activated , then fetch root nodes */ - if (osm.subn.opt.updn_activate) + if (osm.routing_engine.context) { /* Check the source for root node list, if file parse it, otherwise @@ -569,7 +570,7 @@ updn_init( { p_tmp = malloc(sizeof(uint64_t)); *p_tmp = strtoull(line, NULL, 16); - cl_list_insert_tail(osm.p_updn_ucast_routing->p_root_nodes, p_tmp); + cl_list_insert_tail(p_updn->p_root_nodes, p_tmp); } } else @@ -588,8 +589,8 @@ updn_init( "osm_opensm_init: " "UPDN - Root nodes fetching by file %s\n", osm.subn.opt.updn_guid_file); - guid_iterator = cl_list_head(osm.p_updn_ucast_routing->p_root_nodes); - while( guid_iterator != cl_list_end(osm.p_updn_ucast_routing->p_root_nodes) ) + guid_iterator = cl_list_head(p_updn->p_root_nodes); + while( guid_iterator != cl_list_end(p_updn->p_root_nodes) ) { osm_log( &osm.log, OSM_LOG_DEBUG, 
"osm_opensm_init: " @@ -600,7 +601,7 @@ updn_init( } else { - osm.p_updn_ucast_routing->auto_detect_root_nodes = TRUE; + p_updn->auto_detect_root_nodes = TRUE; } /* If auto mode detection reuired - will be executed in main b4 the assignment of UI Ucast */ } @@ -985,33 +986,6 @@ void __osm_updn_convert_list2array(IN up /********************************************************************** **********************************************************************/ -/* Registration function to ucast routing manager (instead of - Min Hop Algorithm) */ -int -osm_updn_reg_calc_min_hop_table( - IN updn_t * p_updn, - IN osm_subn_opt_t* p_opt ) -{ - OSM_LOG_ENTER(&(osm.log), osm_updn_reg_calc_min_hop_table); - /* - If root nodes were supplied by the user - we need to convert into array - otherwise, will be created & converted in callback function activation - */ - if (!p_updn->auto_detect_root_nodes) - { - __osm_updn_convert_list2array(p_updn); - } - osm_log (&(osm.log), OSM_LOG_DEBUG, - "osm_updn_reg_calc_min_hop_table: " - "assigning ucast fdb UI function with updn callback\n"); - p_opt->pfn_ui_ucast_fdb_assign = __osm_updn_call; - p_opt->ui_ucast_fdb_assign_ctx = (void *)p_updn; - OSM_LOG_EXIT(&(osm.log)); - return 0; -} - -/********************************************************************** - **********************************************************************/ /* Find Root nodes automatically by Min Hop Table info */ int osm_updn_find_root_nodes_by_min_hop( OUT updn_t * p_updn ) @@ -1210,3 +1184,30 @@ osm_updn_find_root_nodes_by_min_hop( OUT OSM_LOG_EXIT(&(osm.log)); return 0; } + +/********************************************************************** + **********************************************************************/ + +static void __osm_updn_delete(void *context) +{ + updn_t *p_updn = context; + updn_destroy(p_updn); +} + +int osm_ucast_updn_setup(osm_opensm_t *p_osm) +{ + updn_t *p_updn; + p_updn = updn_construct(); + if (!p_updn) + return -1; + 
p_osm->routing_engine.context = p_updn; + p_osm->routing_engine.delete = __osm_updn_delete; + p_osm->routing_engine.ucast_fdb_assign = __osm_updn_call; + + if (updn_init(p_updn) != IB_SUCCESS) + return -1; + if (!p_updn->auto_detect_root_nodes) + __osm_updn_convert_list2array(p_updn); + + return 0; +} From eitan at mellanox.co.il Sat Jun 10 23:07:56 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 11 Jun 2006 09:07:56 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881C@mtlexch01.mtl.com> Hi Hal, As the 1.0 does not support GUIDInfo, I do not think this patch is relevant to 1.0. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Sunday, June 11, 2006 12:22 AM > To: Eitan Zahavi > Cc: OPENIB > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > Eitan, > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > Hi Hal > > > > I'm working on passing osmtest check. Found a bug in the new > > GUIDInfoRecord query: If you had a physical port with zero guid_cap > > the code would loop on blocks 0..255 instead of trying the next port. > > > > I am still looking for why we might have a guid_cap == 0 on some > > ports. > > > > This patch resolves this new problem. osmtest passes on some arbitrary > > networks. > > > > Eitan > > > > Signed-off-by: Eitan Zahavi > > Thanks. Applied to trunk only. > > Let me know if it also should be applied to 1.0.
> > -- Hal From bugzilla-daemon at openib.org Sat Jun 10 23:23:13 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sat, 10 Jun 2006 23:23:13 -0700 (PDT) Subject: [openib-general] [Bug 126] RDMA_CM and UCM not loaded on boot Message-ID: <20060611062313.10CDC2287AC@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=126 vlad at mellanox.co.il changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WONTFIX ------- Comment #1 from vlad at mellanox.co.il 2006-06-10 23:23 ------- RDMA_CM and RDMA_UCM are not loaded by default. In order to load them upon boot, edit the /etc/infiniband/openib.conf file and set RDMA_CM_LOAD=yes and RDMA_UCM_LOAD=yes: # Start HCA driver upon boot ONBOOT=yes # Load UCM module UCM_LOAD=no # Load RDMA_CM module RDMA_CM_LOAD=no # Load RDMA_UCM module RDMA_UCM_LOAD=no ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eitan at mellanox.co.il Sat Jun 10 23:36:45 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 11 Jun 2006 09:36:45 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881F@mtlexch01.mtl.com> Hi Sasha, General comments: 1. I hope the change in osm.fdbs is not going to break the parser in ibdm:Fabric.cpp - was it really a necessary change, or just nice to have? 2. The modular routing is a great idea. From my first glance it seems that it assumes the calculation of min-hop-tables is common to all routing engines. I think it should be a callback provided by the engine too. Please note that the Min-Hop engine takes most of the routing time, so in the future if we could avoid that stage it would be even better. [EZ] We should start thinking about testing of this new feature too. Further comments on the patches themselves.
> There are couple of unicast routing related patches for OpenSM. > > Basically it implements routing module which provides possibility to load > switch forwarding tables from pre-created dump file. Currently unicast > tables loading is only supported, multicast may be added in a future. > > Short patch descriptions (more details may be found in emails with > patches): > > 1. Ucast dump file simplification. > 2. Modular routing - preliminary implements generic model to plug new > routing engine to OpenSM. > 3. New simple unicast routing engine which allows to load LFTs from > pre-created dump file. > 4. Example of ucast dump generation script. > > Please comment and test. Thanks. > > Sasha From mst at mellanox.co.il Sat Jun 10 23:38:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 11 Jun 2006 09:38:42 +0300 Subject: [openib-general] race in mthca_cq.c? In-Reply-To: References: Message-ID: <20060611063842.GU7359@mellanox.co.il> Quoting r. Roland Dreier : > Michael> But there might be more EQEs for this CQN outstanding in > Michael> the EQ which we have not seen yet. > > Now that you mention it, that could be a real problem I guess. > synchronize_irq() isn't enough because the interrupt handler might not > have even started yet. Only in MSI configurations though: with regular interrupts command interface shares IRQ with completions so the EQ will be emptied before interrupt handler is done. 
-- MST From halr at voltaire.com Sun Jun 11 03:12:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jun 2006 06:12:08 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881C@mtlexch01.mtl.com> Message-ID: <1150020727.570.29434.camel@hal.voltaire.com> Hi Eitan, On Sun, 2006-06-11 at 02:07, Eitan Zahavi wrote: > Hi Hal, > > As the 1.0 does not support GUIDInfo I do not this patch is relevant to > 1.0 Huh ? What's https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/osm/opensm/osm_sa_guidinfo_record.c -- Hal > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Sunday, June 11, 2006 12:22 AM > > To: Eitan Zahavi > > Cc: OPENIB > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > > > Eitan, > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > Hi Hal > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > GUIDInfoRecord query: If you had a physical port with zero guid_cap > > > the code would loop on blocks 0..255 instead of trying the next > port. > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > ports. > > > > > > This patch resolves this new problem. osmtest passes on some > arbitrary > > > networks. > > > > > > Eitan > > > > > > Signed-off-by: Eitan Zahavi > > > > Thanks. Applied to trunk only. > > > > Let me know if it also should be applied to 1.0. 
> > > > -- Hal From eitan at mellanox.co.il Sun Jun 11 03:46:37 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 11 Jun 2006 13:46:37 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> Auuuch, it is there! My mistake. So please apply the patch to the OFED 1.0 branch too. BTW: Does the osmtest -f a exercise this query on the OFED 1.0? > Huh ? What's > https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/o sm/opens > m/osm_sa_guidinfo_record.c > > -- Hal > > > > > Eitan Zahavi > > Senior Engineering Director, Software Architect > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Sunday, June 11, 2006 12:22 AM > > > To: Eitan Zahavi > > > Cc: OPENIB > > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > > > > > Eitan, > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > Hi Hal > > > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > > GUIDInfoRecord query: If you had a physical port with zero guid_cap > > > > the code would loop on blocks 0..255 instead of trying the next > > port. > > > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > > ports. > > > > > > > > This patch resolves this new problem. osmtest passes on some > > arbitrary > > > > networks. > > > > > > > > Eitan > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > Thanks. Applied to trunk only. > > > > > > Let me know if it also should be applied to 1.0.
> > > > -- Hal From tziporet at mellanox.co.il Sun Jun 11 03:48:33 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 11 Jun 2006 13:48:33 +0300 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA719C@mtlexch01.mtl.com> Jack put the bug fix into OFED 1.0. Tziporet -----Original Message----- From: James Lentini [mailto:jlentini at netapp.com] Sent: Saturday, June 10, 2006 1:12 AM To: Tziporet Koren Cc: Jack Morgenstein; openib-general; Arlin Davis Subject: Re: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS On Fri, 9 Jun 2006, Arlin Davis wrote: > James Lentini wrote: > > > On Thu, 8 Jun 2006, Jack Morgenstein wrote: > > > > > > > On Wednesday 07 June 2006 18:26, James Lentini wrote: > > > > > > > On Wed, 7 Jun 2006, Jack Morgenstein wrote: > > > > > > > > > This (bug fix) can still be included in next-week's release, if you > > > > > think it is important (I have extracted it from the changes checked > > > > > in at svn 7755) > > > > > > > > > If you are going to make another release anyway, then I would included > > > > it. > > > > > > > Do you mean -- include the fix in next week's release -- or -- wait with > > > the fix for the following release? > > > > > > > I'd include the fix in the next release, but I wouldn't create a special > > release just for this fix. > > > So are we getting this in next weeks release or not? I think we need it. Tziporet, Will this fix be in the next OFED release? From mst at mellanox.co.il Sun Jun 11 03:52:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 11 Jun 2006 13:52:10 +0300 Subject: [openib-general] [PATCH 1/5] ib_addr: retrieve MGID from device address In-Reply-To: <000e01c68c0d$5d31b500$ff0da8c0@amr.corp.intel.com> References: <000e01c68c0d$5d31b500$ff0da8c0@amr.corp.intel.com> Message-ID: <20060611105210.GA7359@mellanox.co.il> Quoting r.
Sean Hefty : > Subject: [PATCH 1/5] ib_addr: retrieve MGID from device address > > Extract the MGID used by ipoib for broadcast traffic from the device > address. > > Signed-off-by: Sean Hefty > --- > This will be used to get the MCMemberRecord for the ipoib broadcast group. > > --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_addr.h 2006-05-25 11:18:47.000000000 -0700 > +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_addr.h 2006-06-06 16:14:11.000000000 -0700 > @@ -89,6 +89,11 @@ static inline void ib_addr_set_pkey(stru > dev_addr->broadcast[9] = (unsigned char) pkey; > } > > +static inline union ib_gid *ib_addr_get_mgid(struct rdma_dev_addr *dev_addr) > +{ > + return (union ib_gid *) (dev_addr->broadcast + 4); > +} > + > static inline union ib_gid *ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) > { > return (union ib_gid *) (dev_addr->src_dev_addr + 4); > dev_addr->broadcast + 4/dev_addr->src_dev_addr + 4 may not be naturally aligned, so casting this pointer to a structure type may cause the compiler to generate incorrect code. In particular, this will generate misaligned access faults on ia64 when used, as we have already seen in the case of IPoIB. Please fix these to return the gid as char[16] instead, so that the user uses memcpy properly and so that the compiler knows the address may not be aligned. -- MST From halr at voltaire.com Sun Jun 11 04:14:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jun 2006 07:14:22 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> Message-ID: <1150024461.570.31906.camel@hal.voltaire.com> On Sun, 2006-06-11 at 06:46, Eitan Zahavi wrote: > Auuuch it is there! > My mistake. Sp please apply the patch to the OFED 1.0 branch too. > BTW: Is the osmtest -f a excersizes this query on the OFED 1.0 ? Yes. -- Hal > > Huh ?
What's > > > https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/o > sm/opens > > m/osm_sa_guidinfo_record.c > > > > -- Hal > > > > > > > > Eitan Zahavi > > > Senior Engineering Director, Software Architect > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Sunday, June 11, 2006 12:22 AM > > > > To: Eitan Zahavi > > > > Cc: OPENIB > > > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable > query > > > > > > > > Eitan, > > > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > Hi Hal > > > > > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > > > GUIDInfoRecord query: If you had a physical port with zero > guid_cap > > > > > the code would loop on blocks 0..255 instead of trying the next > > > port. > > > > > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > > > ports. > > > > > > > > > > This patch resolves this new problem. osmtest passes on some > > > arbitrary > > > > > networks. > > > > > > > > > > Eitan > > > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > > > Thanks. Applied to trunk only. > > > > > > > > Let me know if it also should be applied to 1.0. 
> > > > > > > > -- Hal From bugzilla-daemon at openib.org Sun Jun 11 05:55:54 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sun, 11 Jun 2006 05:55:54 -0700 (PDT) Subject: [openib-general] [Bug 131] New: working with huge pages may crash the kernel on Suse10 Message-ID: <20060611125554.9656E2287AC@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=131 Summary: working with huge pages may crash the kernel on Suse10 Product: OpenFabrics Linux Version: 1.0rc6 Platform: X86-64 OS/Version: Other Status: NEW Severity: normal Priority: P2 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: dotanb at mellanox.co.il ************************************************************* Host Architecture : x86_64 Linux Distribution: SUSE LINUX 10.0 (X86-64) OSS VERSION = 10.0 Kernel Version : 2.6.13-15-smp Memory size : 5099744 kB Driver Version : OFED-1.0-rc6-post1 HCA ID(s) : mthca0 HCA model(s) : 25218 FW version(s) : 5.1.915 Board(s) : MT_0200000001 ************************************************************* working with huge pages may cause a kernel crash in sus10: kernel 2.6.13-15-smp. everything was fine when we used kenels 2.6.9, 2.6.16 . 
here is the back trace from the /var/log/messages: Jun 9 15:15:03 sw030 kernel: general protection fault: 0000 [1] SMP Jun 9 15:15:03 sw030 kernel: CPU 1 Jun 9 15:15:03 sw030 kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm ib_addr ib_cm ib_local_sa findex ib_ipoib ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core memtrack mst_pciconf mst_pci hfsplus vfat fat subfs freq_table autofs4 edd ipv6 button battery ac af_packet floppy e1000 i2c_i801 i2c_core generic ide_core ehci_hcd hw_random uhci_hcd usbcore shpchp pci_hotplug parport_pc lp parport dm_mod ext3 jbd fan thermal processor aic79xx scsi_transport_spi sg sr_mod cdrom ata_piix libata sd_mod scsi_mod Jun 9 15:15:03 sw030 kernel: Pid: 1822, comm: mr_test Tainted: G U 2.6.13-15-smp Jun 9 15:15:03 sw030 kernel: RIP: 0010:[] {set_page_dirty+34} Jun 9 15:15:03 sw030 kernel: RSP: 0018:ffff81007c5a9e20 EFLAGS: 00010286 Jun 9 15:15:03 sw030 kernel: RAX: 803d9290c7c7485b RBX: 0000000000000001 RCX: ffff8100016cf000 Jun 9 15:15:03 sw030 kernel: RDX: ffffffff80183550 RSI: ffff8100016cf000 RDI: ffff8100016cf038 Jun 9 15:15:03 sw030 kernel: RBP: ffff8100016cf038 R08: 0000000000001000 R09: ffff810051568cd8 Jun 9 15:15:03 sw030 kernel: R10: 000000000000003f R11: ffffffff801dd920 R12: 0000000000000001 Jun 9 15:15:03 sw030 kernel: R13: ffff810064415ca8 R14: ffff81000dc86000 R15: 0000000000000001 Jun 9 15:15:03 sw030 kernel: FS: 00002aaaab21c0a0(0000) GS:ffffffff8050e880(0000) knlGS:0000000000000000 Jun 9 15:15:03 sw030 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 9 15:15:03 sw030 kernel: CR2: 0000000000603018 CR3: 000000003d2df000 CR4: 00000000000006e0 Jun 9 15:15:03 sw030 kernel: Process mr_test (pid: 1822, threadinfo ffff81007c5a8000, task ffff81007bc743f0) Jun 9 15:15:03 sw030 kernel: Stack: ffffffff8016ce49 ffff810072047a00 0000000000000001 ffff81005259f000 Jun 9 15:15:03 sw030 kernel: ffffffff882e5c3a ffff810072047a00 ffff8100585fe000 ffff810064415cd0 Jun 9 15:15:03 sw030 kernel: ffff810064415ca8 
ffff810064415c80 Jun 9 15:15:03 sw030 kernel: Call Trace:{set_page_dirty_lock+41} {:ib_uverbs:__ib_umem_release+122} Jun 9 15:15:03 sw030 kernel: {:ib_uverbs:ib_umem_release+14} {:ib_uverbs:ib_uverbs_dereg_mr+245} Jun 9 15:15:03 sw030 kernel: {tty_write+578} {:ib_uverbs:ib_uverbs_write+158} Jun 9 15:15:03 sw030 kernel: {vfs_write+234} {sys_write+83} Jun 9 15:15:03 sw030 kernel: {system_call+126} Jun 9 15:15:03 sw030 kernel: Jun 9 15:15:03 sw030 kernel: Code: 48 8b 40 20 48 85 c0 74 06 49 89 c3 41 ff e3 e9 4a 17 02 00 Jun 9 15:15:03 sw030 kernel: RIP {set_page_dirty+34} RSP ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tziporet at mellanox.co.il Sun Jun 11 07:31:34 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 11 Jun 2006 17:31:34 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <1149895236.27921.2.camel@pelerin.serpentine.com> References: <1149895236.27921.2.camel@pelerin.serpentine.com> Message-ID: <448C2946.5010707@mellanox.co.il> Bryan O'Sullivan wrote: > Due to unfortunate timing, the ipath driver in OFED 1.0-rc6 does not > work correctly. You can download an updated tarball from here, for > which the ipath driver works fine: > > http://openib.red-bean.com/OFED-1.0-rc6+ipath.tar.bz2 > > Alternatively, pull the necessary patches from SVN. > > > > __ Hi Bryan You have forgotten some of the patches in your tarball file, thus several OSes do not pass (e.g. RH EL4 U3).
/openib-1.0/patches/ > ls */ipath* 2.6.11_FC4/ipath_backport.patch 2.6.13/ipath_backport.patch 2.6.15/ipath_backport.patch 2.6.11/ipath_backport.patch 2.6.13_suse10_0_u/ipath_backport.patch 2.6.9/ipath_backport.patch 2.6.12/ipath_backport.patch 2.6.14/ipath_backport.patch fixes/ipath_rollup.patch /openib-1.0/patches/ > ls 2.6.11/ 2.6.12/ 2.6.13_suse10_0_u/ 2.6.15/ 2.6.16_sles10/ 2.6.9/ dapl/ memtrack/ 2.6.11_FC4/ 2.6.13/ 2.6.14/ 2.6.16/ 2.6.17/ 2.6.9_U3/ fixes/ I took the liberty of copying these patches to svn since I noticed that the ipath backport patches are the same for all OSes & kernels. Please take a look and make sure the ipath driver compiles & loads on all supported systems (including ia64, PPC64 etc.) Tziporet From mamidala at cse.ohio-state.edu Sun Jun 11 09:51:02 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Sun, 11 Jun 2006 12:51:02 -0400 (EDT) Subject: [openib-general] [PATCH 0/5] multicast abstraction In-Reply-To: <4489F43D.60502@ichips.intel.com> Message-ID: Hi Sean, I am planning to use RDMA CM for multicast functionality. It would be great if you could point me to a simple multicast test program using RDMA CM. Thanks, Amith On Fri, 9 Jun 2006, Sean Hefty wrote: > Sean Hefty wrote: > > This patch series enhances support for joining and leaving multicast groups, > > providing the following functionality: > > > > 1. Users identify a multicast group by a multicast IP address. > > 2. A user binds to a local RDMA device based on resolving the IP address. > > 3. A new multicast group is created. The parameters for the multicast group are > > obtained based on the ipoib broadcast group, and the MGID is derived using the > > same algorithm as ipoib, except with a different signature. > > 4. Any QP associated with the join is attached to the group once the join > > operation completes. > > 5. A QP may join multiple groups. > > I forgot to mention that this patch series is dependent on adding UD QP support > to the RDMA CM.
> > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Sun Jun 11 10:42:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 11 Jun 2006 20:42:41 +0300 Subject: [openib-general] bug report: mad.c: ib_req_notify_cq called without polling cq Message-ID: <20060611174241.GA2993@mellanox.co.il> Hello, Sean! mad.c calls ib_req_notify_cq on a hotplug event in ib_mad_port_start, after QPs are attached to a CQ. Since this function does not poll the CQ, if a sufficient number of MADs arrive at the QP before ib_req_notify_cq is called, the RQ might become empty and no completion events will ever be generated. Please comment. -- MST From mst at mellanox.co.il Sun Jun 11 10:50:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 11 Jun 2006 20:50:06 +0300 Subject: [openib-general] [PATCH] mad: prevent duplicate RMPP sessions on responder side In-Reply-To: <4473371C.6040504@ichips.intel.com> References: <200605231459.46326.jackm@mellanox.co.il> <4473371C.6040504@ichips.intel.com> Message-ID: <20060611175006.GB2993@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] mad: prevent duplicate RMPP sessions on responder side > > Jack Morgenstein wrote: > >Prevent opening multiple RMPP MAD transaction sessions at responder side > >with the same TID, GID/LID, class. > > > >Could happen if RMPP requests are retried while response is in progress. > > My preference for handling this is to detect and discard duplicate > requests, and verify that response MADs match a request when being sent. > See the mail thread starting at: > > http://openib.org/pipermail/openib-general/2006-April/020703.html > > This will also help us add in support for DS RMPP. > > For kernel clients, I anticipate that this sort of change is fairly small.
> Userspace support requires a bit more work, especially if we don't want to > change the ABI. Sean, is anyone looking at this? If not, given that Jack's approach does not touch ABI or API, might it make sense to merge Jack's patch after all and use that as a starting point? With the current code in 2.6.17, large RMPPs often get aborted because of the duplicate problem. On the other hand, I'm not aware of users for DS RMPP. -- MST From pasha at mellanox.co.il Sun Jun 11 12:15:41 2006 From: pasha at mellanox.co.il (Pavel Shamis (Pasha)) Date: Sun, 11 Jun 2006 22:15:41 +0300 Subject: [openib-general] [openfabrics-ewg] RE: OFED-1.0-rc6 is available In-Reply-To: References: Message-ID: <448C6BDD.6030007@mellanox.co.il> We also did performance checks on different platforms, and the default MTU was changed to 1K (2K is more optimal for DDR platforms). Thank you for pointing out the issue. Pavel Shamis (Pasha) Scott Weitzenkamp (sweitzen) wrote: > The MTU change undos the changes for bug 81, so I have reopened bug 81 > (http://openib.org/bugzilla/show_bug.cgi?id=81). > > With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E > osu_bibw performance is bad. I've enclosed some performance data, look > at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. > > Are there other benchmarks driving the changes in rc6 (and rc4)? > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > > > *OSU MPI:* > > · Added mpi_alltoall fine tuning parameters > > · Added default configuration/documentation file > $MPIHOME/etc/mvapich.conf > > · Added shell configuration files $MPIHOME/etc/mvapich.csh , > $MPIHOME/etc/mvapich.csh > > · Default MTU was changed back to 2K for InfiniHost III Ex > and InfiniHost III Lx HCAs.
For InfiniHost card recommended value is: > VIADEV_DEFAULT_MTU=MTU1024 > > > ------------------------------------------------------------------------ > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg From bugzilla-daemon at openib.org Sun Jun 11 13:57:05 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sun, 11 Jun 2006 13:57:05 -0700 (PDT) Subject: [openib-general] [Bug 1] kernel prints out error message for each ib interface Message-ID: <20060611205705.0D8A52287AC@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=1 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #3 from sweitzen at cisco.com 2006-06-11 13:57 ------- Message still there in RHEL4 U3. Close bug because it is benign. [root at svbu-qa1850-1 ~]# uname -a Linux svbu-qa1850-1 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x8 6_64 x86_64 GNU/Linux [root at svbu-qa1850-1 ~]# dmesg | grep divert divert: not allocating divert_blk for non-ethernet device lo divert: allocating divert_blk for eth0 divert: allocating divert_blk for eth1 divert: not allocating divert_blk for non-ethernet device ib0 divert: not allocating divert_blk for non-ethernet device sit0 ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From rdreier at cisco.com Sun Jun 11 17:02:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 Jun 2006 17:02:45 -0700 Subject: [openib-general] [PATCH 2/2] ipoib: handle multicast group reset notification In-Reply-To: <4489BF48.8010804@ichips.intel.com> (Sean Hefty's message of "Fri, 09 Jun 2006 11:34:48 -0700") References: <4489BF48.8010804@ichips.intel.com> Message-ID: Sean> Any issue committing this? No, looks fine. - R. 
From rdreier at cisco.com Sun Jun 11 17:06:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 Jun 2006 17:06:12 -0700 Subject: [openib-general] [PATCH 5/5] ucma: export multicast support to userspace In-Reply-To: <001201c68c12$5adc3a00$ff0da8c0@amr.corp.intel.com> (Sean Hefty's message of "Fri, 9 Jun 2006 15:16:28 -0700") References: <001201c68c12$5adc3a00$ff0da8c0@amr.corp.intel.com> Message-ID: > @@ -58,6 +58,8 @@ enum { > RDMA_USER_CM_CMD_GET_EVENT, > RDMA_USER_CM_CMD_GET_OPTION, > RDMA_USER_CM_CMD_SET_OPTION, > + RDMA_USER_CM_CMD_JOIN_MCAST, > + RDMA_USER_CM_CMD_LEAVE_MCAST, > RDMA_USER_CM_CMD_GET_DST_ATTR > }; I think this changes the exported ABI by changing the value of RDMA_USER_CM_CMD_GET_DST_ATTR, right? - R. From greg.lindahl at qlogic.com Sun Jun 11 17:40:29 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Sun, 11 Jun 2006 17:40:29 -0700 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> Message-ID: <20060612004029.GA16596@greglaptop.hsd1.ca.comcast.net> So this is a _critical_ bugfix? > Auuuch it is there! > My mistake. So please apply the patch to the OFED 1.0 branch too. > BTW: Does osmtest -f exercise this query on OFED 1.0? > > > Huh? What's > > > https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/o > sm/opens > > m/osm_sa_guidinfo_record.c > > > > -- Hal > > > > > > > > Eitan Zahavi > > > Senior Engineering Director, Software Architect > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O.
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Sunday, June 11, 2006 12:22 AM > > > > To: Eitan Zahavi > > > > Cc: OPENIB > > > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable > query > > > > > > > > Eitan, > > > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > Hi Hal > > > > > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > > > GUIDInfoRecord query: If you had a physical port with zero > guid_cap > > > > > the code would loop on blocks 0..255 instead of trying the next > > > port. > > > > > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > > > ports. > > > > > > > > > > This patch resolves this new problem. osmtest passes on some > > > arbitrary > > > > > networks. > > > > > > > > > > Eitan > > > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > > > Thanks. Applied to trunk only. > > > > > > > > Let me know if it also should be applied to 1.0. > > > > > > > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Sun Jun 11 17:46:35 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jun 2006 20:46:35 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <20060612004029.GA16596@greglaptop.hsd1.ca.comcast.net> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> <20060612004029.GA16596@greglaptop.hsd1.ca.comcast.net> Message-ID: <1150073195.570.63586.camel@hal.voltaire.com> On Sun, 2006-06-11 at 20:40, Greg Lindahl wrote: > So this is a _critical_ bugfix ? Depends on one's definition. Anyhow, it's been applied to 1.0. -- Hal > > > Auuuch it is there! > > My mistake. 
Sp please apply the patch to the OFED 1.0 branch too. > > BTW: Is the osmtest -f a excersizes this query on the OFED 1.0 ? > > > > > Huh ? What's > > > > > https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/o > > sm/opens > > > m/osm_sa_guidinfo_record.c > > > > > > -- Hal > > > > > > > > > > > Eitan Zahavi > > > > Senior Engineering Director, Software Architect > > > > Mellanox Technologies LTD > > > > Tel:+972-4-9097208 > > > > Fax:+972-4-9593245 > > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > -----Original Message----- > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > > Sent: Sunday, June 11, 2006 12:22 AM > > > > > To: Eitan Zahavi > > > > > Cc: OPENIB > > > > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable > > query > > > > > > > > > > Eitan, > > > > > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > > Hi Hal > > > > > > > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > > > > GUIDInfoRecord query: If you had a physical port with zero > > guid_cap > > > > > > the code would loop on blocks 0..255 instead of trying the next > > > > port. > > > > > > > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > > > > ports. > > > > > > > > > > > > This patch resolves this new problem. osmtest passes on some > > > > arbitrary > > > > > > networks. > > > > > > > > > > > > Eitan > > > > > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > > > > > Thanks. Applied to trunk only. > > > > > > > > > > Let me know if it also should be applied to 1.0. 
> > > > > > > > > > -- Hal > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Sun Jun 11 20:54:54 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 20:54:54 -0700 Subject: [openib-general] [PATCH 5/5] ucma: export multicast support to userspace In-Reply-To: Message-ID: <000001c68dd3$f70d55e0$68fc070a@amr.corp.intel.com> > > @@ -58,6 +58,8 @@ enum { > > RDMA_USER_CM_CMD_GET_EVENT, > > RDMA_USER_CM_CMD_GET_OPTION, > > RDMA_USER_CM_CMD_SET_OPTION, > > + RDMA_USER_CM_CMD_JOIN_MCAST, > > + RDMA_USER_CM_CMD_LEAVE_MCAST, > > RDMA_USER_CM_CMD_GET_DST_ATTR > > }; > >I think this changes the exported ABI by changing the value of >RDMA_USER_CM_CMD_GET_DST_ATTR, right? Good catch. Originally the last 3 commands were only for multicast support, but I went back and added GET_DST_ATTR as part of the UD QP patch. I will swap those around, so that the check-in doesn't change the values. My intent is to commit the UD QP patch, followed immediately by the multicast patch, and only update the ABI once. - Sean From sean.hefty at intel.com Sun Jun 11 20:57:53 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 20:57:53 -0700 Subject: [openib-general] [PATCH] mad: prevent duplicate RMPP sessions on responder side In-Reply-To: <20060611175006.GB2993@mellanox.co.il> Message-ID: <000101c68dd4$61724620$68fc070a@amr.corp.intel.com> >Sean, is anyone looking at this?
If not, given that Jack's approach does not >touch ABI or API, might it make sense to merge Jack's patch after all and use >that as a starting point? With current code in 2.6.17 large RMPPs often get >aborted because of the duplicate-session problem. On the other hand, I'm not >aware of users for DS RMPP. Yes - I am looking into this. I have it in my plan to fix this after multicast support for userspace. (Which is now done, pending review and updates.) - Sean From sean.hefty at intel.com Sun Jun 11 21:02:05 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 21:02:05 -0700 Subject: [openib-general] [PATCH 0/5] multicast abstraction In-Reply-To: Message-ID: <000201c68dd4$f7edd560$68fc070a@amr.corp.intel.com> >I am planning to use RDMA CM for multicast functionality. It would be >great if you could point me to a simple multicast test program using >RDMA CM? There is a userspace test program (mckey) that will be available, but has not been posted yet. (A kernel test program would look fairly similar.) I will post that tomorrow, along with the changes to the librdmacm. Let me know if you have any questions. - Sean From sean.hefty at intel.com Sun Jun 11 21:04:59 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 21:04:59 -0700 Subject: [openib-general] [PATCH 1/5] ib_addr: retrieve MGID from device address In-Reply-To: <20060611105210.GA7359@mellanox.co.il> Message-ID: <000301c68dd5$5f569ca0$68fc070a@amr.corp.intel.com> >dev_addr->broadcast + 4/dev_addr->src_dev_addr + 4 may not be naturally >aligned, >so casting this pointer to structure type may cause the compiler to generate >incorrect code. Thanks - I'll update this.
- Sean From sean.hefty at intel.com Sun Jun 11 21:31:46 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 21:31:46 -0700 Subject: [openib-general] bug report: mad.c: ib_req_notify_cq called without polling cq In-Reply-To: <20060611174241.GA2993@mellanox.co.il> Message-ID: <000401c68dd9$1d9a29e0$68fc070a@amr.corp.intel.com> >mad.c calls ib_req_notify_cq on hotplug event in ib_mad_port_start, after QPs >are attached to a CQ. Since this function does not poll the CQ, if sufficient >number of MADs arrive at the QP before ib_req_notify_cq is called, RQ might get >empty and no completion events will ever be generated. This is arming the CQ _before_ we post MADs to the receive queue of the QP. I don't think that there's a race here. - Sean From mst at mellanox.co.il Sun Jun 11 22:21:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 08:21:49 +0300 Subject: [openib-general] bug report: mad.c: ib_req_notify_cq called without polling cq In-Reply-To: <000401c68dd9$1d9a29e0$68fc070a@amr.corp.intel.com> References: <000401c68dd9$1d9a29e0$68fc070a@amr.corp.intel.com> Message-ID: <20060612052149.GA3390@mellanox.co.il> Quoting r. Sean Hefty : > This is arming the CQ _before_ we post MADs to the receive queue of the QP. I > don't think that there's a race here. Good point, thanks. -- MST From mst at mellanox.co.il Mon Jun 12 05:16:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 15:16:35 +0300 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround Message-ID: <20060612121635.GX7359@mellanox.co.il> Roland, please consider the following for 2.6.17. --- Memfree firmware is in rare cases reporting WQE index == -1 in receive completion with error instead of (rq size - 1). Here is a patch to avoid kernel crash and report a correct WR id in this case. 
Since reporting a wrong WR id has severe consequences for ULPs, make the test as restrictive as possible, and report an error if we see an unexpected value. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- openib/drivers/infiniband/hw/mthca/mthca_cq.c (revision 7837) +++ openib/drivers/infiniband/hw/mthca/mthca_cq.c (working copy) @@ -542,6 +542,22 @@ } else { wq = &(*cur_qp)->rq; wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + /* WQE index == -1 might be reported by + Sinai FW 1.0.800, Arbel FW 5.1.400 and should be fixed + in later revisions. */ + if (unlikely(wqe_index >= (*cur_qp)->rq.max)) { + if (unlikely(is_error) && + unlikely(wqe_index == 0xffffffff >> wq->wqe_shift) && + mthca_is_memfree(dev)) + wqe_index = wq->max - 1; + else { + mthca_err(dev, "Corrupted RQ CQE. " + "CQ 0x%x QP 0x%x idx 0x%x > 0x%x\n", + cq->cqn, entry->qp_num, wqe_index, + wq->max); + return -EINVAL; + } + } entry->wr_id = (*cur_qp)->wrid[wqe_index]; } -- MST From mst at mellanox.co.il Mon Jun 12 05:16:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 15:16:47 +0300 Subject: [openib-general] [PATCH] libmthca: work around for cqe with error Message-ID: <20060612121647.GY7359@mellanox.co.il> Same patch as posted earlier for kernel. --- libmthca: completion with error for memfree might get WQE index == -1. Signed-off-by: Michael S. Tsirkin Index: openib/src/userspace/libmthca/src/cq.c =================================================================== --- openib/src/userspace/libmthca/src/cq.c (revision 7890) +++ openib/src/userspace/libmthca/src/cq.c (working copy) @@ -349,6 +349,22 @@ } else { wq = &(*cur_qp)->rq; wqe_index = ntohl(cqe->wqe) >> wq->wqe_shift; + /* WQE index == -1 might be reported by + Sinai FW 1.0.800, Arbel FW 5.1.400 and should be fixed + in later revisions. 
*/ + if (wqe_index >= (*cur_qp)->rq.max) { + if (is_error && + (wqe_index == 0xffffffff >> wq->wqe_shift) && + mthca_is_memfree(cq->ibv_cq.context)) + wqe_index = wq->max - 1; + else { + printf("Corrupted RQ CQE. " + "CQ 0x%x QP 0x%x idx 0x%x > 0x%x\n", + cq->cqn, wc->qp_num, wqe_index, + wq->max); + return -1; + } + } wc->wr_id = (*cur_qp)->wrid[wqe_index]; } -- MST From mst at mellanox.co.il Mon Jun 12 05:48:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 15:48:33 +0300 Subject: [openib-general] potential bug: ipoib freeing ah before completion Message-ID: <20060612124833.GA19452@mellanox.co.il> Hello, Roland! The following was noted by Eli Cohen: It seems that ipoib_flush_paths can be called while completions are still outstanding on IPoIB QP (e.g. from ipoib_ib_dev_flush). If this happens, an address handle might get freed while a work request is still outstanding for it. This can trigger a local QP error, and IPoIB will stop working, until QP is reset. Please comment. -- MST From mst at mellanox.co.il Mon Jun 12 06:57:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 16:57:51 +0300 Subject: [openib-general] [PATCH] mthca: restore missing registers Message-ID: <20060612135751.GB19518@mellanox.co.il> Roland, please consider the following for 2.6.17. --- mthca misses restoring the following PCI-X/PCI-Express registers at reset: PCI-X device: PCI-X command register PCI-X bridge: upstream and downstream split transaction registers PCI-Express : PCI-Express device control and link control registers This causes instability and/or bad performance on systems where one of these registers is set to a non-default value by BIOS. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.16/drivers/infiniband/hw/mthca/mthca_reset.c =================================================================== --- linux-2.6.16.orig/drivers/infiniband/hw/mthca/mthca_reset.c 2006-04-26 15:04:26.000000000 +0300 +++ linux-2.6.16/drivers/infiniband/hw/mthca/mthca_reset.c 2006-06-11 21:52:44.000000000 +0300 @@ -48,6 +48,12 @@ int mthca_reset(struct mthca_dev *mdev) u32 *hca_header = NULL; u32 *bridge_header = NULL; struct pci_dev *bridge = NULL; + int bridge_pcix_cap = 0; + int hca_pcie_cap = 0; + int hca_pcix_cap = 0; + + u16 devctl; + u16 linkctl; #define MTHCA_RESET_OFFSET 0xf0010 #define MTHCA_RESET_VALUE swab32(1) @@ -109,6 +115,9 @@ int mthca_reset(struct mthca_dev *mdev) } } + hca_pcix_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + hca_pcie_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (bridge) { bridge_header = kmalloc(256, GFP_KERNEL); if (!bridge_header) { @@ -128,6 +137,13 @@ int mthca_reset(struct mthca_dev *mdev) goto out; } } + bridge_pcix_cap = pci_find_capability(bridge, PCI_CAP_ID_PCIX); + if (!bridge_pcix_cap) { + err = -ENODEV; + mthca_err(mdev, "Couldn't locate HCA bridge " + "PCI-X capability, aborting.\n"); + goto out; + } } /* actually hit reset */ @@ -177,6 +193,20 @@ int mthca_reset(struct mthca_dev *mdev) good: /* Now restore the PCI headers */ if (bridge) { + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0x8, + bridge_header[(bridge_pcix_cap + 0x8)/ 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Upstream " + "split transaction control, aborting.\n"); + goto out; + } + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0xc, + bridge_header[(bridge_pcix_cap + 0xc)/ 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Downstream " + "split transaction control, aborting.\n"); + goto out; + } /* * Bridge control register is at 0x3e, so we'll * naturally restore it last in this loop. 
@@ -202,6 +232,35 @@ good: } } + if (hca_pcix_cap) { + if (pci_write_config_dword(mdev->pdev, hca_pcix_cap, + hca_header[hca_pcix_cap / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-X " + "command register, aborting.\n"); + goto out; + } + } + + if (hca_pcie_cap) { + devctl = hca_header[(hca_pcie_cap + 0x8)/ 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + 0x8, + devctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-X " + "Device Control register, aborting.\n"); + goto out; + } + linkctl = hca_header[(hca_pcie_cap + 0x10)/ 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + 0x10, + linkctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-Express " + "Link control register, aborting.\n"); + goto out; + } + } + for (i = 0; i < 16; ++i) { if (i * 4 == PCI_COMMAND) continue; -- MST From eitan at mellanox.co.il Mon Jun 12 06:59:14 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 12 Jun 2006 16:59:14 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy (and other fixes) Message-ID: <86ejxulbkd.fsf@mtl066.yok.mtl.com> Hi Hal As I started to test the partition manager code (using ibmgtsim pkey test), I realized the implementation does not really enforces the partition policy on the given fabric. This patch fixes that. It was verified using the simulation test. Several other corner cases were fixed too. Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 7867) +++ include/opensm/osm_port.h (working copy) @@ -586,6 +586,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. 
+* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. + */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. +* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 7867) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,8 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +106,13 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used during pkey mgr algorithm only +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +123,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This objects stores temporary information on pkeys their target block and index +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct _osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous 
table extracted from the device +* +* index +* The index of the pky within the block +* +* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * osm_pkey_tbl_construct @@ -263,6 +305,41 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* set the block index and pkey index the given +* pkey is found in. return 1 if cound not find +* it, 0 if OK +* +* SYNOPSIS +*/ +int +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. +* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set Index: opensm/osm_prtn.c =================================================================== --- opensm/osm_prtn.c (revision 7904) +++ opensm/osm_prtn.c (working copy) @@ -140,6 +140,12 @@ ib_api_status_t osm_prtn_add_port(osm_lo p_tbl = (full == TRUE) ? &p->full_guid_tbl : &p->part_guid_tbl ; + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "Added port 0x%" PRIx64 " to " + "partition \'%s\' (0x%04x) As %s member\n", + cl_ntoh64(guid), p->name, cl_ntoh16(p->pkey), + full ? 
"full" : "partial" ); + if (cl_map_insert(p_tbl, guid, p_physp) == NULL) return IB_INSUFFICIENT_MEMORY; Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 7904) +++ opensm/osm_pkey.c (working copy) @@ -100,6 +100,8 @@ int osm_pkey_tbl_init( cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + return(IB_SUCCESS); } @@ -118,14 +120,28 @@ void osm_pkey_tbl_sync_new_blocks( p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); if ( b < new_blocks ) p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); - else { + else + { p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; - memset(p_new_block, 0, sizeof(*p_new_block)); cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); + + memset(p_new_block, 0, sizeof(*p_new_block)); + } +} + +/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } @@ -202,6 +218,38 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ +int +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_idx != NULL ); + + 
num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return 0; + } + } + return 1; +} + +/********************************************************************** + **********************************************************************/ static boolean_t __osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -321,7 +369,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -329,7 +378,8 @@ osm_port_share_pkey( p_physp1 = osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 7904) +++ opensm/osm_pkey_mgr.c (working copy) @@ -62,6 +62,138 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = 
osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. + */ +static void pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + if (! p_pkey_tbl) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0501: " + "No pkey table found for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! 
p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey || (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) )) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index)) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_VERBOSE, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + **********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = full ? 
+ &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -131,80 +263,153 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; - uint16_t block_index; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block, *p_old_block; + osm_pkey_tbl_t *p_pkey_tbl; + uint16_t block_index = 0; + uint16_t last_free_block_index = 0; + uint16_t last_free_entry_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; + + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); + cl_map_remove_all( &p_pkey_tbl->keys ); + p_pkey_tbl->used_blocks = 0; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + /* process every pending pkey in order - first must be "updated" last are "new" */ + p_pending = (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + if (p_pending->is_new == FALSE) { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + block = 
osm_pkey_tbl_new_block_get( p_pkey_tbl, p_pending->block ); + if (block == NULL) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to get block %d for node 0x%016" PRIx64 " port %u\n", + p_pending->block, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + p_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, p_pending->block ); + CL_ASSERT( p_old_block != NULL ); + cl_map_insert( &p_pkey_tbl->keys, + ib_pkey_get_base(p_pending->pkey), + &(p_old_block->pkey_entry[p_pending->index])); + block->pkey_entry[p_pending->index] = p_pending->pkey; + if (p_pkey_tbl->used_blocks < p_pending->index) + p_pending->index = p_pending->index; } } + else + { + /* need either an empty entry or next block */ + block = osm_pkey_tbl_new_block_get( p_pkey_tbl, last_free_block_index ); + found = FALSE; + while ( !found && (last_free_block_index < max_num_of_blocks)) + { + if ( block->pkey_entry[last_free_entry_index] == 0) + found = TRUE; + else + { + if (last_free_entry_index == IB_NUM_PKEY_ELEMENTS_IN_BLOCK) + { + last_free_entry_index = 0; + last_free_block_index++; + block = osm_pkey_tbl_new_block_get( p_pkey_tbl, last_free_block_index ); + if ((!block) && (last_free_block_index < max_num_of_blocks)) + { + block = (ib_pkey_table_t *)malloc(sizeof(*block)); + if (!block) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0513: " + "failed to allocate new block %d for node 0x%016" PRIx64 " port %u\n", + last_free_block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); + continue; + } + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, last_free_block_index, block); } - else if ( *p_orig_pkey != pkey ) + } + else { - for ( block_index = 0; 
block_index < num_of_blocks; block_index++ ) + last_free_entry_index++; + } + } + } + + if ( !found ) { - /* we need real block (not just new_block) in order - * to resolve block/pkey indices */ - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0505: " + "failed to empty space for new pkey 0x%04x for node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + p_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, last_free_entry_index); + CL_ASSERT( p_old_block != NULL ); + block->pkey_entry[last_free_entry_index] = p_pending->pkey; + cl_map_insert( &p_pkey_tbl->keys, + ib_pkey_get_base(p_pending->pkey), + &(p_old_block->pkey_entry[last_free_entry_index])); + if (p_pkey_tbl->used_blocks < last_free_entry_index) + p_pending->index = last_free_entry_index; } } + free( p_pending ); + p_pending = (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + /* now look for changes and store */ + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); + + if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) + continue; + + status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + 
"pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +422,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( peer ) ) return FALSE; p_node = osm_physp_get_node_ptr( peer ); @@ -245,7 +452,7 @@ pkey_mgr_update_peer_port( if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,24 +462,36 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = 
osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + return FALSE; + } - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++ ) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) { + osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block); status = pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); if ( status == IB_SUCCESS ) ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,7 +501,7 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && + if ( (ret_val == TRUE) && osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) { osm_log( p_log, OSM_LOG_VERBOSE, @@ -298,82 +517,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = 
osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - -/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? 
- &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,7 +526,6 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; CL_ASSERT( p_osm ); @@ -394,22 +536,12 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = &p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -420,6 +552,7 @@ osm_pkey_mgr_process( pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,7 +561,7 @@ osm_pkey_mgr_process( p_next 
= cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && + if ( ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH ) && pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) From eli at mellanox.co.il Mon Jun 12 07:59:00 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 12 Jun 2006 17:59:00 +0300 Subject: [openib-general] potential bug: ipoib freeing ah before completion Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30249F8FF@mtlexch01.mtl.com> The issue was originally raised by Eitan Rabin. -----Original Message----- From: Michael S. Tsirkin Sent: Monday, June 12, 2006 3:49 PM To: openib-general at openib.org; Roland Dreier Cc: Eli Cohen Subject: potential bug: ipoib freeing ah before completion Hello, Roland! The following was noted by Eli Cohen: It seems that ipoib_flush_paths can be called while completions are still outstanding on IPoIB QP (e.g. from ipoib_ib_dev_flush). If this happens, an address handle might get freed while a work request is still outstanding for it. This can trigger a local QP error, and IPoIB will stop working, until QP is reset. Please comment. -- MST From jlentini at netapp.com Mon Jun 12 08:44:29 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 12 Jun 2006 11:44:29 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL openib_cma, cleanup reported CM error events, add TIMEOUT In-Reply-To: References: Message-ID: On Fri, 9 Jun 2006, Arlin Davis wrote: > James, > > I cleaned up the connection error events to report the proper events > during address resolution errors and timeouts. It was returning > incorrect DAT event codes. Looks good. I committed in revision 7931 with a few minor additions (see below). 
> Index: dapl_ib_cm.c > =================================================================== > --- dapl_ib_cm.c (revision 7839) > +++ dapl_ib_cm.c (working copy) > @@ -330,6 +330,8 @@ static void dapli_cm_active_cb(struct da > switch (event->event) { > case RDMA_CM_EVENT_UNREACHABLE: > case RDMA_CM_EVENT_CONNECT_ERROR: > + { > + ib_cm_events_t cm_event; > dapl_dbg_log( > DAPL_DBG_TYPE_WARN, > " dapli_cm_active_handler: CONN_ERR " > @@ -337,10 +339,15 @@ static void dapli_cm_active_cb(struct da > event->event, event->status, > (event->status == -110)?"TIMEOUT":"" ); > > - dapl_evd_connection_callback(conn, > - IB_CME_DESTINATION_UNREACHABLE, > - NULL, conn->ep); > + /* no device type specified so assume IB for now */ > + if (event->status == -110) /* IB timeout */ I changed -110 to -ETIMEDOUT > + cm_event = IB_CME_TIMEOUT; > + else > + cm_event = IB_CME_DESTINATION_UNREACHABLE; > + > + dapl_evd_connection_callback(conn, cm_event, NULL, conn->ep); > break; > + } > case RDMA_CM_EVENT_REJECTED: > { > ib_cm_events_t cm_event; > @@ -357,7 +364,6 @@ static void dapli_cm_active_cb(struct da > event->status); > > dapl_evd_connection_callback(conn, cm_event, NULL, conn->ep); > - > break; > } > case RDMA_CM_EVENT_ESTABLISHED: > @@ -1028,7 +1034,7 @@ int dapls_ib_private_data_size(IN DAPL_P > /* > * Map all socket CM event codes to the DAT equivelent. I corrected this comment. 
> */ > -#define DAPL_IB_EVENT_CNT 12 > +#define DAPL_IB_EVENT_CNT 13 > > static struct ib_cm_event_map > { > @@ -1058,7 +1064,9 @@ static struct ib_cm_event_map > /* 10 */ { IB_CME_LOCAL_FAILURE, > DAT_CONNECTION_EVENT_BROKEN}, > /* 11 */ { IB_CME_BROKEN, > - DAT_CONNECTION_EVENT_BROKEN} > + DAT_CONNECTION_EVENT_BROKEN}, > + /* 12 */ { IB_CME_TIMEOUT, > + DAT_CONNECTION_EVENT_TIMED_OUT}, > }; > > /* > @@ -1164,7 +1172,7 @@ void dapli_cma_event_cb(void) > case RDMA_CM_EVENT_ADDR_ERROR: > case RDMA_CM_EVENT_ROUTE_ERROR: > dapl_evd_connection_callback(conn, > - IB_CME_LOCAL_FAILURE, > + IB_CME_DESTINATION_UNREACHABLE, > NULL, conn->ep); > break; > case RDMA_CM_EVENT_DEVICE_REMOVAL: > Index: dapl_ib_util.h > =================================================================== > --- dapl_ib_util.h (revision 7839) > +++ dapl_ib_util.h (working copy) > @@ -86,7 +86,8 @@ typedef enum { > IB_CME_DESTINATION_UNREACHABLE, > IB_CME_TOO_MANY_CONNECTION_REQUESTS, > IB_CME_LOCAL_FAILURE, > - IB_CME_BROKEN > + IB_CME_BROKEN, > + IB_CME_TIMEOUT > } ib_cm_events_t; > > /* CQ notifications */ > From tom at opengridcomputing.com Mon Jun 12 09:05:49 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 12 Jun 2006 11:05:49 -0500 Subject: [openib-general] [PATCH v2 4/7] AMSO1100 Memory Management. In-Reply-To: <20060608011744.1a66e85a.akpm@osdl.org> References: <20060607200646.9259.24588.stgit@stevo-desktop> <20060607200655.9259.90768.stgit@stevo-desktop> <20060608011744.1a66e85a.akpm@osdl.org> Message-ID: <1150128349.22704.20.camel@trinity.ogc.int> On Thu, 2006-06-08 at 01:17 -0700, Andrew Morton wrote: > On Wed, 07 Jun 2006 15:06:55 -0500 > Steve Wise wrote: > > > > > +void c2_free(struct c2_alloc *alloc, u32 obj) > > +{ > > + spin_lock(&alloc->lock); > > + clear_bit(obj, alloc->table); > > + spin_unlock(&alloc->lock); > > +} > > The spinlock is unneeded here. Good point. > > > What does all the code in this file do, anyway? 
It looks totally generic
> (and hence inappropriate for drivers/infiniband/hw/amso1100/) and somewhat
> similar to idr trees, perhaps.
>

We mimicked the mthca driver. It may be code that should be replaced
with Linux core services for new drivers. We'll investigate.

> > +int c2_array_set(struct c2_array *array, int index, void *value)
> > +{
> > +	int p = (index * sizeof(void *)) >> PAGE_SHIFT;
> > +
> > +	/* Allocate with GFP_ATOMIC because we'll be called with locks held. */
> > +	if (!array->page_list[p].page)
> > +		array->page_list[p].page =
> > +			(void **) get_zeroed_page(GFP_ATOMIC);
> > +
> > +	if (!array->page_list[p].page)
> > +		return -ENOMEM;
>
> This _will_ happen under load.  What will the result of that be, in the
> context of this driver?

A higher level object allocation will fail. In this case, a kernel
application request will fail and the application must handle the error.

> This function is incorrectly designed - it should receive a gfp_t argument.
> Because you don't *know* that the caller will always hold a spinlock.  And
> GFP_KERNEL is far, far stronger than GFP_ATOMIC.

This service is allocating a page that the adapter will DMA 2B message
indices into.

> > +static int c2_alloc_mqsp_chunk(gfp_t gfp_mask, struct sp_chunk **head)
> > +{
> > +	int i;
> > +	struct sp_chunk *new_head;
> > +
> > +	new_head = (struct sp_chunk *) __get_free_page(gfp_mask | GFP_DMA);
>
> Why is __GFP_DMA in there?  Unless you've cornered the ISA bus infiniband
> market, it's likely to be wrong.
>

Flag confusion about what GFP_DMA means. We'll revisit this whole file ...
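Andrew's point, threading the allocation context through as a parameter rather than hardcoding GFP_ATOMIC, can be illustrated with a small userspace sketch. This is not the amso1100 code: calloc() stands in for get_zeroed_page(), the gfp parameter stands in for the kernel's gfp_t, and the structure names are only borrowed for illustration.

```c
#include <stdlib.h>
#include <stddef.h>

#define PAGE_SIZE      4096
#define PTRS_PER_PAGE  (PAGE_SIZE / sizeof(void *))

struct page_list { void **page; };

struct c2_array {
	struct page_list page_list[16];  /* lazily allocated pointer pages */
};

/* The gfp argument is the caller's choice of allocation context; in the
 * kernel this lets a sleepable path pass GFP_KERNEL and only lock-holding
 * callers pay the GFP_ATOMIC price.  Here it is accepted but unused, since
 * userspace calloc() has no such distinction. */
static int c2_array_set(struct c2_array *array, int index, void *value, int gfp)
{
	int p = index / PTRS_PER_PAGE;

	(void)gfp;
	if (!array->page_list[p].page)
		array->page_list[p].page = calloc(PTRS_PER_PAGE, sizeof(void *));
	if (!array->page_list[p].page)
		return -1;  /* -ENOMEM in the kernel */

	array->page_list[p].page[index % PTRS_PER_PAGE] = value;
	return 0;
}

static void *c2_array_get(struct c2_array *array, int index)
{
	int p = index / PTRS_PER_PAGE;

	return array->page_list[p].page ?
		array->page_list[p].page[index % PTRS_PER_PAGE] : NULL;
}
```

The design choice being debated is exactly this signature: once the flag is a parameter, only the call sites that genuinely hold a spinlock need the weaker atomic allocation.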
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

From rdreier at cisco.com  Mon Jun 12 09:04:02 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 12 Jun 2006 09:04:02 -0700
Subject: [openib-general] potential bug: ipoib freeing ah before completion
In-Reply-To: <20060612124833.GA19452@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 12 Jun 2006 15:48:33 +0300")
References: <20060612124833.GA19452@mellanox.co.il>
Message-ID: 

    Michael> It seems that ipoib_flush_paths can be called while
    Michael> completions are still outstanding on IPoIB QP (e.g. from
    Michael> ipoib_ib_dev_flush).  If this happens, an address handle
    Michael> might get freed while a work request is still outstanding
    Michael> for it.  This can trigger a local QP error, and IPoIB
    Michael> will stop working, until QP is reset.

So what if path_free is called early?  The address handle shouldn't
get freed until tx_tail is past ah->last_send, so all associated work
requests are complete.  Am I missing something?

Have you actually seen this happen?

From rdreier at cisco.com  Mon Jun 12 09:06:43 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 12 Jun 2006 09:06:43 -0700
Subject: [openib-general] [PATCH] mthca: restore missing registers
In-Reply-To: <20060612135751.GB19518@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 12 Jun 2006 16:57:51 +0300")
References: <20060612135751.GB19518@mellanox.co.il>
Message-ID: 

    Michael> mthca misses restoring the following PCI-X/PCI-Express
    Michael> registers at reset:
    Michael> PCI-X device: PCI-X command register
    Michael> PCI-X bridge: upstream and downstream split transaction registers
    Michael> PCI-Express: PCI-Express device control and link control registers

Would it be simpler to just restore the full 256-byte PCI headers instead?

 - R.
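The deferred-free scheme Roland describes, where each address handle records the sequence number of its last posted send and the real free happens only once the completion tail has advanced past that point, can be sketched in self-contained C. This is purely illustrative, not the actual ipoib implementation; all names and fields here are made up for the sketch.

```c
#include <stddef.h>

struct ah {
	unsigned int last_send;  /* seq number of the last send posted with this AH */
	int dead;                /* consumer has "freed" the AH */
	int freed;               /* storage actually released */
};

static unsigned int tx_head;  /* seq of last posted send */
static unsigned int tx_tail;  /* seq of last completed send */

static void post_send(struct ah *ah)
{
	ah->last_send = ++tx_head;
}

static void free_ah(struct ah *ah)
{
	/* Only mark the handle; the real release is deferred to the
	 * completion path, so outstanding WRs can still reference it. */
	ah->dead = 1;
}

static void tx_completed(struct ah *ah, unsigned int completed_seq)
{
	tx_tail = completed_seq;
	/* Signed distance tolerates counter wraparound. */
	if (ah->dead && !ah->freed && (int)(tx_tail - ah->last_send) >= 0)
		ah->freed = 1;  /* safe: no WR can still reference this AH */
}
```

The invariant is the one Roland states: the handle cannot be released while tx_tail has not yet passed ah->last_send, because a work request posted with it may still be in flight.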
From mst at mellanox.co.il  Mon Jun 12 09:48:03 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 12 Jun 2006 19:48:03 +0300
Subject: [openib-general] potential bug: ipoib freeing ah before completion
In-Reply-To: 
References: 
Message-ID: <20060612164803.GD19518@mellanox.co.il>

Quoting r. Roland Dreier :
> The address handle shouldn't get freed until tx_tail is past ah->last_send, so
> all associated work requests are complete.  Am I missing something?

I think you are right - I forgot that ipoib_free_ah doesn't actually
free the ah. Looks like there's no issue.

> Have you actually seen this happen?

No.

-- 
MST

From mst at mellanox.co.il  Mon Jun 12 09:53:27 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 12 Jun 2006 19:53:27 +0300
Subject: [openib-general] [PATCH] mthca: restore missing registers
In-Reply-To: 
References: 
Message-ID: <20060612165327.GE19518@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] mthca: restore missing registers
>
> Michael> mthca misses restoring the following PCI-X/PCI-Express
> Michael> registers at reset:
> Michael> PCI-X device: PCI-X command register
> Michael> PCI-X bridge: upstream and downstream split transaction registers
> Michael> PCI-Express: PCI-Express device control and link control registers
>
> Would it be simpler to just restore the full 256-byte PCI headers instead?

You can't do that - there are lots of registers where reading a value
and then writing it back has side effects. Consider fields that are
write 1 to clear, or the VPD flag register. Further, there is no
guarantee what writing fields outside any capability list will do -
consider the Tavor 22/23 debugging registers as an example.
-- 
MST

From mshefty at ichips.intel.com  Mon Jun 12 10:12:15 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 12 Jun 2006 10:12:15 -0700
Subject: [openib-general] [PATCH 0/4] Add support for UD QPs
In-Reply-To: <44886D66.7000703@ichips.intel.com>
References: <44886747.4040004@ichips.intel.com> <44886D66.7000703@ichips.intel.com>
Message-ID: <448DA06F.4090802@ichips.intel.com>

To clarify the motivation more, the question to answer is: if we ignore
iWarp completely, does it still make sense to provide a higher-level
communication manager for IB?  I believe that it does, especially for
userspace applications.  This lets us leverage existing name services
and ipoib, and provides an interface that is simpler for most users.

A lot of existing applications and test programs exchange IB QP
information over TCP connections, including programs that communicate
using UD QPs.  If you compare the amount of work necessary to establish
communication over an RC QP versus a UD QP, the difference is basically
a single MAD, which isn't even needed.  So there's a fair amount of work
that can be shared by a higher level CM.

- Sean

From ardavis at ichips.intel.com  Mon Jun 12 10:30:16 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 12 Jun 2006 10:30:16 -0700
Subject: [openib-general] [Bug 126] RDMA_CM and UCM not loaded on boot
In-Reply-To: <20060611062313.10CDC2287AC@openib.ca.sandia.gov>
References: <20060611062313.10CDC2287AC@openib.ca.sandia.gov>
Message-ID: <448DA4A8.10209@ichips.intel.com>

bugzilla-daemon at openib.org wrote:

>http://openib.org/bugzilla/show_bug.cgi?id=126
>
>
>vlad at mellanox.co.il changed:
>
>           What    |Removed                     |Added
>----------------------------------------------------------------------------
>             Status|NEW                         |RESOLVED
>         Resolution|                            |WONTFIX
>
>
>
>------- Comment #1 from vlad at mellanox.co.il  2006-06-10 23:23 -------
>RDMA_CM and RDMA_UCM are not loaded by default.
In order to load them upon boot >edit /etc/infiniband/openib.conf file and set RDMA_CM_LOAD=yes and >RDMA_UCM_LOAD=yes: > ># Start HCA driver upon boot >ONBOOT=yes > ># Load UCM module >UCM_LOAD=no > ># Load RDMA_CM module >RDMA_CM_LOAD=no > ># Load RDMA_UCM module >RDMA_UCM_LOAD=no > > > Did the default openib.conf script get updated with: RDMA_CM_LOAD=yes RDMA_UCM_LOAD=yes -arlin -arlin From sean.hefty at intel.com Mon Jun 12 10:39:53 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 10:39:53 -0700 Subject: [openib-general] [PATCH 1/2] librdmacm: userspace support for multicast abstraction Message-ID: <000001c68e47$36caf250$ff0da8c0@amr.corp.intel.com> Add support to the userspace RDMA CM library for joining multicast group based on IP addressing. Signed-off-by: Sean Hefty --- diff -up svn3/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma_abi.h svn/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma_abi.h --- svn3/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma_abi.h 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma_abi.h 2006-06-12 10:16:44.598117880 -0700 @@ -60,7 +60,9 @@ enum { UCMA_CMD_GET_EVENT, UCMA_CMD_GET_OPTION, UCMA_CMD_SET_OPTION, - UCMA_CMD_GET_DST_ATTR + UCMA_CMD_GET_DST_ATTR, + UCMA_CMD_JOIN_MCAST, + UCMA_CMD_LEAVE_MCAST }; struct ucma_abi_cmd_hdr { @@ -178,6 +180,17 @@ struct ucma_abi_init_qp_attr { __u32 qp_state; }; +struct ucma_abi_join_mcast { + __u32 id; + struct sockaddr_in6 addr; + __u64 uid; +}; + +struct ucma_abi_leave_mcast { + __u32 id; + struct sockaddr_in6 addr; +}; + struct ucma_abi_dst_attr_resp { __u32 remote_qpn; __u32 remote_qkey; diff -up svn3/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma.h svn/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma.h --- svn3/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma.h 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma.h 
2006-06-06 12:26:21.000000000 -0700 @@ -52,6 +52,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_ESTABLISHED, RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, + RDMA_CM_EVENT_MULTICAST_JOIN, + RDMA_CM_EVENT_MULTICAST_ERROR }; enum rdma_port_space { @@ -99,6 +101,13 @@ struct rdma_cm_id { uint8_t port_num; }; +struct rdma_multicast_data { + void *context; + struct sockaddr addr; + uint8_t pad[sizeof(struct sockaddr_in6) - + sizeof(struct sockaddr)]; +}; + struct rdma_cm_event { struct rdma_cm_id *id; struct rdma_cm_id *listen_id; @@ -245,6 +254,24 @@ int rdma_reject(struct rdma_cm_id *id, c int rdma_disconnect(struct rdma_cm_id *id); /** + * rdma_join_multicast - Join the multicast group specified by the given + * address. + * @id: Communication identifier associated with the request. + * @addr: Multicast address identifying the group to join. + * @context: User-defined context associated with the join request. The + * context is returned to the user through the private_data field in + * the rdma_cm_event. + */ +int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, + void *context); + +/** + * rdma_leave_multicast - Leave the multicast group specified by the given + * address. + */ +int rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr); + +/** * rdma_get_cm_event - Retrieves the next pending communications event, * if no event is pending waits for an event. * @channel: Event channel to check for events. 
diff -up svn3/gen2/trunk/src/userspace/librdmacm/src/cma.c svn/gen2/trunk/src/userspace/librdmacm/src/cma.c --- svn3/gen2/trunk/src/userspace/librdmacm/src/cma.c 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/src/cma.c 2006-06-06 17:30:17.000000000 -0700 @@ -896,6 +896,66 @@ int rdma_disconnect(struct rdma_cm_id *i return 0; } +int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, + void *context) +{ + struct ucma_abi_join_mcast *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, addrlen; + + addrlen = ucma_addrlen(addr); + if (!addrlen) + return -EINVAL; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_JOIN_MCAST, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + memcpy(&cmd->addr, addr, addrlen); + cmd->uid = (uintptr_t) context; + + ret = write(id->channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? -ENODATA : ret; + + return 0; +} + +int rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) +{ + struct ucma_abi_leave_mcast *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, addrlen; + struct ibv_ah_attr ah_attr; + uint32_t qp_info; + + addrlen = ucma_addrlen(addr); + if (!addrlen) + return -EINVAL; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_LEAVE_MCAST, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + memcpy(&cmd->addr, addr, addrlen); + + if (id->qp) { + ret = rdma_get_dst_attr(id, addr, &ah_attr, &qp_info, &qp_info); + if (ret) + goto out; + + ret = ibv_detach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); + if (ret) + goto out; + } + + ret = write(id->channel->fd, msg, size); + if (ret != size) + ret = (ret > 0) ? 
-ENODATA : ret; +out: + return ret; +} + static void ucma_copy_event_from_kern(struct rdma_cm_event *dst, struct ucma_abi_event_resp *src) { @@ -1004,6 +1064,36 @@ static int ucma_process_establish(struct return ret; } +static void ucma_process_mcast(struct rdma_cm_id *id, struct rdma_cm_event *evt) +{ + struct ucma_abi_join_mcast kmc_data; + struct rdma_multicast_data *mc_data; + struct ibv_ah_attr ah_attr; + uint32_t qp_info; + + kmc_data = *(struct ucma_abi_join_mcast *) evt->private_data; + + mc_data = evt->private_data; + mc_data->context = (void *) (uintptr_t) kmc_data.uid; + memcpy(&mc_data->addr, &kmc_data.addr, + ucma_addrlen((struct sockaddr *) &kmc_data.addr)); + + if (evt->status || !id->qp) + return; + + evt->status = rdma_get_dst_attr(id, &mc_data->addr, &ah_attr, + &qp_info, &qp_info); + if (evt->status) + goto err; + + evt->status = ibv_attach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); + if (evt->status) + goto err; + return; +err: + evt->event = RDMA_CM_EVENT_MULTICAST_ERROR; +} + int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event) { @@ -1085,6 +1175,10 @@ retry: goto retry; } break; + case RDMA_CM_EVENT_MULTICAST_JOIN: + case RDMA_CM_EVENT_MULTICAST_ERROR: + ucma_process_mcast(&id_priv->id, evt); + break; default: break; } diff -up svn3/gen2/trunk/src/userspace/librdmacm/src/librdmacm.map svn/gen2/trunk/src/userspace/librdmacm/src/librdmacm.map --- svn3/gen2/trunk/src/userspace/librdmacm/src/librdmacm.map 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/src/librdmacm.map 2006-06-01 15:03:13.000000000 -0700 @@ -19,5 +19,7 @@ RDMACM_1.0 { rdma_get_option; rdma_set_option; rdma_get_dst_attr; + rdma_join_multicast; + rdma_leave_multicast; local: *; }; From sean.hefty at intel.com Mon Jun 12 10:43:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 10:43:38 -0700 Subject: [openib-general] [PATCH 2/2] librdmacm: add multicast test program to examples 
In-Reply-To: <000001c68e47$36caf250$ff0da8c0@amr.corp.intel.com> Message-ID: <000101c68e47$bc5dbd80$ff0da8c0@amr.corp.intel.com> Simple multicast test program. When run, the client creates a QP and joins it to a multicast group. It then either sends or receives messages on the group. Signed-off-by: Sean Hefty --- diff -up svn3/gen2/trunk/src/userspace/librdmacm/librdmacm.spec.in svn/gen2/trunk/src/userspace/librdmacm/librdmacm.spec.in --- svn3/gen2/trunk/src/userspace/librdmacm/librdmacm.spec.in 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/librdmacm.spec.in 2006-06-01 14:53:47.000000000 -0700 @@ -67,3 +67,4 @@ rm -rf $RPM_BUILD_ROOT %{_bindir}/rping %{_bindir}/ucmatose %{_bindir}/udaddy +%{_bindir}/mckey diff -up svn3/gen2/trunk/src/userspace/librdmacm/Makefile.am svn/gen2/trunk/src/userspace/librdmacm/Makefile.am --- svn3/gen2/trunk/src/userspace/librdmacm/Makefile.am 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/Makefile.am 2006-06-06 14:48:23.000000000 -0700 @@ -18,13 +18,15 @@ endif src_librdmacm_la_SOURCES = src/cma.c src_librdmacm_la_LDFLAGS = -avoid-version $(rdmacm_version_script) -bin_PROGRAMS = examples/ucmatose examples/rping examples/udaddy +bin_PROGRAMS = examples/ucmatose examples/rping examples/udaddy examples/mckey examples_ucmatose_SOURCES = examples/cmatose.c examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la examples_rping_SOURCES = examples/rping.c examples_rping_LDADD = $(top_builddir)/src/librdmacm.la examples_udaddy_SOURCES = examples/udaddy.c examples_udaddy_LDADD = $(top_builddir)/src/librdmacm.la +examples_mckey_SOURCES = examples/mckey.c +examples_mckey_LDADD = $(top_builddir)/src/librdmacm.la librdmacmincludedir = $(includedir)/rdma diff -upN svn3/gen2/trunk/src/userspace/librdmacm/examples/mckey.c svn/gen2/trunk/src/userspace/librdmacm/examples/mckey.c --- svn3/gen2/trunk/src/userspace/librdmacm/examples/mckey.c 1969-12-31 16:00:00.000000000 -0800 +++ 
svn/gen2/trunk/src/userspace/librdmacm/examples/mckey.c 2006-06-06 12:56:35.000000000 -0700 @@ -0,0 +1,505 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +struct cmatest_node { + int id; + struct rdma_cm_id *cma_id; + int connected; + struct ibv_pd *pd; + struct ibv_cq *cq; + struct ibv_mr *mr; + struct ibv_ah *ah; + uint32_t remote_qpn; + uint32_t remote_qkey; + void *mem; +}; + +struct cmatest { + struct rdma_event_channel *channel; + struct cmatest_node *nodes; + int conn_index; + int connects_left; + + struct sockaddr_in dst_in; + struct sockaddr *dst_addr; + struct sockaddr_in src_in; + struct sockaddr *src_addr; +}; + +static struct cmatest test; +static int connections = 1; +static int message_size = 100; +static int message_count = 10; +static int is_sender; + +static int create_message(struct cmatest_node *node) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + node->mem = malloc(message_size + sizeof(struct ibv_grh)); + if (!node->mem) { + printf("failed message allocation\n"); + return -1; + } + node->mr = ibv_reg_mr(node->pd, node->mem, + message_size + sizeof(struct ibv_grh), + IBV_ACCESS_LOCAL_WRITE); + if (!node->mr) { + printf("failed to reg MR\n"); + goto err; + } + return 0; +err: + free(node->mem); + return -1; +} + +static int init_node(struct cmatest_node *node) +{ + struct ibv_qp_init_attr init_qp_attr; + int cqe, ret; + + node->pd = ibv_alloc_pd(node->cma_id->verbs); + if (!node->pd) { + ret = -ENOMEM; + printf("cmatose: unable to allocate PD\n"); + goto out; + } + + cqe = message_count ? message_count * 2 : 2; + node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + if (!node->cq) { + ret = -ENOMEM; + printf("cmatose: unable to create CQ\n"); + goto out; + } + + memset(&init_qp_attr, 0, sizeof init_qp_attr); + init_qp_attr.cap.max_send_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_recv_wr = message_count ? 
message_count : 1; + init_qp_attr.cap.max_send_sge = 1; + init_qp_attr.cap.max_recv_sge = 1; + init_qp_attr.qp_context = node; + init_qp_attr.sq_sig_all = 0; + init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.send_cq = node->cq; + init_qp_attr.recv_cq = node->cq; + ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); + if (ret) { + printf("cmatose: unable to create QP: %d\n", ret); + goto out; + } + + ret = create_message(node); + if (ret) { + printf("cmatose: failed to create messages: %d\n", ret); + goto out; + } +out: + return ret; +} + +static int post_recvs(struct cmatest_node *node) +{ + struct ibv_recv_wr recv_wr, *recv_failure; + struct ibv_sge sge; + int i, ret = 0; + + if (!message_count) + return 0; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + recv_wr.wr_id = (uintptr_t) node; + + sge.length = message_size + sizeof(struct ibv_grh); + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && !ret; i++ ) { + ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + if (ret) { + printf("failed to post receives: %d\n", ret); + break; + } + } + return ret; +} + +static int post_sends(struct cmatest_node *node, int signal_flag) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + struct ibv_sge sge; + int i, ret = 0; + + if (!node->connected || !message_count) + return 0; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND_WITH_IMM; + send_wr.send_flags = IBV_SEND_INLINE | signal_flag; + send_wr.wr_id = (unsigned long)node; + send_wr.imm_data = htonl(node->cma_id->qp->qp_num); + + send_wr.wr.ud.ah = node->ah; + send_wr.wr.ud.remote_qpn = node->remote_qpn; + send_wr.wr.ud.remote_qkey = node->remote_qkey; + + sge.length = message_size - sizeof(struct ibv_grh); + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && !ret; i++) { + ret = ibv_post_send(node->cma_id->qp, &send_wr, 
&bad_send_wr); + if (ret) + printf("failed to post sends: %d\n", ret); + } + return ret; +} + +static void connect_error(void) +{ + test.connects_left--; +} + +static int addr_handler(struct cmatest_node *node) +{ + int ret; + + ret = init_node(node); + if (ret) + goto err; + + if (!is_sender) { + ret = post_recvs(node); + if (ret) + goto err; + } + + ret = rdma_join_multicast(node->cma_id, test.dst_addr, node); + if (ret) { + printf("cmatose: failure joining: %d\n", ret); + goto err; + } + return 0; +err: + connect_error(); + return ret; +} + +static int join_handler(struct cmatest_node *node) +{ + struct ibv_ah_attr ah_attr; + int ret; + + ret = rdma_get_dst_attr(node->cma_id, test.dst_addr, &ah_attr, + &node->remote_qpn, &node->remote_qkey); + if (ret) { + printf("mckey: failure getting destination attributes\n"); + goto err; + } + + node->ah = ibv_create_ah(node->pd, &ah_attr); + if (!node->ah) { + printf("mckey: failure creating address handle\n"); + goto err; + } + + node->connected = 1; + test.connects_left--; + return 0; +err: + connect_error(); + return ret; +} + +static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) +{ + int ret = 0; + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + ret = addr_handler(cma_id->context); + break; + case RDMA_CM_EVENT_MULTICAST_JOIN: + ret = join_handler(cma_id->context); + break; + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_MULTICAST_ERROR: + printf("cmatose: event: %d, error: %d\n", event->event, + event->status); + connect_error(); + ret = event->status; + break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: + /* Cleanup will occur after test completes. 
*/ + break; + default: + break; + } + return ret; +} + +static void destroy_node(struct cmatest_node *node) +{ + if (!node->cma_id) + return; + + if (node->ah) + ibv_destroy_ah(node->ah); + + if (node->cma_id->qp) + rdma_destroy_qp(node->cma_id); + + if (node->cq) + ibv_destroy_cq(node->cq); + + if (node->mem) { + ibv_dereg_mr(node->mr); + free(node->mem); + } + + if (node->pd) + ibv_dealloc_pd(node->pd); + + /* Destroy the RDMA ID after all device resources */ + rdma_destroy_id(node->cma_id); +} + +static int alloc_nodes(void) +{ + int ret, i; + + test.nodes = malloc(sizeof *test.nodes * connections); + if (!test.nodes) { + printf("cmatose: unable to allocate memory for test nodes\n"); + return -ENOMEM; + } + memset(test.nodes, 0, sizeof *test.nodes * connections); + + for (i = 0; i < connections; i++) { + test.nodes[i].id = i; + ret = rdma_create_id(test.channel, &test.nodes[i].cma_id, + &test.nodes[i], RDMA_PS_UDP); + if (ret) + goto err; + } + return 0; +err: + while (--i >= 0) + rdma_destroy_id(test.nodes[i].cma_id); + free(test.nodes); + return ret; +} + +static void destroy_nodes(void) +{ + int i; + + for (i = 0; i < connections; i++) + destroy_node(&test.nodes[i]); + free(test.nodes); +} + +static int poll_cqs(void) +{ + struct ibv_wc wc[8]; + int done, i, ret; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + if (ret < 0) { + printf("cmatose: failed polling CQ: %d\n", ret); + return ret; + } + } + } + return 0; +} + +static int connect_events(void) +{ + struct rdma_cm_event *event; + int ret = 0; + + while (test.connects_left && !ret) { + ret = rdma_get_cm_event(test.channel, &event); + if (!ret) { + ret = cma_handler(event->id, event); + rdma_ack_cm_event(event); + } + } + return ret; +} + +static int get_addr(char *dst, struct sockaddr_in *addr) +{ + struct addrinfo *res; + int ret; + + ret = getaddrinfo(dst, NULL, 
NULL, &res); + if (ret) { + printf("getaddrinfo failed - invalid hostname or IP address\n"); + return ret; + } + + if (res->ai_family != PF_INET) { + ret = -1; + goto out; + } + + *addr = *(struct sockaddr_in *) res->ai_addr; +out: + freeaddrinfo(res); + return ret; +} + +static int run(char *dst, char *src) +{ + int i, ret; + + printf("cmatose: starting client\n"); + if (src) { + ret = get_addr(src, &test.src_in); + if (ret) + return ret; + } + + ret = get_addr(dst, &test.dst_in); + if (ret) + return ret; + + test.dst_in.sin_port = 7174; + + printf("cmatose: joining\n"); + for (i = 0; i < connections; i++) { + ret = rdma_resolve_addr(test.nodes[i].cma_id, + src ? test.src_addr : NULL, + test.dst_addr, 2000); + if (ret) { + printf("cmatose: failure getting addr: %d\n", ret); + connect_error(); + return ret; + } + } + + ret = connect_events(); + if (ret) + goto out; + + /* + * Pause to give SM chance to configure switches. We don't want to + * handle reliability issue in this simple test program. + */ + sleep(3); + + if (message_count) { + if (is_sender) { + printf("initiating data transfers\n"); + for (i = 0; i < connections; i++) { + ret = post_sends(&test.nodes[i], 0); + if (ret) + goto out; + } + } else { + printf("receiving data transfers\n"); + ret = poll_cqs(); + if (ret) + goto out; + } + printf("data transfers complete\n"); + } +out: + return ret; +} + +int main(int argc, char **argv) +{ + int ret; + + if (argc < 3 || argc > 4) { + printf("usage: %s {s[end] | r[ecv]} mcast_addr [bind_addr]]\n", + argv[0]); + exit(1); + } + is_sender = (argv[1][0] == 's'); + + test.dst_addr = (struct sockaddr *) &test.dst_in; + test.src_addr = (struct sockaddr *) &test.src_in; + test.connects_left = connections; + + test.channel = rdma_create_event_channel(); + if (!test.channel) { + printf("failed to create event channel\n"); + exit(1); + } + + if (alloc_nodes()) + exit(1); + + ret = run(argv[2], (argc == 4) ? 
argv[3] : NULL); + + printf("test complete\n"); + destroy_nodes(); + rdma_destroy_event_channel(test.channel); + + printf("return status %d\n", ret); + return ret; +} From robert.j.woodruff at intel.com Mon Jun 12 10:49:59 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 12 Jun 2006 10:49:59 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F0B6BD@orsmsx408> Bryan wrote, >Due to unfortunate timing, the ipath driver in OFED 1.0-rc6 does not >work correctly. You can download an updated tarball from here, for >which the ipath driver works fine: http://openib.red-bean.com/OFED-1.0-rc6+ipath.tar.bz2 >Alternatively, pull the necessary patches from SVN. Still does not seem to compile. In file included from /var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_cq.c:36: /var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h:399: error: `BITS_PER_BYTE' undeclared here (not in a function) make[3]: *** [/var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_cq.o] Error 1 make[2]: *** [/var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath] Error 2 make[1]: *** [_module_/var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband] Error 2 make[1]: Leaving directory `/usr/src/kernels/2.6.9-34.EL-smp-x86_64' make: *** [kernel] Error 2 ERROR: Failed to execute: make kernel ~ From mst at mellanox.co.il Mon Jun 12 11:32:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 21:32:47 +0300 Subject: [openib-general] another ipoib question Message-ID: <20060612183247.GD20500@mellanox.co.il> Hello, Roland!
Here's another question from code review conducted by Eitan Rabin: could the flush task set the ipoib_neigh pointer encoded inside the neighbour hardware address to NULL and free the neighbour, while ipoib_start_xmit is accessing the ipoib_neigh through the pointer it has loaded from the hardware address? The flush task does not seem to hold xmit_lock. -- MST From rdreier at cisco.com Mon Jun 12 12:36:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 12:36:36 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex Message-ID: IB/uverbs: Don't serialize with ib_uverbs_idr_mutex Currently, all userspace verbs operations that call into the kernel are serialized by ib_uverbs_idr_mutex. This can be a scalability issue for some workloads, especially for devices driven by the ipath driver, which needs to call into the kernel even for datapath operations. Fix this by adding reference counts to the userspace objects, and then converting ib_uverbs_idr_mutex into a spinlock that only protects the idrs long enough to take a reference on the object being looked up. Because remove operations may fail, we have to do a slightly funky two-step deletion, which is described in the comments at the top of uverbs_cmd.c. This also still leaves ib_uverbs_idr_lock as a single lock that is possibly subject to contention. However, the lock hold time will only be a single idr operation, so multiple threads should still be able to make progress, even if ib_uverbs_idr_lock is being ping-ponged. Surprisingly, these changes even shrink the object code: add/remove: 23/5 grow/shrink: 4/21 up/down: 589/-688 (-99) Signed-off-by: Roland Dreier --- I started thinking about the "kill ib_uverbs_idr_mutex" problem, and I realized that there are actually some interesting issues there (as described in the comment at the top of uverbs_cmd.c). In fact I ended up coding the solution below.
This passes some basic tests but it could probably use some review. I'm thinking of checking it into svn for some further cooking in the next day or two, so let me know if you see any issues. diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 3372d67..bb9bee5 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -132,7 +132,7 @@ struct ib_ucq_object { u32 async_events_reported; }; -extern struct mutex ib_uverbs_idr_mutex; +extern spinlock_t ib_uverbs_idr_lock; extern struct idr ib_uverbs_pd_idr; extern struct idr ib_uverbs_mr_idr; extern struct idr ib_uverbs_mw_idr; @@ -141,6 +141,8 @@ extern struct idr ib_uverbs_cq_idr; extern struct idr ib_uverbs_qp_idr; extern struct idr ib_uverbs_srq_idr; +void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj); + struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file, int is_async, int *fd); void ib_uverbs_release_event_file(struct kref *ref); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 403dd81..7968b5f 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -50,7 +50,64 @@ #define INIT_UDATA(udata, ibuf, obuf, il (udata)->outlen = (olen); \ } while (0) -static int idr_add_uobj(struct idr *idr, void *obj, struct ib_uobject *uobj) +/* + * The ib_uobject locking scheme is as follows: + * + * - ib_uverbs_idr_lock protects the uverbs idrs themselves, so it + * needs to be held during all idr operations. When an object is + * looked up, a reference must be taken on the object's kref before + * dropping this lock. + * + * - Each object also has an rwsem. This rwsem must be held for + * reading while an operation that uses the object is performed. + * For example, while registering an MR, the associated PD's + * uobject.mutex must be held for reading. The rwsem must be held + * for writing while initializing or destroying an object. 
+ * + * - In addition, each object has a "live" flag. If this flag is not + * set, then lookups of the object will fail even if it is found in + * the idr. This handles a reader that blocks and does not acquire + * the rwsem until after the object is destroyed. The destroy + * operation will set the live flag to 0 and then drop the rwsem; + * this will allow the reader to acquire the rwsem, see that the + * live flag is 0, and then drop the rwsem and its reference to + * object. The underlying storage will not be freed until the last + * reference to the object is dropped. + */ + +static void init_uobj(struct ib_uobject *uobj, u64 user_handle, + struct ib_ucontext *context) +{ + uobj->user_handle = user_handle; + uobj->context = context; + kref_init(&uobj->ref); + init_rwsem(&uobj->mutex); + uobj->live = 0; +} + +static void release_uobj(struct kref *kref) +{ + kfree(container_of(kref, struct ib_uobject, ref)); +} + +static void put_uobj(struct ib_uobject *uobj) +{ + kref_put(&uobj->ref, release_uobj); +} + +static void put_uobj_read(struct ib_uobject *uobj) +{ + up_read(&uobj->mutex); + put_uobj(uobj); +} + +static void put_uobj_write(struct ib_uobject *uobj) +{ + up_write(&uobj->mutex); + put_uobj(uobj); +} + +static int idr_add_uobj(struct idr *idr, struct ib_uobject *uobj) { int ret; @@ -58,7 +115,9 @@ retry: if (!idr_pre_get(idr, GFP_KERNEL)) return -ENOMEM; + spin_lock(&ib_uverbs_idr_lock); ret = idr_get_new(idr, uobj, &uobj->id); + spin_unlock(&ib_uverbs_idr_lock); if (ret == -EAGAIN) goto retry; @@ -66,6 +125,121 @@ retry: return ret; } +void idr_remove_uobj(struct idr *idr, struct ib_uobject *uobj) +{ + spin_lock(&ib_uverbs_idr_lock); + idr_remove(idr, uobj->id); + spin_unlock(&ib_uverbs_idr_lock); +} + +static struct ib_uobject *__idr_get_uobj(struct idr *idr, int id, + struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + spin_lock(&ib_uverbs_idr_lock); + uobj = idr_find(idr, id); + if (uobj) + kref_get(&uobj->ref); + 
spin_unlock(&ib_uverbs_idr_lock); + + return uobj; +} + +static struct ib_uobject *idr_read_uobj(struct idr *idr, int id, + struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + uobj = __idr_get_uobj(idr, id, context); + if (!uobj) + return NULL; + + down_read(&uobj->mutex); + if (!uobj->live) { + put_uobj_read(uobj); + return NULL; + } + + return uobj; +} + +static struct ib_uobject *idr_write_uobj(struct idr *idr, int id, + struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + uobj = __idr_get_uobj(idr, id, context); + if (!uobj) + return NULL; + + down_write(&uobj->mutex); + if (!uobj->live) { + put_uobj_write(uobj); + return NULL; + } + + return uobj; +} + +static void *idr_read_obj(struct idr *idr, int id, struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + uobj = idr_read_uobj(idr, id, context); + return uobj ? uobj->object : NULL; +} + +static struct ib_pd *idr_read_pd(int pd_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_pd_idr, pd_handle, context); +} + +static void put_pd_read(struct ib_pd *pd) +{ + put_uobj_read(pd->uobject); +} + +static struct ib_cq *idr_read_cq(int cq_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_cq_idr, cq_handle, context); +} + +static void put_cq_read(struct ib_cq *cq) +{ + put_uobj_read(cq->uobject); +} + +static struct ib_ah *idr_read_ah(int ah_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_ah_idr, ah_handle, context); +} + +static void put_ah_read(struct ib_ah *ah) +{ + put_uobj_read(ah->uobject); +} + +static struct ib_qp *idr_read_qp(int qp_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_qp_idr, qp_handle, context); +} + +static void put_qp_read(struct ib_qp *qp) +{ + put_uobj_read(qp->uobject); +} + +static struct ib_srq *idr_read_srq(int srq_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_srq_idr, srq_handle, context); +} + +static void put_srq_read(struct ib_srq *srq) 
+{ + put_uobj_read(srq->uobject); +} + ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -296,7 +470,8 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve if (!uobj) return -ENOMEM; - uobj->context = file->ucontext; + init_uobj(uobj, 0, file->ucontext); + down_write(&uobj->mutex); pd = file->device->ib_dev->alloc_pd(file->device->ib_dev, file->ucontext, &udata); @@ -309,11 +484,10 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve pd->uobject = uobj; atomic_set(&pd->usecnt, 0); - mutex_lock(&ib_uverbs_idr_mutex); - - ret = idr_add_uobj(&ib_uverbs_pd_idr, pd, uobj); + uobj->object = pd; + ret = idr_add_uobj(&ib_uverbs_pd_idr, uobj); if (ret) - goto err_up; + goto err_idr; memset(&resp, 0, sizeof resp); resp.pd_handle = uobj->id; @@ -321,26 +495,27 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); list_add_tail(&uobj->list, &file->ucontext->pd_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + uobj->live = 1; + + up_write(&uobj->mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_pd_idr, uobj->id); +err_copy: + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_idr: ib_dealloc_pd(pd); err: - kfree(uobj); + put_uobj_write(uobj); return ret; } @@ -349,37 +524,34 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_u int in_len, int out_len) { struct ib_uverbs_dealloc_pd cmd; - struct ib_pd *pd; struct ib_uobject *uobj; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_pd_idr, cmd.pd_handle, file->ucontext); + if (!uobj) + return -EINVAL; - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) - goto out; + ret = ib_dealloc_pd(uobj->object); + if (!ret) + 
uobj->live = 0; - uobj = pd->uobject; + put_uobj_write(uobj); - ret = ib_dealloc_pd(pd); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_pd_idr, cmd.pd_handle); + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); mutex_lock(&file->mutex); list_del(&uobj->list); mutex_unlock(&file->mutex); - kfree(uobj); - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_uobj(uobj); - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, @@ -419,7 +591,8 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb if (!obj) return -ENOMEM; - obj->uobject.context = file->ucontext; + init_uobj(&obj->uobject, 0, file->ucontext); + down_write(&obj->uobject.mutex); /* * We ask for writable memory if any access flags other than @@ -436,23 +609,14 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb obj->umem.virt_base = cmd.hca_va; - mutex_lock(&ib_uverbs_idr_mutex); - - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) { - ret = -EINVAL; - goto err_up; - } - - if (!pd->device->reg_user_mr) { - ret = -ENOSYS; - goto err_up; - } + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) + goto err_release; mr = pd->device->reg_user_mr(pd, &obj->umem, cmd.access_flags, &udata); if (IS_ERR(mr)) { ret = PTR_ERR(mr); - goto err_up; + goto err_put; } mr->device = pd->device; @@ -461,43 +625,48 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb atomic_inc(&pd->usecnt); atomic_set(&mr->usecnt, 0); - memset(&resp, 0, sizeof resp); - resp.lkey = mr->lkey; - resp.rkey = mr->rkey; - - ret = idr_add_uobj(&ib_uverbs_mr_idr, mr, &obj->uobject); + obj->uobject.object = mr; + ret = idr_add_uobj(&ib_uverbs_mr_idr, &obj->uobject); if (ret) goto err_unreg; + memset(&resp, 0, sizeof resp); + resp.lkey = mr->lkey; + resp.rkey = mr->rkey; resp.mr_handle = obj->uobject.id; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + 
mutex_lock(&file->mutex); list_add_tail(&obj->uobject.list, &file->ucontext->mr_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_mr_idr, obj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_mr_idr, &obj->uobject); err_unreg: ib_dereg_mr(mr); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_put: + put_pd_read(pd); +err_release: ib_umem_release(file->device->ib_dev, &obj->umem); err_free: - kfree(obj); + put_uobj_write(&obj->uobject); return ret; } @@ -507,37 +676,40 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uve { struct ib_uverbs_dereg_mr cmd; struct ib_mr *mr; + struct ib_uobject *uobj; struct ib_umem_object *memobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - mr = idr_find(&ib_uverbs_mr_idr, cmd.mr_handle); - if (!mr || mr->uobject->context != file->ucontext) - goto out; + uobj = idr_write_uobj(&ib_uverbs_mr_idr, cmd.mr_handle, file->ucontext); + if (!uobj) + return -EINVAL; - memobj = container_of(mr->uobject, struct ib_umem_object, uobject); + memobj = container_of(uobj, struct ib_umem_object, uobject); + mr = uobj->object; ret = ib_dereg_mr(mr); + if (!ret) + uobj->live = 0; + + put_uobj_write(uobj); + if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_mr_idr, cmd.mr_handle); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); mutex_lock(&file->mutex); - list_del(&memobj->uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); ib_umem_release(file->device->ib_dev, &memobj->umem); - kfree(memobj); -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_uobj(uobj); - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file, @@ -576,7 +748,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uv struct ib_uverbs_create_cq cmd; struct ib_uverbs_create_cq_resp resp; struct ib_udata udata; - struct ib_ucq_object *uobj; + struct ib_ucq_object *obj; struct ib_uverbs_event_file *ev_file = NULL; struct ib_cq *cq; int ret; @@ -594,10 +766,13 @@ ssize_t ib_uverbs_create_cq(struct ib_uv if (cmd.comp_vector >= file->device->num_comp_vectors) return -EINVAL; - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; + init_uobj(&obj->uobject, cmd.user_handle, file->ucontext); + down_write(&obj->uobject.mutex); + if (cmd.comp_channel >= 0) { ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); if (!ev_file) { @@ -606,63 +781,64 @@ ssize_t ib_uverbs_create_cq(struct ib_uv } } - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->uverbs_file = file; - uobj->comp_events_reported = 0; - uobj->async_events_reported = 0; - INIT_LIST_HEAD(&uobj->comp_list); - INIT_LIST_HEAD(&uobj->async_list); + obj->uverbs_file = file; + obj->comp_events_reported = 0; + obj->async_events_reported = 0; + INIT_LIST_HEAD(&obj->comp_list); + INIT_LIST_HEAD(&obj->async_list); cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe, file->ucontext, &udata); if (IS_ERR(cq)) { ret = PTR_ERR(cq); - goto err; + goto err_file; } cq->device = file->device->ib_dev; - cq->uobject = &uobj->uobject; + cq->uobject = &obj->uobject; cq->comp_handler = ib_uverbs_comp_handler; cq->event_handler = ib_uverbs_cq_event_handler; cq->cq_context = ev_file; atomic_set(&cq->usecnt, 0); - mutex_lock(&ib_uverbs_idr_mutex); - - ret = idr_add_uobj(&ib_uverbs_cq_idr, cq, &uobj->uobject); + obj->uobject.object = cq; + ret = idr_add_uobj(&ib_uverbs_cq_idr, &obj->uobject); if (ret) - goto err_up; + goto err_free; memset(&resp, 0, sizeof resp); - 
resp.cq_handle = uobj->uobject.id; + resp.cq_handle = obj->uobject.id; resp.cqe = cq->cqe; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->cq_list); + list_add_tail(&obj->uobject.list, &file->ucontext->cq_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_cq_idr, uobj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_cq_idr, &obj->uobject); + -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_free: ib_destroy_cq(cq); -err: +err_file: if (ev_file) - ib_uverbs_release_ucq(file, ev_file, uobj); - kfree(uobj); + ib_uverbs_release_ucq(file, ev_file, obj); + +err: + put_uobj_write(&obj->uobject); return ret; } @@ -673,6 +849,7 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv struct ib_uverbs_resize_cq cmd; struct ib_uverbs_resize_cq_resp resp; struct ib_udata udata; + struct ib_uobject *uobj; struct ib_cq *cq; int ret = -EINVAL; @@ -683,11 +860,10 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - mutex_lock(&ib_uverbs_idr_mutex); - - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext || !cq->device->resize_cq) - goto out; + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; ret = cq->device->resize_cq(cq, cmd.cqe, &udata); if (ret) @@ -701,7 +877,7 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_uobj_read(uobj); return ret ? 
ret : in_len; } @@ -712,6 +888,7 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver { struct ib_uverbs_poll_cq cmd; struct ib_uverbs_poll_cq_resp *resp; + struct ib_uobject *uobj; struct ib_cq *cq; struct ib_wc *wc; int ret = 0; @@ -732,15 +909,17 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver goto out_wc; } - mutex_lock(&ib_uverbs_idr_mutex); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext) { + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) { ret = -EINVAL; goto out; } + cq = uobj->object; resp->count = ib_poll_cq(cq, cmd.ne, wc); + put_uobj_read(uobj); + for (i = 0; i < resp->count; i++) { resp->wc[i].wr_id = wc[i].wr_id; resp->wc[i].status = wc[i].status; @@ -762,7 +941,6 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); kfree(resp); out_wc: @@ -775,22 +953,23 @@ ssize_t ib_uverbs_req_notify_cq(struct i int out_len) { struct ib_uverbs_req_notify_cq cmd; + struct ib_uobject *uobj; struct ib_cq *cq; - int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (cq && cq->uobject->context == file->ucontext) { - ib_req_notify_cq(cq, cmd.solicited_only ? - IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); - ret = in_len; - } - mutex_unlock(&ib_uverbs_idr_mutex); + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; - return ret; + ib_req_notify_cq(cq, cmd.solicited_only ? 
+ IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + + put_uobj_read(uobj); + + return in_len; } ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file, @@ -799,52 +978,50 @@ ssize_t ib_uverbs_destroy_cq(struct ib_u { struct ib_uverbs_destroy_cq cmd; struct ib_uverbs_destroy_cq_resp resp; + struct ib_uobject *uobj; struct ib_cq *cq; - struct ib_ucq_object *uobj; + struct ib_ucq_object *obj; struct ib_uverbs_event_file *ev_file; - u64 user_handle; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - memset(&resp, 0, sizeof resp); - - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; + ev_file = cq->cq_context; + obj = container_of(cq->uobject, struct ib_ucq_object, uobject); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_cq(cq); + if (!ret) + uobj->live = 0; - user_handle = cq->uobject->user_handle; - uobj = container_of(cq->uobject, struct ib_ucq_object, uobject); - ev_file = cq->cq_context; + put_uobj_write(uobj); - ret = ib_destroy_cq(cq); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_cq_idr, cmd.cq_handle); + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_uverbs_release_ucq(file, ev_file, uobj); + ib_uverbs_release_ucq(file, ev_file, obj); - resp.comp_events_reported = uobj->comp_events_reported; - resp.async_events_reported = uobj->async_events_reported; + memset(&resp, 0, sizeof resp); + resp.comp_events_reported = obj->comp_events_reported; + resp.async_events_reported = obj->async_events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + return -EFAULT; - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file, @@ -854,7 +1031,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uverbs_create_qp cmd; struct ib_uverbs_create_qp_resp resp; struct ib_udata udata; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_pd *pd; struct ib_cq *scq, *rcq; struct ib_srq *srq; @@ -872,23 +1049,21 @@ ssize_t ib_uverbs_create_qp(struct ib_uv (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); + init_uobj(&obj->uevent.uobject, cmd.user_handle, file->ucontext); + down_write(&obj->uevent.uobject.mutex); - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - scq = idr_find(&ib_uverbs_cq_idr, cmd.send_cq_handle); - rcq = idr_find(&ib_uverbs_cq_idr, cmd.recv_cq_handle); - srq = cmd.is_srq ? idr_find(&ib_uverbs_srq_idr, cmd.srq_handle) : NULL; + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + scq = idr_read_cq(cmd.send_cq_handle, file->ucontext); + rcq = idr_read_cq(cmd.recv_cq_handle, file->ucontext); + srq = cmd.is_srq ? 
idr_read_srq(cmd.srq_handle, file->ucontext) : NULL; - if (!pd || pd->uobject->context != file->ucontext || - !scq || scq->uobject->context != file->ucontext || - !rcq || rcq->uobject->context != file->ucontext || - (cmd.is_srq && (!srq || srq->uobject->context != file->ucontext))) { + if (!pd || !scq || !rcq || (cmd.is_srq && !srq)) { ret = -EINVAL; - goto err_up; + goto err_put; } attr.event_handler = ib_uverbs_qp_event_handler; @@ -905,16 +1080,14 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.cap.max_recv_sge = cmd.max_recv_sge; attr.cap.max_inline_data = cmd.max_inline_data; - uobj->uevent.uobject.user_handle = cmd.user_handle; - uobj->uevent.uobject.context = file->ucontext; - uobj->uevent.events_reported = 0; - INIT_LIST_HEAD(&uobj->uevent.event_list); - INIT_LIST_HEAD(&uobj->mcast_list); + obj->uevent.events_reported = 0; + INIT_LIST_HEAD(&obj->uevent.event_list); + INIT_LIST_HEAD(&obj->mcast_list); qp = pd->device->create_qp(pd, &attr, &udata); if (IS_ERR(qp)) { ret = PTR_ERR(qp); - goto err_up; + goto err_put; } qp->device = pd->device; @@ -922,7 +1095,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv qp->send_cq = attr.send_cq; qp->recv_cq = attr.recv_cq; qp->srq = attr.srq; - qp->uobject = &uobj->uevent.uobject; + qp->uobject = &obj->uevent.uobject; qp->event_handler = attr.event_handler; qp->qp_context = attr.qp_context; qp->qp_type = attr.qp_type; @@ -932,14 +1105,14 @@ ssize_t ib_uverbs_create_qp(struct ib_uv if (attr.srq) atomic_inc(&attr.srq->usecnt); - memset(&resp, 0, sizeof resp); - resp.qpn = qp->qp_num; - - ret = idr_add_uobj(&ib_uverbs_qp_idr, qp, &uobj->uevent.uobject); + obj->uevent.uobject.object = qp; + ret = idr_add_uobj(&ib_uverbs_qp_idr, &obj->uevent.uobject); if (ret) goto err_destroy; - resp.qp_handle = uobj->uevent.uobject.id; + memset(&resp, 0, sizeof resp); + resp.qpn = qp->qp_num; + resp.qp_handle = obj->uevent.uobject.id; resp.max_recv_sge = attr.cap.max_recv_sge; resp.max_send_sge = attr.cap.max_send_sge; resp.max_recv_wr = 
attr.cap.max_recv_wr; @@ -949,27 +1122,36 @@ ssize_t ib_uverbs_create_qp(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); - list_add_tail(&uobj->uevent.uobject.list, &file->ucontext->qp_list); + list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uevent.uobject.live = 1; + + up_write(&obj->uevent.uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_qp_idr, uobj->uevent.uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_qp_idr, &obj->uevent.uobject); err_destroy: ib_destroy_qp(qp); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err_put: + if (pd) + put_pd_read(pd); + if (scq) + put_cq_read(scq); + if (rcq) + put_cq_read(rcq); + if (srq) + put_srq_read(srq); + + put_uobj_write(&obj->uevent.uobject); return ret; } @@ -994,15 +1176,15 @@ ssize_t ib_uverbs_query_qp(struct ib_uve goto out; } - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_query_qp(qp, attr, cmd.attr_mask, init_attr); - else + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) { ret = -EINVAL; + goto out; + } - mutex_unlock(&ib_uverbs_idr_mutex); + ret = ib_query_qp(qp, attr, cmd.attr_mask, init_attr); + + put_qp_read(qp); if (ret) goto out; @@ -1089,10 +1271,8 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv if (!attr) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) { + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) { ret = -EINVAL; goto out; } @@ -1144,13 +1324,15 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; ret = ib_modify_qp(qp, attr, cmd.attr_mask); + + put_qp_read(qp); + if (ret) 
goto out; ret = in_len; out: - mutex_unlock(&ib_uverbs_idr_mutex); kfree(attr); return ret; @@ -1162,8 +1344,9 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u { struct ib_uverbs_destroy_qp cmd; struct ib_uverbs_destroy_qp_resp resp; + struct ib_uobject *uobj; struct ib_qp *qp; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1171,43 +1354,43 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u memset(&resp, 0, sizeof resp); - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; - - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + uobj = idr_write_uobj(&ib_uverbs_qp_idr, cmd.qp_handle, file->ucontext); + if (!uobj) + return -EINVAL; + qp = uobj->object; + obj = container_of(uobj, struct ib_uqp_object, uevent.uobject); - if (!list_empty(&uobj->mcast_list)) { - ret = -EBUSY; - goto out; + if (!list_empty(&obj->mcast_list)) { + put_uobj_write(uobj); + return -EBUSY; } ret = ib_destroy_qp(qp); + if (!ret) + uobj->live = 0; + + put_uobj_write(uobj); + if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle); + idr_remove_uobj(&ib_uverbs_qp_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uevent.uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_uverbs_release_uevent(file, &uobj->uevent); + ib_uverbs_release_uevent(file, &obj->uevent); - resp.events_reported = uobj->uevent.events_reported; + resp.events_reported = obj->uevent.events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + return -EFAULT; - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file, @@ -1220,6 +1403,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv struct ib_send_wr *wr = NULL, *last, *next, *bad_wr; struct ib_qp *qp; int i, sg_ind; + int is_ud; ssize_t ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1236,12 +1420,11 @@ ssize_t ib_uverbs_post_send(struct ib_uv if (!user_wr) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) goto out; + is_ud = qp->qp_type == IB_QPT_UD; sg_ind = 0; last = NULL; for (i = 0; i < cmd.wr_count; ++i) { @@ -1249,12 +1432,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv buf + sizeof cmd + i * cmd.wqe_size, cmd.wqe_size)) { ret = -EFAULT; - goto out; + goto out_put; } if (user_wr->num_sge + sg_ind > cmd.sge_count) { ret = -EINVAL; - goto out; + goto out_put; } next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + @@ -1262,7 +1445,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv GFP_KERNEL); if (!next) { ret = -ENOMEM; - goto out; + goto out_put; } if (!last) @@ -1278,12 +1461,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv next->send_flags = user_wr->send_flags; next->imm_data = (__be32 __force) user_wr->imm_data; - if (qp->qp_type == IB_QPT_UD) { - next->wr.ud.ah = idr_find(&ib_uverbs_ah_idr, - user_wr->wr.ud.ah); + if (is_ud) { + next->wr.ud.ah = idr_read_ah(user_wr->wr.ud.ah, + file->ucontext); if (!next->wr.ud.ah) { ret = -EINVAL; - goto out; + goto out_put; } next->wr.ud.remote_qpn = user_wr->wr.ud.remote_qpn; next->wr.ud.remote_qkey = user_wr->wr.ud.remote_qkey; @@ -1320,7 +1503,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv sg_ind * sizeof (struct ib_sge), next->num_sge * sizeof (struct ib_sge))) { ret = -EFAULT; - goto out; + goto out_put; } sg_ind += next->num_sge; } else @@ -1340,10 +1523,13 @@ ssize_t ib_uverbs_post_send(struct ib_uv &resp, 
sizeof resp)) ret = -EFAULT; -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); +out: while (wr) { + if (is_ud && wr->wr.ud.ah) + put_ah_read(wr->wr.ud.ah); next = wr->next; kfree(wr); wr = next; @@ -1458,14 +1644,15 @@ ssize_t ib_uverbs_post_recv(struct ib_uv if (IS_ERR(wr)) return PTR_ERR(wr); - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) goto out; resp.bad_wr = 0; ret = qp->device->post_recv(qp, wr, &bad_wr); + + put_qp_read(qp); + if (ret) for (next = wr; next; next = next->next) { ++resp.bad_wr; @@ -1479,8 +1666,6 @@ ssize_t ib_uverbs_post_recv(struct ib_uv ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); - while (wr) { next = wr->next; kfree(wr); @@ -1509,14 +1694,15 @@ ssize_t ib_uverbs_post_srq_recv(struct i if (IS_ERR(wr)) return PTR_ERR(wr); - mutex_lock(&ib_uverbs_idr_mutex); - - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) goto out; resp.bad_wr = 0; ret = srq->device->post_srq_recv(srq, wr, &bad_wr); + + put_srq_read(srq); + if (ret) for (next = wr; next; next = next->next) { ++resp.bad_wr; @@ -1530,8 +1716,6 @@ ssize_t ib_uverbs_post_srq_recv(struct i ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); - while (wr) { next = wr->next; kfree(wr); @@ -1563,17 +1747,15 @@ ssize_t ib_uverbs_create_ah(struct ib_uv if (!uobj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); + init_uobj(uobj, cmd.user_handle, file->ucontext); + down_write(&uobj->mutex); - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) { + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) { ret = -EINVAL; - goto err_up; + goto err; } - uobj->user_handle = cmd.user_handle; - uobj->context = file->ucontext; - 
attr.dlid = cmd.attr.dlid; attr.sl = cmd.attr.sl; attr.src_path_bits = cmd.attr.src_path_bits; @@ -1589,12 +1771,11 @@ ssize_t ib_uverbs_create_ah(struct ib_uv ah = ib_create_ah(pd, &attr); if (IS_ERR(ah)) { ret = PTR_ERR(ah); - goto err_up; + goto err; } ah->uobject = uobj; - - ret = idr_add_uobj(&ib_uverbs_ah_idr, ah, uobj); + ret = idr_add_uobj(&ib_uverbs_ah_idr, uobj); if (ret) goto err_destroy; @@ -1603,27 +1784,29 @@ ssize_t ib_uverbs_create_ah(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); list_add_tail(&uobj->list, &file->ucontext->ah_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + uobj->live = 1; + + up_write(&uobj->mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_ah_idr, uobj->id); +err_copy: + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); err_destroy: ib_destroy_ah(ah); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err: + put_uobj_write(uobj); return ret; } @@ -1633,35 +1816,34 @@ ssize_t ib_uverbs_destroy_ah(struct ib_u struct ib_uverbs_destroy_ah cmd; struct ib_ah *ah; struct ib_uobject *uobj; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_ah_idr, cmd.ah_handle, file->ucontext); + if (!uobj) + return -EINVAL; + ah = uobj->object; - ah = idr_find(&ib_uverbs_ah_idr, cmd.ah_handle); - if (!ah || ah->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_ah(ah); + if (!ret) + uobj->live = 0; - uobj = ah->uobject; + put_uobj_write(uobj); - ret = ib_destroy_ah(ah); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_ah_idr, cmd.ah_handle); + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); mutex_lock(&file->mutex); list_del(&uobj->list); mutex_unlock(&file->mutex); - kfree(uobj); + put_uobj(uobj); -out: - 
mutex_unlock(&ib_uverbs_idr_mutex); - - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_attach_mcast(struct ib_uverbs_file *file, @@ -1670,47 +1852,43 @@ ssize_t ib_uverbs_attach_mcast(struct ib { struct ib_uverbs_attach_mcast cmd; struct ib_qp *qp; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_uverbs_mcast_entry *mcast; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) + return -EINVAL; - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + obj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); - list_for_each_entry(mcast, &uobj->mcast_list, list) + list_for_each_entry(mcast, &obj->mcast_list, list) if (cmd.mlid == mcast->lid && !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { ret = 0; - goto out; + goto out_put; } mcast = kmalloc(sizeof *mcast, GFP_KERNEL); if (!mcast) { ret = -ENOMEM; - goto out; + goto out_put; } mcast->lid = cmd.mlid; memcpy(mcast->gid.raw, cmd.gid, sizeof mcast->gid.raw); ret = ib_attach_mcast(qp, &mcast->gid, cmd.mlid); - if (!ret) { - uobj = container_of(qp->uobject, struct ib_uqp_object, - uevent.uobject); - list_add_tail(&mcast->list, &uobj->mcast_list); - } else + if (!ret) + list_add_tail(&mcast->list, &obj->mcast_list); + else kfree(mcast); -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); return ret ? 
ret : in_len; } @@ -1720,7 +1898,7 @@ ssize_t ib_uverbs_detach_mcast(struct ib int out_len) { struct ib_uverbs_detach_mcast cmd; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_qp *qp; struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; @@ -1728,19 +1906,17 @@ ssize_t ib_uverbs_detach_mcast(struct ib if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) + return -EINVAL; ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); if (ret) - goto out; + goto out_put; - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + obj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); - list_for_each_entry(mcast, &uobj->mcast_list, list) + list_for_each_entry(mcast, &obj->mcast_list, list) if (cmd.mlid == mcast->lid && !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { list_del(&mcast->list); @@ -1748,8 +1924,8 @@ ssize_t ib_uverbs_detach_mcast(struct ib break; } -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); return ret ? 
ret : in_len; } @@ -1761,7 +1937,7 @@ ssize_t ib_uverbs_create_srq(struct ib_u struct ib_uverbs_create_srq cmd; struct ib_uverbs_create_srq_resp resp; struct ib_udata udata; - struct ib_uevent_object *uobj; + struct ib_uevent_object *obj; struct ib_pd *pd; struct ib_srq *srq; struct ib_srq_init_attr attr; @@ -1777,17 +1953,17 @@ ssize_t ib_uverbs_create_srq(struct ib_u (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + init_uobj(&obj->uobject, 0, file->ucontext); + down_write(&obj->uobject.mutex); - if (!pd || pd->uobject->context != file->ucontext) { + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) { ret = -EINVAL; - goto err_up; + goto err; } attr.event_handler = ib_uverbs_srq_event_handler; @@ -1796,59 +1972,59 @@ ssize_t ib_uverbs_create_srq(struct ib_u attr.attr.max_sge = cmd.max_sge; attr.attr.srq_limit = cmd.srq_limit; - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->events_reported = 0; - INIT_LIST_HEAD(&uobj->event_list); + obj->events_reported = 0; + INIT_LIST_HEAD(&obj->event_list); srq = pd->device->create_srq(pd, &attr, &udata); if (IS_ERR(srq)) { ret = PTR_ERR(srq); - goto err_up; + goto err; } srq->device = pd->device; srq->pd = pd; - srq->uobject = &uobj->uobject; + srq->uobject = &obj->uobject; srq->event_handler = attr.event_handler; srq->srq_context = attr.srq_context; atomic_inc(&pd->usecnt); atomic_set(&srq->usecnt, 0); - memset(&resp, 0, sizeof resp); - - ret = idr_add_uobj(&ib_uverbs_srq_idr, srq, &uobj->uobject); + obj->uobject.object = srq; + ret = idr_add_uobj(&ib_uverbs_srq_idr, &obj->uobject); if (ret) goto err_destroy; - resp.srq_handle = uobj->uobject.id; + memset(&resp, 0, sizeof resp); + resp.srq_handle = obj->uobject.id; 
resp.max_wr = attr.attr.max_wr; resp.max_sge = attr.attr.max_sge; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->srq_list); + list_add_tail(&obj->uobject.list, &file->ucontext->srq_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_srq_idr, uobj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_srq_idr, &obj->uobject); err_destroy: ib_destroy_srq(srq); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err: + put_uobj_write(&obj->uobject); return ret; } @@ -1864,21 +2040,16 @@ ssize_t ib_uverbs_modify_srq(struct ib_u if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) { - ret = -EINVAL; - goto out; - } + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) + return -EINVAL; attr.max_wr = cmd.max_wr; attr.srq_limit = cmd.srq_limit; ret = ib_modify_srq(srq, &attr, cmd.attr_mask); -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_srq_read(srq); return ret ? 
ret : in_len; } @@ -1899,18 +2070,16 @@ ssize_t ib_uverbs_query_srq(struct ib_uv if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) + return -EINVAL; - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (srq && srq->uobject->context == file->ucontext) - ret = ib_query_srq(srq, &attr); - else - ret = -EINVAL; + ret = ib_query_srq(srq, &attr); - mutex_unlock(&ib_uverbs_idr_mutex); + put_srq_read(srq); if (ret) - goto out; + return ret; memset(&resp, 0, sizeof resp); @@ -1920,10 +2089,9 @@ ssize_t ib_uverbs_query_srq(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; + return -EFAULT; -out: - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file, @@ -1932,45 +2100,45 @@ ssize_t ib_uverbs_destroy_srq(struct ib_ { struct ib_uverbs_destroy_srq cmd; struct ib_uverbs_destroy_srq_resp resp; + struct ib_uobject *uobj; struct ib_srq *srq; - struct ib_uevent_object *uobj; + struct ib_uevent_object *obj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - memset(&resp, 0, sizeof resp); + uobj = idr_write_uobj(&ib_uverbs_srq_idr, cmd.srq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + srq = uobj->object; + obj = container_of(uobj, struct ib_uevent_object, uobject); - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_srq(srq); + if (!ret) + uobj->live = 0; - uobj = container_of(srq->uobject, struct ib_uevent_object, uobject); + put_uobj_write(uobj); - ret = ib_destroy_srq(srq); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_srq_idr, cmd.srq_handle); + idr_remove_uobj(&ib_uverbs_srq_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->list); 
mutex_unlock(&file->mutex); - ib_uverbs_release_uevent(file, uobj); + ib_uverbs_release_uevent(file, obj); - resp.events_reported = uobj->events_reported; + memset(&resp, 0, sizeof resp); + resp.events_reported = obj->events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) ret = -EFAULT; -out: - mutex_unlock(&ib_uverbs_idr_mutex); - return ret ? ret : in_len; } diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index ff092a0..5ec2d49 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -66,7 +66,7 @@ #define IB_UVERBS_BASE_DEV MKDEV(IB_UVER static struct class *uverbs_class; -DEFINE_MUTEX(ib_uverbs_idr_mutex); +DEFINE_SPINLOCK(ib_uverbs_idr_lock); DEFINE_IDR(ib_uverbs_pd_idr); DEFINE_IDR(ib_uverbs_mr_idr); DEFINE_IDR(ib_uverbs_mw_idr); @@ -183,21 +183,21 @@ static int ib_uverbs_cleanup_ucontext(st if (!context) return 0; - mutex_lock(&ib_uverbs_idr_mutex); - list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) { - struct ib_ah *ah = idr_find(&ib_uverbs_ah_idr, uobj->id); - idr_remove(&ib_uverbs_ah_idr, uobj->id); + struct ib_ah *ah = uobj->object; + + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); ib_destroy_ah(ah); list_del(&uobj->list); kfree(uobj); } list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { - struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); + struct ib_qp *qp = uobj->object; struct ib_uqp_object *uqp = container_of(uobj, struct ib_uqp_object, uevent.uobject); - idr_remove(&ib_uverbs_qp_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_qp_idr, uobj); ib_uverbs_detach_umcast(qp, uqp); ib_destroy_qp(qp); list_del(&uobj->list); @@ -206,11 +206,12 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { - struct ib_cq *cq = idr_find(&ib_uverbs_cq_idr, uobj->id); + struct ib_cq *cq = uobj->object; struct ib_uverbs_event_file 
*ev_file = cq->cq_context; struct ib_ucq_object *ucq = container_of(uobj, struct ib_ucq_object, uobject); - idr_remove(&ib_uverbs_cq_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); ib_destroy_cq(cq); list_del(&uobj->list); ib_uverbs_release_ucq(file, ev_file, ucq); @@ -218,10 +219,11 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->srq_list, list) { - struct ib_srq *srq = idr_find(&ib_uverbs_srq_idr, uobj->id); + struct ib_srq *srq = uobj->object; struct ib_uevent_object *uevent = container_of(uobj, struct ib_uevent_object, uobject); - idr_remove(&ib_uverbs_srq_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_srq_idr, uobj); ib_destroy_srq(srq); list_del(&uobj->list); ib_uverbs_release_uevent(file, uevent); @@ -231,11 +233,11 @@ static int ib_uverbs_cleanup_ucontext(st /* XXX Free MWs */ list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { - struct ib_mr *mr = idr_find(&ib_uverbs_mr_idr, uobj->id); + struct ib_mr *mr = uobj->object; struct ib_device *mrdev = mr->device; struct ib_umem_object *memobj; - idr_remove(&ib_uverbs_mr_idr, uobj->id); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); ib_dereg_mr(mr); memobj = container_of(uobj, struct ib_umem_object, uobject); @@ -246,15 +248,14 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { - struct ib_pd *pd = idr_find(&ib_uverbs_pd_idr, uobj->id); - idr_remove(&ib_uverbs_pd_idr, uobj->id); + struct ib_pd *pd = uobj->object; + + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); ib_dealloc_pd(pd); list_del(&uobj->list); kfree(uobj); } - mutex_unlock(&ib_uverbs_idr_mutex); - return context->device->dealloc_ucontext(context); } diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 7ced208..ee1f3a3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -697,8 +697,12 @@ struct ib_ucontext { struct ib_uobject { u64 user_handle; /* handle given to us by userspace */ struct ib_ucontext 
*context; /* associated user context */ + void *object; /* containing object */ struct list_head list; /* link to context's list */ u32 id; /* index into kernel idr */ + struct kref ref; + struct rw_semaphore mutex; /* protects .live */ + int live; }; struct ib_umem { From mshefty at ichips.intel.com Mon Jun 12 13:04:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 13:04:45 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: References: Message-ID: <448DC8DD.20800@ichips.intel.com> Sean Hefty wrote: > I'd like to propose that the MAD layer detect duplicate requests. After a > request MAD has been handed to a client, its context would be maintained until > the user calls ib_free_recv_mad(), allowing duplicate requests to be discarded. > {snip} > > Finally, a way would need to be found for when to call ib_free_recv_mad() for > userspace clients. I've been trying to come up with a way to handle userspace clients. Here are a few ideas: 1. Export ib_free_recv_mad() to userspace. This changes the ABI, and would require changes to all existing clients for things to work properly. My preference would be to avoid this option. 2. Change the MAD registration, so that clients indicate which methods generate responses. Again, this changes the ABI. 3. Hard-code which methods generate responses. For most management classes, there's only 3-6 methods that generate responses. The kernel umad module would only free a request MAD after a response had been generated. This would make umad class aware, and would not work for user-defined classes. 4. Modify umad to learn which requests generate responses, by examining response MADs. When a response is sent, umad would mark which method the response is for by flipping the R-bit. Based on the algorithm, this could result in losing responses the first time that a request is seen. Some additional hard-coding would be needed for a Set, since a Set request generates GetResp MADs. 
Comments?

- Sean

From sweitzen at cisco.com  Mon Jun 12 13:21:34 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 12 Jun 2006 13:21:34 -0700
Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI?
Message-ID: 

This didn't help: osu_bibw.c still reports max bi bandwidth in the
1600s; it should be in the 1900s.  I looked back at my notes, and OFED
1.0 rc4 had the desired max bi bandwidth.  Did the uDAPL IB MTU change?

$ mpiexec -genv I_MPI_DAPL_PROVIDER OpenIB-scm -genv I_MPI_DEBUG 3 -genv I_MPI_DEVICE rdssm -genv LD_LIBRARY_PATH .../lib -n 2 ../osu_bibw.x
I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so
I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma
I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so
I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma

# OSU MPI Bidirectional Bandwidth Test (Version 2.1)
# Size          Bi-Bandwidth (MB/s)
1               0.813478
2               1.637650
4               3.260333
8               6.627831
16              12.168080
32              25.683379
64              50.580351
128             95.035855
256             174.132061
512             310.656179
1024            513.066433
2048            726.685587
4096            877.233753
8192            973.311995
16384           1040.096136
32768           849.790165
65536           1088.723063
131072          1296.584344
262144          1428.176271
524288          1540.248671
1048576         1579.665660
2097152         1608.765475
4194304         1628.157462

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

> -----Original Message-----
> From: Arlin Davis [mailto:ardavis at ichips.intel.com]
> Sent: Friday, June 09, 2006 11:38 AM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Tziporet Koren; openfabrics-ewg at openib.org; Davis, Arlin
> R; Lentini, James; openib-general
> Subject: Re: [openib-general] IB MTU tunable for uDAPL and/or
> Intel MPI?
>
> Scott Weitzenkamp (sweitzen) wrote:
>
> > While we're talking about MTUs, is the IB MTU tunable in
> uDAPL and/or
> > Intel MPI via env var or config file?
> > > > Looks like Intel MPI 2.0.1 uses 2K for IB MTU like MVAPICH does in > > OFED 1.0 rc4 and rc6, I'd like to try 1K with Intel MPI. > > > > Scott > > > There is no mechanism for me to modify the MTU using rdma_cm > so whatever > is returned in the path record is what you get with the OpenIB-cma > provider. However, you could use the OpenIB-scm provider > which is hard > coded for 1K MTU as a comparision. Can you run with "-genv > I_MPI_DAPL_PROVIDER OpenIB-scm" on your cluster? > > -arlin > > > > > > -------------------------------------------------------------- > ---------- > > *From:* openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] *On Behalf Of *Scott > > Weitzenkamp (sweitzen) > > *Sent:* Thursday, June 08, 2006 4:38 PM > > *To:* Tziporet Koren; openfabrics-ewg at openib.org > > *Cc:* openib-general > > *Subject:* RE: [openib-general] OFED-1.0-rc6 is available > > > > The MTU change undos the changes for bug 81, so I have reopened > > bug 81 (http://openib.org/bugzilla/show_bug.cgi?id=81). > > > > With rc6, PCI-X osu_bw and osu_bibw performance is bad, > and PCI-E > > osu_bibw performance is bad. I've enclosed some > performance data, > > look at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. > > > > Are there other benchmarks driving the changes in rc6 (and rc4)? > > > > Scott Weitzenkamp > > SQA and Release Manager > > Server Virtualization Business Unit > > Cisco Systems > > > > > > > > > > *OSU MPI:* > > > > * Added mpi_alltoall fine tuning parameters > > > > * Added default configuration/documentation file > > $MPIHOME/etc/mvapich.conf > > > > * Added shell configuration files > > $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh > > > > * Default MTU was changed back to 2K for > InfiniHost III > > Ex and InfiniHost III Lx HCAs. 
For InfiniHost card > recommended > > value is: > > VIADEV_DEFAULT_MTU=MTU1024 > > > >------------------------------------------------------------- > ----------- > > > >_______________________________________________ > >openib-general mailing list > >openib-general at openib.org > >http://openib.org/mailman/listinfo/openib-general > > > >To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > From rdreier at cisco.com Mon Jun 12 13:37:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 13:37:11 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: <20060612121635.GX7359@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 12 Jun 2006 15:16:35 +0300") References: <20060612121635.GX7359@mellanox.co.il> Message-ID: This makes me sad. We're adding considerable code to the CQ polling fast path to handle a rare FW bug. I wish there were a better way. - R. From mst at mellanox.co.il Mon Jun 12 13:47:59 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 23:47:59 +0300 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: References: Message-ID: <20060612204759.GB17643@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca: memfree completion with error workaround > > This makes me sad. We're adding considerable code to the CQ polling > fast path It might not be too bad - there's a single additional test on fastpath, and I am guessing both wqe_index and rq.max should be in registers already. Once wqe_index is out of rq.max range we are on slow path. > I wish there were a better way. We can make it a compile-time option, so that users can disable it once there's a firmware that does not need this code. 
-- MST From rdreier at cisco.com Mon Jun 12 13:47:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 13:47:40 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: <20060612204759.GB17643@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 12 Jun 2006 23:47:59 +0300") References: <20060612204759.GB17643@mellanox.co.il> Message-ID: Michael> It might not be too bad - there's a single additional Michael> test on fastpath, and I am guessing both wqe_index and Michael> rq.max should be in registers already. Once wqe_index is Michael> out of rq.max range we are on slow path. But it bloats the function and adds to i-cache footprint. I'm sure it benchmarks fine but it adds to general cache usage that pushes useful code out of cache. Unfortunately I don't see a clean way to move it out of line. Michael> We can make it a compile-time option, so that users can Michael> disable it once there's a firmware that does not need Michael> this code. No distro is ever going to turn it off though. - R. From ardavis at ichips.intel.com Mon Jun 12 14:00:43 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 12 Jun 2006 14:00:43 -0700 Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI? In-Reply-To: References: Message-ID: <448DD5FB.5090205@ichips.intel.com> Scott Weitzenkamp (sweitzen) wrote: >This didn't help. Osu_bibw.c still reports max bi bandwidth in the >1600s, should be in the 1900s. I looked back at my notes, and OFED 1.0 >rc4 had desired max bi bandwidth with OFED 1.0 rc4, did the uDAPL IB MTU >change? > > uDAPL does not have any control over IB MTU using OpenIB-cma. We just use the path record that is supplied from Open SM. Not sure where or when the change occurred but it is not in uDAPL.
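For reference, the 1K/2K values being compared in this thread travel as the IB path-MTU enum (mirrored by libibverbs as enum ibv_mtu). A small standalone helper, written here as a sketch, converts the encoding to bytes:

```c
#include <assert.h>

/* IB path MTU encoding per the InfiniBand spec (libibverbs mirrors
 * these values in enum ibv_mtu): 1 -> 256 bytes up to 5 -> 4096. */
enum ib_mtu {
	IB_MTU_256  = 1,
	IB_MTU_512  = 2,
	IB_MTU_1024 = 3,
	IB_MTU_2048 = 4,
	IB_MTU_4096 = 5,
};

static int ib_mtu_to_bytes(int mtu)
{
	if (mtu < IB_MTU_256 || mtu > IB_MTU_4096)
		return -1;	/* not a valid encoding */
	return 128 << mtu;	/* 128 << 3 == 1024, 128 << 4 == 2048 */
}
```

So VIADEV_DEFAULT_MTU=MTU1024 selects encoding 3, and the 2K default selects encoding 4; the path record carries the same encoding.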
>$ mpiexec -genv I_MPI_DAPL_PROVIDER OpenIB-scm -genv I_MPI_DEBUG 3 -genv >I_MPI_DEVICE rdssm -genv LD_LIBRARY_PATH .../lib -n 2 ../osu_bibw.x >I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so >I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma >I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so >I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma > > It picked up the OpenIB-cma device instead of OpenIB-scm. -arlin From ftillier at silverstorm.com Mon Jun 12 14:00:46 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Mon, 12 Jun 2006 14:00:46 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: References: <20060612204759.GB17643@mellanox.co.il> Message-ID: <79ae2f320606121400j320ee074v939d61435ee93cad@mail.gmail.com> Hi Roland, On 6/12/06, Roland Dreier wrote: > Michael> It might not be too bad - there's a single additional > Michael> test on fastpath, and I am guessing both wqe_index and > Michael> rq.max should be in registers already. Once wqe_index is > Michael> out of rq.max range we are on slow path. > > But it bloats the function and adds to i-cache footprint. I'm sure it > benchmarks fine but it adds to general cache usage that pushes useful > code out of cache. > > Unfortunately I don't see a clean way to move it out of line. Why not just have multiple implementations of the function, and set up the function pointer in the verbs according to what firmware and device is in use? That way devices not affected could continue to use the optimized version... Just a thought - it does have the drawback of having multiple similar functions.
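Fabian's suggestion in outline: pick the polling routine once, at device init, from the firmware revision, so unaffected devices keep the lean fast path. All names below are hypothetical stand-ins, not the actual mthca or verbs structures:

```c
#include <assert.h>

/* Hypothetical device with a per-device CQ-polling entry point. */
struct toy_dev {
	unsigned int fw_rev;
	int (*poll_cq)(struct toy_dev *dev);
};

static int poll_cq_fast(struct toy_dev *dev)
{
	(void) dev;
	return 0;	/* lean path, no WQE-index fixup */
}

static int poll_cq_workaround(struct toy_dev *dev)
{
	(void) dev;
	return 1;	/* path carrying the WQE-index fixup */
}

/* Hypothetical predicate for firmware known to report bad WQE
 * indices in error CQEs. */
static int fw_needs_workaround(unsigned int fw_rev)
{
	return fw_rev == 0x100800 || fw_rev == 0x501400;
}

/* Dispatch is chosen once here, not tested on every completion. */
static void toy_dev_init(struct toy_dev *dev, unsigned int fw_rev)
{
	dev->fw_rev = fw_rev;
	dev->poll_cq = fw_needs_workaround(fw_rev) ?
		poll_cq_workaround : poll_cq_fast;
}
```

The drawback he concedes remains: two near-identical poll routines to keep in sync.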
- Fab From mshefty at ichips.intel.com Mon Jun 12 14:03:51 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 14:03:51 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: Message-ID: <448DD6B7.7010305@ichips.intel.com> > I started thinking about the "kill ib_uverbs_idr_mutex" problem, and I > realized that there are actually some interesting issues there (as > described in the comment at the top of uverbs_cmd.c). In fact I ended > up coding the solution below. This passes some basic tests but it > could probably use some review. The basic approach seems fine to me. It would be nice to eliminate the live flag, but I can't think of a way to do so. - Sean From rdreier at cisco.com Mon Jun 12 14:18:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 14:18:09 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: <79ae2f320606121400j320ee074v939d61435ee93cad@mail.gmail.com> (Fabian Tillier's message of "Mon, 12 Jun 2006 14:00:46 -0700") References: <20060612204759.GB17643@mellanox.co.il> <79ae2f320606121400j320ee074v939d61435ee93cad@mail.gmail.com> Message-ID: Fabian> Why not just have multiple implemenations of the function, Fabian> and setup the function pointer in the verbs according to Fabian> what firmware and device is in use? That way devices not Fabian> affected could continue to use the optimized version... Yup, that's the obvious solution. But I'm not sure it's worth that much bloat in this case. 
From rdreier at cisco.com Mon Jun 12 14:19:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 14:19:32 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: <448DD6B7.7010305@ichips.intel.com> (Sean Hefty's message of "Mon, 12 Jun 2006 14:03:51 -0700") References: <448DD6B7.7010305@ichips.intel.com> Message-ID: Sean> It would be nice to eliminate the live flag, but I can't Sean> think of a way to do so. Agreed -- unfortunately I'm not smart enough either (and believe me I came up with some complicated attempts) From rjwalsh at pathscale.com Mon Jun 12 14:29:16 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 12 Jun 2006 14:29:16 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: Message-ID: <1150147756.23063.16.camel@hematite.internal.keyresearch.com> On Mon, 2006-06-12 at 12:36 -0700, Roland Dreier wrote: > IB/uverbs: Don't serialize with ib_uverbs_idr_mutex This looks good - I had started something similar but your solution solves some problems mine had (by using the live flag, even if it is kind of bleh.) Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Mon Jun 12 14:49:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 14:49:15 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: <20060612121635.GX7359@mellanox.co.il> (Michael S. 
Tsirkin's message of "Mon, 12 Jun 2006 15:16:35 +0300") References: <20060612121635.GX7359@mellanox.co.il> Message-ID: > + /* WQE index == -1 might be reported by > + Sinai FW 1.0.800, Arbel FW 5.1.400 and should be fixed > + in later revisions. */ In the future please use /* * comment */ style comments. > + if (unlikely(wqe_index >= (*cur_qp)->rq.max)) { > + if (unlikely(is_error) && > + unlikely(wqe_index == 0xffffffff >> wq->wqe_shift) && seems like the inside unlikely()s are wrong here -- the reason we expect to be here is exactly the reason being marked unlikely, which is backwards. > + mthca_is_memfree(dev)) > + wqe_index = wq->max - 1; > + else { > + mthca_err(dev, "Corrupted RQ CQE. " > + "CQ 0x%x QP 0x%x idx 0x%x > 0x%x\n", > + cq->cqn, entry->qp_num, wqe_index, > + wq->max); > + return -EINVAL; This should probably be "err = -EINVAL; goto out;" right? > + } > + } > entry->wr_id = (*cur_qp)->wrid[wqe_index]; > } > From bos at pathscale.com Mon Jun 12 15:09:45 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 12 Jun 2006 15:09:45 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F0B6BD@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F0B6BD@orsmsx408> Message-ID: <1150150185.3217.0.camel@chalcedony.pathscale.com> On Mon, 2006-06-12 at 10:49 -0700, Woodruff, Robert J wrote: > Still does not seem to compile. Please try this one instead: http://openib.red-bean.com/OFED-1.0-rc6+ipath-2.tar.bz2 From mshefty at ichips.intel.com Mon Jun 12 15:22:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 15:22:19 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <448DC8DD.20800@ichips.intel.com> References: <448DC8DD.20800@ichips.intel.com> Message-ID: <448DE91B.3080607@ichips.intel.com> Sean Hefty wrote: > 4. Modify umad to learn which requests generate responses, by examining response > MADs. 
When a response is sent, umad would mark which method the response is for > by flipping the R-bit. Based on the algorithm, this could result in losing > responses the first time that a request is seen. Some additional hard-coding > would be needed for a Set, since a Set request generates GetResp MADs. This brings up a concern. There doesn't seem to be a limit to the number of received MADs that can be queued for a user. Should we have such a limit? - Sean From robert.j.woodruff at intel.com Mon Jun 12 15:41:36 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 12 Jun 2006 15:41:36 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F0BDBB@orsmsx408> Brian wrote, >Please try this one instead: http://openib.red-bean.com/OFED-1.0-rc6+ipath-2.tar.bz2 Got farther, but now fails trying to build DAPL/rdmacm. This did not fail with the original RC6. woody gcc: /var/tmp/OFED/tmp/openib/openib/src/userspace/librdmacm/src/.libs/.libs/ librdmacm.so: No such file or directory make[2]: *** [dapl/udapl/libdaplcma.la] Error 1 make[2]: Leaving directory `/var/tmp/OFED/tmp/openib/openib/src/userspace/dapl' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFED/tmp/openib/openib/src/userspace/dapl' make: *** [dapl] Error 2 ERROR: Failed to execute: env make user ~ From krause at cup.hp.com Mon Jun 12 14:18:27 2006 From: krause at cup.hp.com (Michael Krause) Date: Mon, 12 Jun 2006 14:18:27 -0700 Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI? 
In-Reply-To: References: Message-ID: <6.2.0.14.2.20060612135825.02de0fe8@esmail.cup.hp.com> At 10:44 AM 6/9/2006, Scott Weitzenkamp (sweitzen) wrote: >Content-class: urn:content-classes:message >Content-Type: multipart/alternative; > boundary="----_=_NextPart_001_01C68BEC.6C768F57" >Content-Transfer-Encoding: 7bit > >While we're talking about MTUs, is the IB MTU tunable in uDAPL and/or >Intel MPI via env var or config file? > >Looks like Intel MPI 2.0.1 uses 2K for IB MTU like MVAPICH does in OFED >1.0 rc4 and rc6, I'd like to try 1K with Intel MPI. IB MTU should be set on a per path basis by the SM. An application should examine the PMTU for a given path and take appropriate action - really only applies to UD as connected mode should automatically SAR requests. Communicating PMTU to an application should not occur unless it is datagram based. The same is true for iWARP where TCP / IP takes care of the PMTU on behalf of the ULP / application. If you want to control PMTU, then do so via the SM directly which was the intention of the architecture and specification. Mike > >Scott > >---------- >From: openib-general-bounces at openib.org >[mailto:openib-general-bounces at openib.org] On Behalf Of Scott Weitzenkamp >(sweitzen) >Sent: Thursday, June 08, 2006 4:38 PM >To: Tziporet Koren; openfabrics-ewg at openib.org >Cc: openib-general >Subject: RE: [openib-general] OFED-1.0-rc6 is available >The MTU change undoes the changes for bug 81, so I have reopened bug 81 >(http://openib.org/bugzilla/show_bug.cgi?id=81). > > >With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E osu_bibw >performance is bad. I've enclosed some performance data, look at rc4 vs >rc5 vs rc6 for Cougar/Cheetah/LionMini. > >Are there other benchmarks driving the changes in rc6 (and rc4)?
> >Scott Weitzenkamp >SQA and Release Manager >Server Virtualization Business Unit >Cisco Systems > > > > >OSU MPI: >· Added mpi_alltoall fine tuning parameters >· Added default configuration/documentation file >$MPIHOME/etc/mvapich.conf >· Added shell configuration files $MPIHOME/etc/mvapich.csh , >$MPIHOME/etc/mvapich.csh >· Default MTU was changed back to 2K for InfiniHost III Ex and >InfiniHost III Lx HCAs. For InfiniHost card recommended value is: >VIADEV_DEFAULT_MTU=MTU1024 >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From bos at pathscale.com Mon Jun 12 15:50:53 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 12 Jun 2006 15:50:53 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F0BDBB@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F0BDBB@orsmsx408> Message-ID: <1150152653.3217.8.camel@chalcedony.pathscale.com> On Mon, 2006-06-12 at 15:41 -0700, Woodruff, Robert J wrote: > Brian wrote, > > >Please try this one instead: > > http://openib.red-bean.com/OFED-1.0-rc6+ipath-2.tar.bz2 > > Got farther, but now fails trying to build > DAPL/rdmacm. This did not fail with the original RC6. Yeah, I see that too. As long as the ipath driver built, I'm happy enough for now :-) References: <448DC8DD.20800@ichips.intel.com> <448DE91B.3080607@ichips.intel.com> Message-ID: <1150152620.570.113031.camel@hal.voltaire.com> On Mon, 2006-06-12 at 18:22, Sean Hefty wrote: > Sean Hefty wrote: > > 4. Modify umad to learn which requests generate responses, by examining response > > MADs. When a response is sent, umad would mark which method the response is for > > by flipping the R-bit. 
Based on the algorithm, this could result in losing > > responses the first time that a request is seen. Some additional hard-coding > > would be needed for a Set, since a Set request generates GetResp MADs. > > This brings up a concern. There doesn't seem to be a limit to the number of > received MADs that can be queued for a user. Should we have such a limit? How are MADs counted ? Is a multisegment MAD 1 MAD or multiple MADs ? If the latter, it seems problematic to limit this as the response to a get response might be very large. -- Hal > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From betsy at pathscale.com Mon Jun 12 16:13:40 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Mon, 12 Jun 2006 16:13:40 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <448C2946.5010707@mellanox.co.il> References: <1149895236.27921.2.camel@pelerin.serpentine.com> <448C2946.5010707@mellanox.co.il> Message-ID: <1150154020.3034.107.camel@sarium.pathscale.com> Tziporet - Bryan has confirmed that with the patches you've copied, things should work correctly. We've been testing with our version, but I really want to test on the OFED-1.0 version that you've built. Can you send us a pointer to it? Thanks, Betsy On Sun, 2006-06-11 at 17:31 +0300, Tziporet Koren wrote: > Bryan O'Sullivan wrote: > > Due to unfortunate timing, the ipath driver in OFED 1.0-rc6 does not > > work correctly. You can download an updated tarball from here, for > > which the ipath driver works fine: > > > > http://openib.red-bean.com/OFED-1.0-rc6+ipath.tar.bz2 > > > > Alternatively, pull the necessary patches from SVN. 
> > > > > > > __ > Hi Bryan > > You have forgotten some of the patches in your tarball file, thus several > OSes do not pass (e.g. RH EL4 up3). > > /openib-1.0/patches/ > ls */ipath* > 2.6.11_FC4/ipath_backport.patch 2.6.13/ipath_backport.patch > 2.6.15/ipath_backport.patch > 2.6.11/ipath_backport.patch 2.6.13_suse10_0_u/ipath_backport.patch > 2.6.9/ipath_backport.patch > 2.6.12/ipath_backport.patch 2.6.14/ipath_backport.patch > fixes/ipath_rollup.patch > /openib-1.0/patches/ > ls > 2.6.11/ 2.6.12/ 2.6.13_suse10_0_u/ 2.6.15/ 2.6.16_sles10/ > 2.6.9/ dapl/ memtrack/ > 2.6.11_FC4/ 2.6.13/ 2.6.14/ 2.6.16/ 2.6.17/ > 2.6.9_U3/ fixes/ > > I took the liberty of copying these patches to svn since I noticed that > ipath backport patches are the same for all OSes & kernels. > Please take a look and make sure the ipath driver compiles & loads on all > supported systems (including ia64, PPC64 etc.) > > Tziporet > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > From sean.hefty at intel.com Mon Jun 12 16:32:36 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 16:32:36 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <1150154020.3034.107.camel@sarium.pathscale.com> Message-ID: <000b01c68e78$7d214570$ff0da8c0@amr.corp.intel.com> >Tziporet - Bryan has confirmed that with the patches you've copied, >things should work correctly. We've been testing with our version, but >I really want to test on the OFED-1.0 version that you've built. Can you >send us a pointer to it? How can you go from an RC6 that doesn't build to a 1.0 release? Shouldn't you at least get a release candidate that builds first?
- Sean From mshefty at ichips.intel.com Mon Jun 12 16:43:26 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 16:43:26 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150152620.570.113031.camel@hal.voltaire.com> References: <448DC8DD.20800@ichips.intel.com> <448DE91B.3080607@ichips.intel.com> <1150152620.570.113031.camel@hal.voltaire.com> Message-ID: <448DFC1E.5080602@ichips.intel.com> Hal Rosenstock wrote: >>This brings up a concern. There doesn't seem to be a limit to the number of >>received MADs that can be queued for a user. Should we have such a limit? > > > How are MADs counted ? Is a multisegment MAD 1 MAD or multiple MADs ? If > the latter, it seems problematic to limit this as the response to a get > response might be very large. I could go either way, or use a hybrid of some sort. Counting a multisegment MAD as 1 MAD might be a little easier. We could also allow something like 100 segments, or at least 1 reassembled MAD. So, the user could have 100 single segment MADs, 50 2-segment MADs, etc. Without some sort of restriction, a userspace app that's slow to pull receive MADs from the kernel would result in consuming a large amount of kernel memory. - Sean From robert.j.woodruff at intel.com Mon Jun 12 16:47:23 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 12 Jun 2006 16:47:23 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F40A7C@orsmsx408> Sean wrote, >>Tziporet - Bryan has confirmed that with the patches you've copied, >>things should work correctly. We've been testing with our version, but >>I really want to test on the OFED-1.0 version that you've built. Can you >>send us a pointer to it? >How can you go from an RC6 that doesn't build to a 1.0 release? Shouldn't you >at least get a release candidate that builds first?
>- Sean I agree, don't see how we can go from something that has never been tested by the wider community to released. Has anyone run uDAPL tests or Intel MPI with Pathscale ? woody From sweitzen at cisco.com Mon Jun 12 17:29:38 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 12 Jun 2006 17:29:38 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: I agree, having an rc7 then ~three days to test it for regressions seems appropriate. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of > Woodruff, Robert J > Sent: Monday, June 12, 2006 4:47 PM > To: Hefty, Sean; Betsy Zeller; Tziporet Koren > Cc: OpenFabricsEWG; openib-general > Subject: Re: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 > tarball available with working ipath driver > > Sean wrote, > >>Tziporet - Bryan has confirmed that with the patches you've copied, > >>things should work correctly. We've been testing with our > version, but > >>I really want to test on the OFED-1.0 version that you've built. Can > you > >>send us a pointer to it? > > >How can you go from an RC6 that doesn't build to a 1.0 release? > Shouldn't you > >at least get a release candidate that builds first? > > >- Sean > > I agree, don't see how we can go from something that has never been > tested > by the wider community to released. Has anyone run uDAPL > tests or Intel > MPI > with Pathscale ? 
> > woody > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From boris at mellanox.com Mon Jun 12 17:53:22 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Mon, 12 Jun 2006 17:53:22 -0700 Subject: [openib-general] MVAPICH failure on IBM PPC-64 Linux machine Message-ID: <1E3DCD1C63492545881FACB6063A57C1324241@mtiexch01.mti.com> Hi, I've run into the following failure running OSU MPI out of OFED-rc5 on an IBM PPC-64 platform: [1] Abort: Error creating QP at line 820 in file viainit.c mpirun: executable version 1 does not match our version 3, This seems to be a memory allocation issue which could be easily explained (and overcome) if the job is launched with regular user permissions, but in my case it's root who launches it. Has anybody tested OFED's OSU MPI on the PPC-64 platform recently and can comment on this ? Thanks, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From bos at pathscale.com Mon Jun 12 19:03:43 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 12 Jun 2006 19:03:43 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <448C2946.5010707@mellanox.co.il> References: <1149895236.27921.2.camel@pelerin.serpentine.com> <448C2946.5010707@mellanox.co.il> Message-ID: <1150164223.741.3.camel@pelerin.serpentine.com> On Sun, 2006-06-11 at 17:31 +0300, Tziporet Koren wrote: > Please take a look and make sure the ipath driver compiles & loads on all > supported systems (including ia64, PPC64 etc.)
I can't build successfully on RHEL4 U3, because SDP isn't compiling there; it complains about the parameter count on line 1168 of sdp_main.c. Looks like the ipath stuff is fine on the systems we tried (SLES10 RC2 and RHEL4 U3). References: <448DC8DD.20800@ichips.intel.com> <448DE91B.3080607@ichips.intel.com> <1150152620.570.113031.camel@hal.voltaire.com> <448DFC1E.5080602@ichips.intel.com> Message-ID: <1150167512.570.122321.camel@hal.voltaire.com> On Mon, 2006-06-12 at 19:43, Sean Hefty wrote: > Hal Rosenstock wrote: > >>This brings up a concern. There doesn't seem to be a limit to the number of > >>received MADs that can be queued for a user. Should we have such a limit? > > > > > > How are MADs counted ? Is a multisegment MAD 1 MAD or multiple MADs ? If > > the latter, it seems problematic to limit this as the response to a get > > response might be very large. > > I could go either way, or use a hybrid of some sort Counting a multisegment MAD > as 1 MAD might be a little easier. We could also allow something like 100 > segments, or at least 1 reassembled MAD. So, the user could have 100 single > segment MADs, 50 2-segment MADs, etc. This seems prone to introducing a different problem and that this number would need to scale with the size of the subnet. > Without some sort of restriction, a userspace app that's slow to pull receive > MADs from the kernel would result in consuming a large amount of kernel memory. Understood but dropping a MAD after acknowledging also seems like a bad thing to me. Couldn't this be controlled on the request side (assuming the request has a response as opposed to unsolicited sends/receives) ? 
-- Hal > - Sean From viswa.krish at gmail.com Mon Jun 12 20:16:00 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Mon, 12 Jun 2006 20:16:00 -0700 Subject: [openib-general] opensm and NPTL Message-ID: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> There were some issues with opensm running with NPTL (thread library). Has the issues been resolved ? Regards, Viswa -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhushisongzhu at yahoo.com Mon Jun 12 20:28:31 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Mon, 12 Jun 2006 20:28:31 -0700 (PDT) Subject: [openib-general] it's hard to download through svn In-Reply-To: Message-ID: <20060613032831.5296.qmail@web36913.mail.mud.yahoo.com> It's hard to download OFED-rc6 release through svn. Do you suggest some more direct way to get them? tks zhu __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From mst at mellanox.co.il Mon Jun 12 21:26:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 07:26:47 +0300 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: References: Message-ID: <20060613042647.GA4621@mellanox.co.il> Quoting r. Roland Dreier : > > + if (unlikely(wqe_index >= (*cur_qp)->rq.max)) { > > + if (unlikely(is_error) && > > + unlikely(wqe_index == 0xffffffff >> wq->wqe_shift) && > > seems like the inside unlikely()s are wrong here -- the reason we > expect to be here is exactly the reason being marked unlikely, which > is backwards. Hmm, right. > > + mthca_is_memfree(dev)) > > + wqe_index = wq->max - 1; > > + else { > > + mthca_err(dev, "Corrupted RQ CQE. " > > + "CQ 0x%x QP 0x%x idx 0x%x > 0x%x\n", > > + cq->cqn, entry->qp_num, wqe_index, > > + wq->max); > > + return -EINVAL; > > This should probably be "err = -EINVAL; goto out;" right? 
But note this branch is there "just in case". -- MST From mst at mellanox.co.il Mon Jun 12 21:40:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 07:40:21 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150167512.570.122321.camel@hal.voltaire.com> References: <1150167512.570.122321.camel@hal.voltaire.com> Message-ID: <20060613044021.GC4621@mellanox.co.il> Quoting r. Hal Rosenstock : > > Without some sort of restriction, a userspace app that's slow to pull receive > > MADs from the kernel would result in consuming a large amount of kernel memory. > > Understood but dropping a MAD after acknowledging also seems like a bad > thing to me. True. Maybe we can find a way to avoid acknowledging the MAD? > Couldn't this be controlled on the request side (assuming > the request has a response as opposed to unsolicited sends/receives) ? Sounds like the wrong thing to do. -- MST From mst at mellanox.co.il Mon Jun 12 21:47:40 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 07:47:40 +0300 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: References: Message-ID: <20060613044740.GD4621@mellanox.co.il> Quoting r. Roland Dreier : > But it bloats the function and adds to i-cache footprint. I'm sure it > benchmarks fine but it adds to general cache usage that pushes useful > code out of cache. Hmm. Would you be more comfortable with just + if (unlikely(wqe_index >= wq->max)) + wqe_index = wq->max - 1; The other case is there just in case, to catch firmware errors. -- MST From mst at mellanox.co.il Mon Jun 12 22:11:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 08:11:49 +0300 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: Message-ID: <20060613051149.GE4621@mellanox.co.il> Quoting r. 
Roland Dreier : > @@ -1089,10 +1271,8 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv > if (!attr) > return -ENOMEM; > > - mutex_lock(&ib_uverbs_idr_mutex); > - > - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); > - if (!qp || qp->uobject->context != file->ucontext) { > + qp = idr_read_qp(cmd.qp_handle, file->ucontext); > + if (!qp) { > ret = -EINVAL; > goto out; > } > @@ -1144,13 +1324,15 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv > attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; > > ret = ib_modify_qp(qp, attr, cmd.attr_mask); > + > + put_qp_read(qp); > + > if (ret) > goto out; > > ret = in_len; > > out: > - mutex_unlock(&ib_uverbs_idr_mutex); > kfree(attr); > > return ret; Won't this let the user issue multiple modify QP commands in parallel on the same QP? mthca at least does not protect against such attempts, and doing this will confuse the hardware. -- MST From halr at voltaire.com Tue Jun 13 03:10:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 06:10:33 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060613044021.GC4621@mellanox.co.il> References: <1150167512.570.122321.camel@hal.voltaire.com> <20060613044021.GC4621@mellanox.co.il> Message-ID: <1150193430.570.138279.camel@hal.voltaire.com> On Tue, 2006-06-13 at 00:40, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > > Without some sort of restriction, a userspace app that's slow to pull receive > > > MADs from the kernel would result in consuming a large amount of kernel memory. > > > > Understood but dropping a MAD after acknowledging also seems like a bad > > thing to me. > > True. Maybe we can find a way to avoid acknowledging the MAD? There are architected ways to do that. There's busy for MADs which could be used for some MADs. For RMPP, would the transfer be ABORTed ? I don't think you can switch to BUSY in the middle (but I'm not 100% sure). 
I don't know how this limit is being used exactly, but it might be best if the RMPP receive were treated as 1 MAD regardless of how many segments it was. -- Hal > > Couldn't this be controlled on the request side (assuming > > the request has a response as opposed to unsolicited sends/receives) ? > > Sounds like the wrong thing to do. From halr at voltaire.com Tue Jun 13 03:15:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 06:15:34 -0400 Subject: [openib-general] opensm and NPTL In-Reply-To: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> Message-ID: <1150193732.570.138496.camel@hal.voltaire.com> Hi Viswa, On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > There were some issues with opensm running with NPTL (thread > library). Has the issues been resolved ? There were some fixes to the signal handling which went in back in the Feb/early March time frame. OpenSM should be better with NPTL now. Is it working for you or are you asking before stepping into these waters again ? -- Hal > Regards, > Viswa > > > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Tue Jun 13 03:44:57 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 13 Jun 2006 13:44:57 +0300 Subject: [openib-general] [PATCH] osm: Provide SUBNET UP message every heavy sweep Message-ID: <86r71tgwra.fsf@mtl066.yok.mtl.com> Hi Hal This trivial patch provides a "SUBNET UP" message (with level INFO) every time the SM completes a full heavy sweep. It is most useful for cases where you want to make sure the SM responded to some change in the fabric.
Also used to sync the various test flows to the end of sweeps. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_state_mgr.c =================================================================== --- opensm/osm_state_mgr.c (revision 7904) +++ opensm/osm_state_mgr.c (working copy) @@ -199,6 +199,8 @@ __osm_state_mgr_up_msg( osm_log( p_mgr->p_log, OSM_LOG_SYS, "SUBNET UP\n" ); /* Format Waived */ /* clear the signal */ p_mgr->p_subn->moved_to_master_state = FALSE; + } else { + osm_log( p_mgr->p_log, OSM_LOG_INFO, "SUBNET UP\n" ); /* Format Waived */ } if( p_mgr->p_subn->opt.sweep_interval ) From moshek at voltaire.com Tue Jun 13 03:51:11 2006 From: moshek at voltaire.com (Moshe Kazir) Date: Tue, 13 Jun 2006 13:51:11 +0300 Subject: [openib-general] OFED-RC4 backport to sles9 sp3 kernel 2.6.5-7.244 Message-ID: The enclosed diff file includes the sles9 sp3 backport changes. I performed the work on RC4. Changes description: - openib-1.0/configure - changed to enable make in kernel 2.6.5 - build_env.sh - changed to support gcc 3.3.3 (the current compiler version) - I created a new directory -> patches/2.6.5-7.244/ - copied all the patches from patches/2.6.9 to the newly created directory. - all the patches created by me are *_6922_to_2_6_5-7.244.patch . Limitations / known bugs: - I checked the work with build.sh / install.sh of the basic package. - ib_ipath was changed to enable compilation; it does not pass insmod as some entry points are not resolved. - As a result of the ipath problem, after /etc/init.d/openibd start you'll get -> [fail], but all the modules are in place and working. - ipoib and the ibv_* programs are working o.k. I performed very short testing on x86_64 and em64t. Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED-1.0-rc4.backport_to_sles9_sp3.patch Type: application/octet-stream Size: 162278 bytes Desc: OFED-1.0-rc4.backport_to_sles9_sp3.patch URL: From tziporet at mellanox.co.il Tue Jun 13 04:21:47 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 13 Jun 2006 14:21:47 +0300 Subject: [openib-general] [Bug 126] RDMA_CM and UCM not loaded on boot Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71DB@mtlexch01.mtl.com> It's not in the default, since CM and CMA are not defined as basic HPC components (basic components are only mthca, ipath, core and ipoib). Thus anyone who wants these modules should change the file /etc/infiniband/openib.conf Tziporet -----Original Message----- From: Arlin Davis [mailto:ardavis at ichips.intel.com] Sent: Monday, June 12, 2006 8:30 PM To: Tziporet Koren Cc: openib; Woodruff, Robert J Subject: Re: [openib-general] [Bug 126] RDMA_CM and UCM not loaded on boot bugzilla-daemon at openib.org wrote: Did the default openib.conf script get updated with: RDMA_CM_LOAD=yes RDMA_UCM_LOAD=yes -arlin From halr at voltaire.com Tue Jun 13 04:12:20 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 07:12:20 -0400 Subject: [openib-general] [PATCH] osm: Provide SUBNET UP message every heavy sweep In-Reply-To: <86r71tgwra.fsf@mtl066.yok.mtl.com> References: <86r71tgwra.fsf@mtl066.yok.mtl.com> Message-ID: <1150197138.570.140615.camel@hal.voltaire.com> Hi Eitan, On Tue, 2006-06-13 at 06:44, Eitan Zahavi wrote: > Hi Hal > > This trivial patch provides a "SUBNET UP" message (with level INFO) > every time the SM completes a full heavy sweep. It is most useful for > cases where you want to make sure teh SM responded to some change in > the fabric. Also used to sync the various test flows to the end of sweeps. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. 
-- Hal From tziporet at mellanox.co.il Tue Jun 13 04:36:29 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 13 Jun 2006 14:36:29 +0300 Subject: [openib-general] OFED-RC4 backport to sles9 sp3 kernel 2.6.5-7.244 In-Reply-To: References: Message-ID: <448EA33D.10800@mellanox.co.il> Moshe Kazir wrote: > The enclosed diff file include sles9 sp3 backporrt changes. > > Great, you did it! You understand we cannot include it in OFED 1.0, since it should be out this week, but we can add it to OFED 1.1, which is due in July. Tziporet From eitan at mellanox.co.il Tue Jun 13 04:30:37 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 14:30:37 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. In-Reply-To: <20060611003243.22430.56582.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003243.22430.56582.stgit@sashak.voltaire.com> Message-ID: <448EA1DD.7090204@mellanox.co.il> Hi Sasha, Please see my comments inside Sasha Khapyorsky wrote: > This patch implements trivial routing module which able to load LFT > tables from dump file. Main features: > - support for unicast LFTs only, support for multicast can be added later > - this will run after min hop matrix calculation > - this will load switch LFTs according to the path entries introduced in > the dump file > - no additional checks will be performed (like is port connected, etc) > - in case when fabric LIDs were changed this will try to reconstruct LFTs > correctly if endport GUIDs are represented in the dump file (in order > to disable this GUIDs may be removed from the dump file or zeroed) I think you could use the concept of directed routes for storing the LIDs too. So in case of new LID assignments you can extract the old -> new mapping by scanning the LIDs of end ports by their DR path. 
Anyway, I think it is required that you also perform topology matching such that if someone changed the topology you are able to figure it out and stop. THIS IS A SERIOUS LIMITATION OF YOUR PROPOSAL. > > The dump file format is compatible with output of 'ibroute' util and for > whole fabric may be generated with script like this: > > for sw_lid in `ibswitches | awk '{print $NF}'` ; do > ibroute $sw_lid > done > /path/to/dump_file > > , or using DR paths: > > > for sw_dr in `ibnetdiscover -v \ > | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ > | sed -e 's/\]\[/,/g' \ > | sort -u` ; do > ibroute -D ${sw_dr} > done > /path/to/dump_file WE SHOULD ALSO PROVIDE A DUMP FILE VIA: 1. OpenSM should dump its routes using this format (like it does today using osm.fdbs) 2. ibdiagnet > > > > diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h > index a637367..ec1d056 100644 > --- a/osm/include/opensm/osm_subnet.h > +++ b/osm/include/opensm/osm_subnet.h > @@ -423,6 +424,10 @@ typedef struct _osm_subn_opt > * routing_engine_name > * Name of used routing engine (other than default Min Hop Algorithm) > * > +* ucast_dump_file > +* Name of the unicast routing dump file from where switch > +* forwearding tables will be loaded ^^^^^^^^^^^ forwarding > +* > * updn_guid_file > * Pointer to name of the UPDN guid file given by User > * > > diff --git a/osm/opensm/osm_ucast_file.c b/osm/opensm/osm_ucast_file.c > new file mode 100644 > index 0000000..a68d9ec > --- /dev/null > +++ b/osm/opensm/osm_ucast_file.c > @@ -0,0 +1,258 @@ > +/* > + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + * $Id$ > + */ > + > +/* > + * Abstract: > + * Implementation of OpenSM unicast routing module which loads > + * routes from the dump file > + * > + * Environment: > + * Linux User Mode > + * > + */ > + > +#if HAVE_CONFIG_H > +# include > +#endif /* HAVE_CONFIG_H */ > + > +#include > +#include > +#include > + > +#include > +#include > +#include > +#include > +#include > + > +#define PARSEERR(log, file_name, lineno, fmt, arg...) \ > + osm_log(log, OSM_LOG_ERROR, "PARSE ERROR: %s:%u: " fmt , \ > + file_name, lineno, ##arg ) > + > +#define PARSEWARN(log, file_name, lineno, fmt, arg...) 
\ > + osm_log(log, OSM_LOG_VERBOSE, "PARSE WARN: %s:%u: " fmt , \ > + file_name, lineno, ##arg ) > + > +static uint16_t remap_lid(osm_opensm_t *p_osm, uint16_t lid, ib_net64_t guid) > +{ > + osm_port_t *p_port; > + uint16_t min_lid, max_lid; > + uint8_t lmc; > + > + p_port = (osm_port_t *)cl_qmap_get(&p_osm->subn.port_guid_tbl, guid); > + if (!p_port || > + p_port == (osm_port_t *)cl_qmap_end(&p_osm->subn.port_guid_tbl)) { > + osm_log(&p_osm->log, OSM_LOG_VERBOSE, > + "remap_lid: cannot find port guid 0x%016" PRIx64 > + " , will use the same lid.\n", cl_ntoh64(guid)); > + return lid; > + } > + > + osm_port_get_lid_range_ho(p_port, &min_lid, &max_lid); > + if (min_lid <= lid && lid <= max_lid) > + return lid; > + > + lmc = osm_port_get_lmc(p_port); > + return min_lid + (lid & ((1 << lmc) - 1)); > +} > + > +static void add_path(osm_opensm_t * p_osm, > + osm_switch_t * p_sw, uint16_t lid, uint8_t port_num, > + ib_net64_t port_guid) > +{ > + uint16_t new_lid; > + uint8_t old_port; > + > + new_lid = port_guid ? 
remap_lid(p_osm, lid, port_guid) : lid; > + old_port = osm_fwd_tbl_get(osm_switch_get_fwd_tbl_ptr(p_sw), new_lid); > + if (old_port != OSM_NO_PATH && old_port != port_num) { > + osm_log(&p_osm->log, OSM_LOG_VERBOSE, > + "add_path: LID collision is detected on switch " > + "0x016%" PRIx64 ", will overwrite LID 0x%x entry.\n", > + cl_ntoh64(osm_node_get_node_guid > + (osm_switch_get_node_ptr(p_sw))), new_lid); > + } > + > + osm_switch_set_path(p_sw, new_lid, port_num, TRUE); > + > + osm_log(&p_osm->log, OSM_LOG_DEBUG, > + "add_path: route 0x%04x(was 0x%04x) %u 0x%016" PRIx64 > + " is added to switch 0x%016" PRIx64 "\n", > + new_lid, lid, port_num, cl_ntoh64(port_guid), > + cl_ntoh64(osm_node_get_node_guid > + (osm_switch_get_node_ptr(p_sw)))); > +} > + > +static void clean_sw_fwd_table(void *arg, void *context) > +{ > + osm_switch_t *p_sw = arg; > + uint16_t lid, max_lid; > + > + max_lid = osm_switch_get_max_lid_ho(p_sw); > + for (lid = 1 ; lid <= max_lid ; lid++) > + osm_switch_set_path(p_sw, lid, OSM_NO_PATH, TRUE); > +} > + > +static int do_ucast_file_load(void *context) > +{ > + char line[1024]; > + char *file_name; > + FILE *file; > + ib_net64_t sw_guid, port_guid; > + osm_opensm_t *p_osm = context; > + osm_switch_t *p_sw; > + uint16_t lid; > + uint8_t port_num; > + unsigned lineno; > + > + file_name = p_osm->subn.opt.ucast_dump_file; > + > + if (!file_name) { > + osm_log(&p_osm->log, OSM_LOG_ERROR, > + "do_ucast_file_load: " > + "ucast dump file name is not defined.\n"); > + return -1; > + } > + > + file = fopen(file_name, "r"); > + if (!file) { > + osm_log(&p_osm->log, OSM_LOG_ERROR, > + "do_ucast_file_load: " > + "cannot open ucast dump file \'%s\'\n", file_name); > + return -1; > + } > + > + cl_qmap_apply_func(&p_osm->subn.sw_guid_tbl, clean_sw_fwd_table, NULL); > + > + lineno = 0; > + p_sw = NULL; > + > + while (fgets(line, sizeof(line) - 1, file) != NULL) { > + char *p, *q; > + lineno++; > + > + p = line; > + while (isspace(*p)) > + p++; > + > + if (*p == 
'#') > + continue; > + > + if (!strncmp(p, "Multicast mlids", 15)) { > + osm_log(&p_osm->log, OSM_LOG_ERROR, > + "do_ucast_file_load: " > + "Multicast dump file is detected. " > + "Skip parsing.\n"); > + } > + else if (!strncmp(p, "Unicast lids", 12)) { > + q = strstr(p, " guid 0x"); > + if (!q) { > + PARSEERR(&p_osm->log, file_name, lineno, > + "cannot parse switch definition\n"); > + return -1; > + } > + p = q + 6; > + sw_guid = strtoll(p, &q, 16); > + if (q && !isspace(*q)) { > + PARSEERR(&p_osm->log, file_name, lineno, > + "cannot parse switch guid: \'%s\'\n", > + p); > + return -1; > + } > + sw_guid = cl_hton64(sw_guid); > + > + p_sw = (osm_switch_t *)cl_qmap_get(&p_osm->subn.sw_guid_tbl, > + sw_guid); > + if (!p_sw || > + p_sw == (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl)) { > + p_sw = NULL; > + osm_log(&p_osm->log, OSM_LOG_VERBOSE, > + "do_ucast_file_load: " > + "cannot find switch %016" PRIx64 ".\n", > + cl_ntoh64(sw_guid)); > + continue; > + } > + } > + else if (p_sw && !strncmp(p, "0x", 2)) { > + lid = strtoul(p, &q, 16); > + if (q && !isspace(*q)) { > + PARSEERR(&p_osm->log, file_name, lineno, > + "cannot parse lid: \'%s\'\n", p); > + return -1; > + } > + p = q; > + while (isspace(*p)) > + p++; > + port_num = strtoul(p, &q, 10); > + if (q && !isspace(*q)) { > + PARSEERR(&p_osm->log, file_name, lineno, > + "cannot parse port: \'%s\'\n", p); > + return -1; > + } > + p = q; > + /* additionally try to exract guid */ > + q = strstr(p, " portguid 0x"); > + if (!q) { > + PARSEWARN(&p_osm->log, file_name, lineno, > + "cannot find port guid " > + "(maybe broken dump): \'%s\'\n", p); > + port_guid = 0; > + } > + else { > + p = q + 10; > + port_guid = strtoll(p, &q, 16); > + if (!q && !isspace(*q) && *q != ':') { > + PARSEWARN(&p_osm->log, file_name, > + lineno, > + "cannot parse port guid " > + "(maybe broken dump): " > + "\'%s\'\n", p); > + port_guid = 0; > + } > + } > + port_guid = cl_hton64(port_guid); > + add_path(p_osm, p_sw, lid, port_num, 
port_guid); > + } > + } > + > + fclose(file); > + return 0; > +} In OpenSM we write with this style: if () { } else if () { } else { }, not any other combination. > + > +int osm_ucast_file_setup(osm_opensm_t * p_osm) > +{ > + p_osm->routing_engine.context = (void *)p_osm; > + p_osm->routing_engine.ucast_build_fwd_tables = do_ucast_file_load; > + return 0; > +} From eitan at mellanox.co.il Tue Jun 13 04:55:13 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 14:55:13 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). In-Reply-To: <20060611003240.22430.88414.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003240.22430.88414.stgit@sashak.voltaire.com> Message-ID: <448EA7A1.8060206@mellanox.co.il> Hi Sasha, As noted in my previous patch 1/4 comments, I think the callbacks should also have an entry for the MinHop stage (maybe this is the ucast_build_fwd_tables?) I have some algorithms in mind that will skip that stage altogether. Also it might make sense for each routing engine to provide its own "dump" routine such that each could support a different file format if needed. The rest of the comments are inline EZ Sasha Khapyorsky wrote: > > diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h > index 3235ad4..3e6e120 100644 > --- a/osm/include/opensm/osm_opensm.h > +++ b/osm/include/opensm/osm_opensm.h > @@ -92,6 +92,18 @@ BEGIN_C_DECLS > * > *********/ > > +/* > + * routing engine structure - yet limited by ucast_fdb_assign and > + * ucast_build_fwd_tables (multicast callbacks may be added later) > + */ > +struct osm_routing_engine { > + const char *name; > + void *context; > + int (*ucast_build_fwd_tables)(void *context); > + int (*ucast_fdb_assign)(void *context); > + void (*delete)(void *context); > +}; It would be nice if you added a standard header to this struct. 
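For illustration only, a documentation header for this struct in the style used elsewhere in OpenSM might look like the sketch below; the callback descriptions are my reading of the patch, not Sasha's wording:

```c
#include <assert.h>

/****s* OpenSM: OpenSM/osm_routing_engine
* NAME
*	osm_routing_engine
*
* DESCRIPTION
*	Pluggable unicast routing engine. ucast_build_fwd_tables is
*	invoked once per sweep to fill all switch forwarding tables;
*	ucast_fdb_assign is invoked per port to choose an output port.
*	Both return 0 on success. delete releases the engine context.
*
* SYNOPSIS
*/
struct osm_routing_engine {
	const char *name;
	void *context;
	int (*ucast_build_fwd_tables)(void *context);
	int (*ucast_fdb_assign)(void *context);
	void (*delete)(void *context);
};
```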
It is not clear to me what ucast_build_fwd_tables and ucast_fdb_assign are mapping to. Please see the next section as an example for a struct header. > + > /****s* OpenSM: OpenSM/osm_opensm_t > * NAME > * osm_opensm_t > @@ -116,7 +128,7 @@ typedef struct _osm_opensm_t > osm_log_t log; > cl_dispatcher_t disp; > cl_plock_t lock; > - updn_t *p_updn_ucast_routing; > + struct osm_routing_engine routing_engine; > osm_stats_t stats; > } osm_opensm_t; > /* > @@ -153,6 +165,9 @@ typedef struct _osm_opensm_t > * lock > * Shared lock guarding most OpenSM structures. > * > +* routing_engine > +* Routing engine, will be initialized then used > +* > * stats > * Open SM statistics block > * > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c > index cac7f9b..0c0d635 100644 > --- a/osm/opensm/osm_ucast_mgr.c > +++ b/osm/opensm/osm_ucast_mgr.c > @@ -62,6 +62,7 @@ #include > #include > #include > #include > +#include > > #define LINE_LENGTH 256 > > @@ -269,7 +270,7 @@ osm_ucast_mgr_dump_ucast_routes( > strcat( p_mgr->p_report_buf, "yes" ); > else > { > - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) { > + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { > ui_ucast_fdb_assign_func_defined = TRUE; > } else { > ui_ucast_fdb_assign_func_defined = FALSE; > @@ -708,7 +709,7 @@ __osm_ucast_mgr_process_port( > node_guid = osm_node_get_node_guid(osm_switch_get_node_ptr( p_sw ) ); > > /* Flag to mark whether or not a ui ucast fdb assign function was given */ > - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) > + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) > ui_ucast_fdb_assign_func_defined = TRUE; > else > ui_ucast_fdb_assign_func_defined = FALSE; > @@ -753,7 +754,7 @@ __osm_ucast_mgr_process_port( > > /* Up/Down routing can cause unreachable routes between some > switches so we do not report that as an error in that case */ > - if (!p_mgr->p_subn->opt.updn_activate) > + if (!p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) > { > osm_log( 
p_mgr->p_log, OSM_LOG_ERROR, > "__osm_ucast_mgr_process_port: ERR 3A08: " > @@ -973,6 +974,18 @@ __osm_ucast_mgr_process_tbl( > /********************************************************************** > **********************************************************************/ > static void > +__osm_ucast_mgr_set_table_cb( > + IN cl_map_item_t* const p_map_item, > + IN void* context ) > +{ > + osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; > + osm_ucast_mgr_t* const p_mgr = (osm_ucast_mgr_t*)context; > + __osm_ucast_mgr_set_table( p_mgr, p_sw ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +static void > __osm_ucast_mgr_process_neighbors( > IN cl_map_item_t* const p_map_item, > IN void* context ) > @@ -1058,12 +1071,14 @@ osm_ucast_mgr_process( > { > uint32_t i; > uint32_t iteration_max; > + struct osm_routing_engine *p_routing_eng; > osm_signal_t signal; > cl_qmap_t *p_sw_guid_tbl; > > OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process ); > > p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl; > + p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; > > CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); > > @@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( > i > ); > > + if (p_routing_eng->ucast_build_fwd_tables && > + p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) > + { > + cl_qmap_apply_func( p_sw_guid_tbl, > + __osm_ucast_mgr_set_table_cb, p_mgr ); > + } /* fallback on the regular path in case of failures */ > + else > + { Please explain why this step is needed and why if the routing engine function is returning 0 you still invoke the standard __osm_ucast_mgr_set_table_cb. > /* > This is the place where we can load pre-defined routes > into the switches fwd_tbl structures. 
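On the question of why __osm_ucast_mgr_set_table_cb still runs when the engine returns 0: as I read the patch, a zero return means the engine has already filled the in-memory fwd_tbl structures, so only the step that pushes them to the switches remains; any other return falls back to the regular MinHop path. A minimal sketch of that control flow (all names are illustrative stand-ins):

```c
#include <assert.h>
#include <stddef.h>

typedef int (*build_fn)(void *context);

/* If a routing engine is plugged in and succeeds (returns 0), only the
 * "push precomputed tables to switches" step runs; otherwise fall back
 * to the default MinHop computation, which pushes tables itself. */
static int route_with_fallback(build_fn engine, void *ctx,
                               int (*push_tables)(void),
                               int (*default_minhop)(void))
{
    if (engine && engine(ctx) == 0)
        return push_tables();
    return default_minhop();
}

/* Stubs for demonstration: return values mark which path was taken. */
static int demo_engine_ok(void *ctx)   { (void)ctx; return 0;  }
static int demo_engine_fail(void *ctx) { (void)ctx; return -1; }
static int demo_push(void)             { return 1; }
static int demo_minhop(void)           { return 2; }
```

The fallback keeps the SM functional even when a dump file is missing or unparsable, at the cost of silently routing differently than the operator asked for.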
From eitan at mellanox.co.il Tue Jun 13 05:03:31 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 15:03:31 +0300 Subject: [openib-general] [PATCH 1/4] Simplification of the ucast fdb dumps. In-Reply-To: <20060611003238.22430.62423.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003238.22430.62423.stgit@sashak.voltaire.com> Message-ID: <448EA993.6010000@mellanox.co.il> Hi Sasha, I still need to see if there are no real problematic changes in the osm.fdbs file syntax (need to update ibdm to support those) but I like the patch and the clean way you resolved the multiple opens of the dump file. EZ From eitan at mellanox.co.il Tue Jun 13 05:39:15 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 13 Jun 2006 15:39:15 +0300 Subject: [openib-general] [PATCH] osm: Provide SUBNET UP message every heavy sweep - resend Message-ID: <86pshdgrgs.fsf@mtl066.yok.mtl.com> Hi Hal Sorry about the previous patch - I got the } else { in it. This trivial patch provides a "SUBNET UP" message (with level INFO) every time the SM completes a full heavy sweep. It is most useful for cases where you want to make sure the SM responded to some change in the fabric. Also used to sync the various test flows to the end of sweeps. 
Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_state_mgr.c =================================================================== --- opensm/osm_state_mgr.c (revision 7904) +++ opensm/osm_state_mgr.c (working copy) @@ -200,6 +200,10 @@ __osm_state_mgr_up_msg( /* clear the signal */ p_mgr->p_subn->moved_to_master_state = FALSE; } + else + { + osm_log( p_mgr->p_log, OSM_LOG_INFO, "SUBNET UP\n" ); /* Format Waived */ + } if( p_mgr->p_subn->opt.sweep_interval ) { From eitan at mellanox.co.il Tue Jun 13 05:54:15 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 13 Jun 2006 15:54:15 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy Message-ID: <86odwxgqrs.fsf@mtl066.yok.mtl.com> Hi Hal This is a second take after debug and cleanup of the partition manager patch I have previously provided. The functionality is the same but this one is after 2 days of testing on the simulator. I also did some code restructuring for clarity. Tests passed were both dedicated pkey enforcements (pkey.*) and stress test (osmStress.*). As I started to test the partition manager code (using ibmgtsim pkey test), I realized the implementation does not really enforce the partition policy on the given fabric. This patch fixes that. It was verified using the simulation test. Several other corner cases were fixed too. Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 7867) +++ include/opensm/osm_port.h (working copy) @@ -586,6 +586,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. 
+* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. + */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. +* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 7867) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; + uint16_t max_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used during pkey mgr algorithm only +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* +* max_blocks +* The maximal number of blocks this partition table might hold +* this value is based on node_info (for port 0 or CA) or switch_info +* updated on receiving the node_info or switch_info GetResp +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This objects stores temporary information on pkeys their target block and index +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct 
_osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous table extracted from the device +* +* index +* The index of the pky within the block +* +* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * osm_pkey_tbl_construct @@ -209,8 +257,8 @@ osm_pkey_tbl_get_num_blocks( static inline ib_pkey_table_t *osm_pkey_tbl_block_get( const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) { - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); }; /* * p_pkey_tbl @@ -244,6 +292,106 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ + +/****f* OpenSM: osm_pkey_tbl_make_block_pair +* NAME +* osm_pkey_tbl_make_block_pair +* +* DESCRIPTION +* Find or create a pair of "old" and "new" blocks for the +* given block index +* +* SYNOPSIS +*/ +int osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pp_old_block +* [out] Pointer to the old block pointer arg +* +* pp_new_block +* [out] Pointer to the new block pointer arg +* +* RETURN VALUES +* 0 if OK 1 if failed +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_set_new_entry +* NAME +* osm_pkey_tbl_set_new_entry +* +* DESCRIPTION +* stores the given pkey in the "new" blocks array and update +* the "map" to show that on the "old" blocks +* +* SYNOPSIS +*/ +int +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN 
uint16_t pkey); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pkey_idx +* [in] The index within the block +* +* pkey +* [in] PKey to store +* +* RETURN VALUES +* 0 if OK 1 if failed +* +*********/ + +/****f* OpenSM: osm_pkey_find_next_free_entry +* NAME +* osm_pkey_find_next_free_entry +* +* DESCRIPTION +* Find the next free entry in the PKey table. Starting at the given +* index and block number. The user should increment pkey_idx before +* next call +* Inspect the "new" blocks array for empty space. +* +* SYNOPSIS +*/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* p_block_idx +* [out] The block index to use +* +* p_pkey_idx +* [out] The index within the block to use +* +* RETURN VALUES +* TRUE if found FALSE if did not find +* +*********/ + /****f* OpenSM: osm_pkey_tbl_sync_new_blocks * NAME * osm_pkey_tbl_sync_new_blocks @@ -263,9 +411,44 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* set the block index and pkey index the given +* pkey is found in. return 1 if cound not find +* it, 0 if OK +* +* SYNOPSIS +*/ +int +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. 
+* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 7904) +++ opensm/osm_pkey.c (working copy) @@ -100,6 +100,9 @@ int osm_pkey_tbl_init( cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + p_pkey_tbl->used_blocks = 0; + p_pkey_tbl->max_blocks = 0; return(IB_SUCCESS); } @@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); if ( b < new_blocks ) p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); - else { + else + { p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, + b, p_new_block); + } + memset(p_new_block, 0, sizeof(*p_new_block)); - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); +} + +/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } @@ -202,6 +220,138 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ +int osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t 
*p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block) +{ + if (block_idx >= p_pkey_tbl->max_blocks) return 1; + + if (pp_old_block) + { + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); + if (! *pp_old_block) + { + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_old_block) return 1; + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); + } + } + + if (pp_new_block) + { + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); + if (! *pp_new_block) + { + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_new_block) return 1; + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); + } + } + return 0; +} + +/********************************************************************** + **********************************************************************/ +/* + store the given pkey in the "new" blocks array and update the "map" + to show that on the "old" blocks +*/ +int +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey) +{ + ib_pkey_table_t *p_old_block; + ib_pkey_table_t *p_new_block; + + if (osm_pkey_tbl_make_block_pair( + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) + return 1; + + cl_map_insert( &p_pkey_tbl->keys, + ib_pkey_get_base(pkey), + &(p_old_block->pkey_entry[pkey_idx])); + p_new_block->pkey_entry[pkey_idx] = pkey; + if (p_pkey_tbl->used_blocks < block_idx) + p_pkey_tbl->used_blocks = block_idx; + + return 0; +} + +/********************************************************************** + **********************************************************************/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx) +{ + 
ib_pkey_table_t *p_new_block; + + CL_ASSERT(p_block_idx); + CL_ASSERT(p_pkey_idx); + + while ( *p_block_idx < p_pkey_tbl->max_blocks) + { + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) + { + *p_pkey_idx = 0; + (*p_block_idx)++; + if (*p_block_idx >= p_pkey_tbl->max_blocks) + return FALSE; + } + + p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); + + if ( !p_new_block || + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) + return TRUE; + else + (*p_pkey_idx)++; + } + return FALSE; +} + +/********************************************************************** + **********************************************************************/ +int +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_index != NULL ); + + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( block && ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return 0; + } + } + return 1; +} + +/********************************************************************** + **********************************************************************/ static boolean_t __osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -305,7 +455,8 @@ osm_physp_share_pkey( if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) return TRUE; - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); + return + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); }
/********************************************************************** @@ -321,7 +472,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -329,7 +481,8 @@ osm_port_share_pkey( p_physp1 = osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 7904) +++ opensm/osm_pkey_mgr.c (working copy) @@ -62,6 +62,139 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. 
+ */ +static void pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + if (! p_pkey_tbl) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0501: " + "No pkey table found for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey || + (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) )) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index)) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + 
cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + **********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = full ? + &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -114,7 +247,8 @@ pkey_mgr_enforce_partition( p_pi->state_info2 = 0; ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.node_guid = + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pi_context.set_method = TRUE; context.pi_context.update_master_sm_base_lid = FALSE; @@ -131,80 +265,132 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block; + osm_pkey_tbl_t *p_pkey_tbl; uint16_t block_index; + uint8_t pkey_index; + uint16_t last_free_block_index = 0; + uint16_t last_free_pkey_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + osm_log( p_log, OSM_LOG_INFO, + "pkey_mgr_update_port: " + "Max number of blocks reduced from %u to %u " + "for node 0x%016" PRIx64 " port %u\n", + p_pkey_tbl->max_blocks, max_num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } + p_pkey_tbl->max_blocks = max_num_of_blocks; + + osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); + 
cl_map_remove_all( &p_pkey_tbl->keys ); + p_pkey_tbl->used_blocks = 0; + + /* + process every pending pkey in order - + first must be "updated" last are "new" + */ + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) + { + if (p_pending->is_new == FALSE) + { + block_index = p_pending->block; + pkey_index = p_pending->index; + found = TRUE; + } + else { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + found = osm_pkey_find_next_free_entry(p_pkey_tbl, + &last_free_block_index, + &last_free_pkey_index); + if ( !found ) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to find empty space for new pkey 0x%04x " + "of node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + block_index = last_free_block_index; + pkey_index = last_free_pkey_index++; } } + + if (found) + { + if (osm_pkey_tbl_set_new_entry( + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0505: " + "failed to set PKey 0x%04x in block %u idx %u " + "of node 0x%016" PRIx64 " port %u\n", + p_pending->pkey, block_index, pkey_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } - else if ( *p_orig_pkey != pkey ) - { + } + + free( p_pending ); + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } + + /* now look for changes and store */ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - /* we need real block (not just new_block) in order - * to resolve block/pkey 
indices */ block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; - } - } - } + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + if (block && + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) + continue; + + status = pkey_mgr_update_pkey_entry( + p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +403,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( peer ) ) return 
FALSE; p_node = osm_physp_get_node_ptr( peer ); @@ -245,7 +433,7 @@ pkey_mgr_update_peer_port( if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + return FALSE; + } - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) { + osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block); status = pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); if ( status == IB_SUCCESS ) ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " 
"pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,10 +482,10 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) + if ( (ret_val == TRUE) && + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_log, OSM_LOG_VERBOSE, + osm_log( p_log, OSM_LOG_DEBUG, "pkey_mgr_update_peer_port: " "pkey table was updated for node 0x%016" PRIx64 " port %u\n", @@ -298,82 +498,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - 
-/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,8 +507,7 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; - + osm_node_t *p_node; CL_ASSERT( p_osm ); OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); @@ -394,32 +517,25 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = 
&p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,8 +544,10 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + p_node = osm_port_get_parent_node( p_port ); + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + pkey_mgr_update_peer_port( + &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) signal = OSM_SIGNAL_DONE_PENDING;

From tziporet at mellanox.co.il Tue Jun 13 06:07:33 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 13 Jun 2006 16:07:33 +0300
Subject: [openib-general] OFED 1.0 release schedule
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71E5@mtlexch01.mtl.com>

Hi All,

After reading the mail thread regarding the OFED release I have decided this:
We upload OFED-1.0-pre1.tgz to
https://openib.org/svn/gen2/branches/1.0/ofed/releases/
We checked that all modules compile and load on this build (including ipath
and uDAPL).
The only missing parts of this release from the final release are the
documents, and the scripts rpm that Scott requested.

I think testing this version for 3 days (Tuesday, Wednesday and Thursday)
should be enough, as Scott wrote. So - we can do the official OFED 1.0
release on Friday 16-June.

Matt - please check with Novell if this date is acceptable to them. If not,
then the earliest we can do the release is Thursday 15-June.

Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Tue Jun 13 06:09:11 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jun 2006 09:09:11 -0400
Subject: [openib-general] [PATCH] osm: Provide SUBNET UP message every heavy sweep - resend
In-Reply-To: <86pshdgrgs.fsf@mtl066.yok.mtl.com>
References: <86pshdgrgs.fsf@mtl066.yok.mtl.com>
Message-ID: <1150203379.570.144617.camel@hal.voltaire.com>

Hi Eitan,

On Tue, 2006-06-13 at 08:39, Eitan Zahavi wrote:
> Hi Hal
>
> Sorry about the previous patch - I got the } else { in it.
>
> This trivial patch provides a "SUBNET UP" message (with level INFO)
> every time the SM completes a full heavy sweep. It is most useful for
> cases where you want to make sure the SM responded to some change in
> the fabric. Also used to sync the various test flows to the end of sweeps.

I already had fixed this prior to committing it. I thought that was easier
than "going 'round the block" on it.
> Eitan
>
> Signed-off-by: Eitan Zahavi
>
> Index: opensm/osm_state_mgr.c
> ===================================================================
> --- opensm/osm_state_mgr.c (revision 7904)
> +++ opensm/osm_state_mgr.c (working copy)
> @@ -200,6 +200,10 @@ __osm_state_mgr_up_msg(
> /* clear the signal */
> p_mgr->p_subn->moved_to_master_state = FALSE;
> }
> + else
> + {
> + osm_log( p_mgr->p_log, OSM_LOG_INFO, "SUBNET UP\n" ); /* Format Waived */
> + }

Even if tab is supposed to be the convention, spaces are used in most OpenSM
modules, and I have been trying to keep to the convention used in the
particular module.

-- Hal

> if( p_mgr->p_subn->opt.sweep_interval )
> {
>

From halr at voltaire.com Tue Jun 13 06:17:33 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jun 2006 09:17:33 -0400
Subject: [openib-general] [PATCH] osm: partition manager force policy
In-Reply-To: <86odwxgqrs.fsf@mtl066.yok.mtl.com>
References: <86odwxgqrs.fsf@mtl066.yok.mtl.com>
Message-ID: <1150204529.570.145313.camel@hal.voltaire.com>

Hi Eitan,

On Tue, 2006-06-13 at 08:54, Eitan Zahavi wrote:
> --text follows this line--
> Hi Hal
>
> This is a second take after debug and cleanup of the partition manager
> patch I have previously provided.

Thanks.

So this patch supersedes the previous version ? If so, in the future,
just indicate [PATCHv2] for this.

> The functionality is the same but
> this one is after 2 days of testing on the simulator.

Are you still working on this (more testing) ?

> I also did some code restructuring for clarity.
> Tests passed were both dedicated pkey enforcements (pkey.*) and
> stress test (osmStress.*)
>
> As I started to test the partition manager code (using ibmgtsim pkey test),
> I realized the implementation does not really enforce the partition policy
> on the given fabric. This patch fixes that. It was verified using the
> simulation test. Several other corner cases were fixed too.

Can you elaborate on these cases ?
-- Hal From bpradip at in.ibm.com Tue Jun 13 06:47:51 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 13 Jun 2006 19:17:51 +0530 Subject: [openib-general] [PATCH] libamso: fix erroneous return and memory leak in verbs.c Message-ID: <20060613134743.GA17393@harry-potter.ibm.com> Hi, This patch fixes an erroneous return in amso_create_cq() and a memory leak in amso_create_qp(). --- Index: libamso/verbs.c ============================================================================ --- verbs.org 2006-06-13 18:56:50.000000000 +0530 +++ verbs.c 2006-06-13 19:02:03.000000000 +0530 @@ -154,9 +154,8 @@ struct ibv_cq *amso_create_cq(struct ibv int ret; cq = malloc(sizeof *cq); - if (!cq) { - goto err; - } + if (!cq) + return NULL; ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, @@ -248,14 +247,15 @@ struct ibv_qp *amso_create_qp(struct ibv ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, &resp.ibv_resp, sizeof resp); if (ret) - return NULL; + goto err; #if 0 /* A reminder for bypass functionality */ qp->physaddr = resp.physaddr; #endif return &qp->ibv_qp; - +err: + free(qp); return NULL; } From mst at mellanox.co.il Tue Jun 13 07:19:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 17:19:55 +0300 Subject: [openib-general] [PATCH updated] libmthca: memfree completion with error Message-ID: <20060613141955.GX19518@mellanox.co.il> Same thing for userspace. --- Fix up completion with error for memfree. Signed-off-by: Michael S.
Tsirkin Index: openib/src/userspace/libmthca/src/cq.c =================================================================== --- openib/src/userspace/libmthca/src/cq.c (revision 7890) +++ openib/src/userspace/libmthca/src/cq.c (working copy) @@ -347,8 +347,17 @@ wc->wr_id = srq->wrid[wqe_index]; mthca_free_srq_wqe(srq, wqe_index); } else { + int32_t wqe; wq = &(*cur_qp)->rq; - wqe_index = ntohl(cqe->wqe) >> wq->wqe_shift; + wqe = ntohl(cqe->wqe); + wqe_index = wqe >> wq->wqe_shift; + /* + * WQE addr == base - 1 might be reported in receive completion + * with error instead of (rq size - 1) by Sinai FW 1.0.800, + * Arbel FW 5.1.400 and should be fixed in later revisions. + */ + if (wqe_index < 0) + wqe_index = wq->max - 1; wc->wr_id = (*cur_qp)->wrid[wqe_index]; } -- MST From mst at mellanox.co.il Tue Jun 13 07:19:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 17:19:42 +0300 Subject: [openib-general] [PATCH updated] mthca: memfree completion with error workaround In-Reply-To: <20060612121635.GX7359@mellanox.co.il> References: <20060612121635.GX7359@mellanox.co.il> Message-ID: <20060613141942.GW19518@mellanox.co.il> OK, here's an optimized version of the fix. With this, I see: before 5994 0 0 5994 176a drivers/infiniband/hw/mthca/mthca_cq.o after 5995 0 0 5995 176b drivers/infiniband/hw/mthca/mthca_cq.o So the cost is minimal. Please consider for 2.6.17. --- Memfree firmware is in rare cases reporting WQE index == base - 1 in receive completion with error instead of (rq size - 1); base is 0 in mthca. Here is a patch to avoid kernel crash and report a correct WR id in this case. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.16/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- linux-2.6.16.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2006-05-16 12:33:05.000000000 +0300 +++ linux-2.6.16/drivers/infiniband/hw/mthca/mthca_cq.c 2006-06-13 12:14:13.000000000 +0300 @@ -540,8 +540,17 @@ static inline int mthca_poll_one(struct entry->wr_id = srq->wrid[wqe_index]; mthca_free_srq_wqe(srq, wqe); } else { + s32 wqe; wq = &(*cur_qp)->rq; - wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + wqe = be32_to_cpu(cqe->wqe); + wqe_index = wqe >> wq->wqe_shift; + /* + * WQE addr == base - 1 might be reported in receive completion + * with error instead of (rq size - 1) by Sinai FW 1.0.800, + * Arbel FW 5.1.400 and should be fixed in later revisions. + */ + if (unlikely(wqe_index < 0)) + wqe_index = wq->max - 1; entry->wr_id = (*cur_qp)->wrid[wqe_index]; } -- MST From eitan at mellanox.co.il Tue Jun 13 07:21:24 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 17:21:24 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <1150204529.570.145313.camel@hal.voltaire.com> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <1150204529.570.145313.camel@hal.voltaire.com> Message-ID: <448EC9E4.3020409@mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Eitan, > > On Tue, 2006-06-13 at 08:54, Eitan Zahavi wrote: > >>--text follows this line-- >>Hi Hal >> >>This is a second take after debug and cleanup of the partition manager >>patch I have previously provided. > > > Thanks. > > So this patch superceeds the previous version ? If so, in the future, > just indicate [PATCHv2] for this. > > >> The functionality is the same but >>this one is after 2 days of testing on the simulator. > > > Are you still working on this (more testing) ? > > >>I also did some code restructuring for clarity. 
>
>
>>Tests passed were both dedicated pkey enforcements (pkey.*) and
>>stress test (osmStress.*)
>>
>>As I started to test the partition manager code (using ibmgtsim pkey test),
>>I realized the implementation does not really enforce the partition policy
>>on the given fabric. This patch fixes that. It was verified using the
>>simulation test. Several other corner cases were fixed too.
>
> Can you elaborate on these cases ?

If you ask about the corner cases:

1. A bug in avoiding switch enforcement when the HCA had more pkey table blocks than the switch.
2. Similar, but when the HCA blocks are unused, so the switch does not actually need that many blocks.
3. Segfaults due to fabric instability.

If you ask about the test code, it is checked in
https://openib.org/svn/gen2/utils/src/linux-user/ibmgtsim/tests
The file names start with pkey.* and osmStress.*.

In general the pkey test does:
* Randomize 3 pkeys p1 p2 p3 (the first 2 are full, 1 is partial)
* Assign ports into 3 groups: G1 which uses p1, G2 which uses p2, and G3 which uses p1, p2 and p3
* For each HCA port, randomize pkey tables with a random number of entries (including the ones above, at random locations)
* For some ports, override the tables with an incorrect set
* Write a partition policy file
* Start the SM, wait for subnet up
* Randomly select HCA ports and verify (using osmtest -f c) that the all-to-all path records they see are limited by the partitions they belong to
* Forcefully null all default pkey entries on the fabric ports
* Set a change bit on a switch to force a sweep
* Wait for subnet up and check that all ports have the correct default pkey set

The stress test does:
* Set up LIDs
* Force some random LID violations (duplicated, misaligned, zero)
* Write a guid2lid file with some random changes
* Disconnect some random nodes
* Run OpenSM, wait for subnet up
* Repeat 10 times: reconnect all nodes, then disconnect some random nodes
* Wait for subnet up
* Check all LID values are correct (according to guid2lid)
* Start 240 iterations of selecting one of the following:
  connect random port,
  disconnect random port,
  register random service,
  query random paths from random nodes,
  join random port to 0xC000,
  leave random port from 0xC000
* Eventually: connect all nodes, join 0xC000 from all HCA ports, wait for subnet up, check connectivity and FDB validity etc. using ibdiagnet

From ishai at mellanox.co.il Tue Jun 13 07:57:47 2006
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Tue, 13 Jun 2006 17:57:47 +0300
Subject: [openib-general] [PATCH] SRP: Avoid a potential race on target->req_queue
Message-ID: <20060613145747.GA18628@mellanox.co.il>

Hi Roland,

There is a potential race between srp_reconnect_target and srp_reset_device when they access target->req_queue. These functions can execute at the same time because srp_reconnect_target is called from srp_reconnect_work, which is scheduled by srp_completion, while srp_reset_device is called from the scsi layer.

The race arises because srp_reconnect_target does not hold host_lock while accessing target->req_queue. It assumes that since the state is CONNECTING no other function will access target->req_queue (and this is the case with srp_reset_host, for example).

There are two possible solutions:
1) Change srp_reset_device: after locking host_lock, it checks the state and executes the loop that accesses target->req_queue only if the state is LIVE.
2) Change srp_reconnect_target: it locks host_lock before executing the loop that accesses target->req_queue and releases it after the loop.

I'm sending a patch for the second solution. If you prefer the first, I have another patch for it (it is a bit longer). Which solution do you like better?
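The second solution (the one the patch below takes) amounts to holding the lock across the whole queue traversal rather than relying on the CONNECTING state for exclusion. A hedged userspace sketch of that pattern, with a pthread mutex standing in for the kernel's host_lock and a toy singly linked list standing in for target->req_queue (names and types are illustrative, not the SRP driver's):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct req { struct req *next; int id; };

static pthread_mutex_t host_lock = PTHREAD_MUTEX_INITIALIZER;
static struct req *req_queue;      /* stands in for target->req_queue */

static void enqueue_req(struct req *r)
{
    pthread_mutex_lock(&host_lock);
    r->next = req_queue;
    req_queue = r;
    pthread_mutex_unlock(&host_lock);
}

/* Analogue of the fixed srp_reconnect_target loop: the lock is held for
 * the entire walk, so a concurrent srp_reset_device-style caller cannot
 * observe or modify the queue mid-traversal. Returns number drained. */
static int drain_req_queue(void)
{
    int n = 0;

    pthread_mutex_lock(&host_lock);
    while (req_queue) {
        struct req *r = req_queue;
        req_queue = r->next;
        n++;                       /* real code would fail r back to SCSI */
    }
    pthread_mutex_unlock(&host_lock);
    return n;
}
```

Taking the lock once around the loop (rather than per element) is what makes the drain atomic with respect to the other path; it mirrors the `spin_lock_irq`/`spin_unlock_irq` pair added in the patch.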
Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-13 02:24:22.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-13 02:26:07.000000000 +0300 @@ -641,8 +641,10 @@ static int srp_reconnect_target(struct s while (ib_poll_cq(target->cq, 1, &wc) > 0) ; /* nothing */ + spin_lock_irq(target->scsi_host->host_lock); list_for_each_entry_safe(req, tmp, &target->req_queue, list) srp_reset_req(target, req); + spin_unlock_irq(target->scsi_host->host_lock); target->rx_head = 0; target->tx_head = 0; -- Ishai Rabinovitz From viswa.krish at gmail.com Tue Jun 13 09:21:03 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Tue, 13 Jun 2006 09:21:03 -0700 Subject: [openib-general] opensm and NPTL In-Reply-To: <1150193732.570.138496.camel@hal.voltaire.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> <1150193732.570.138496.camel@hal.voltaire.com> Message-ID: <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> Yes.. I want to test waters again and see if the issues went away. -Viswa On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock wrote: > > Hi Viswa, > > On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > > There were some issues with opensm running with NPTL (thread > > library). Has the issues been resolved ? > > There were some fixes to the signal handling which went in back in the > Feb/early March time frame. OpenSM should be better with NPTL now. Is it > working for you or are you asking before stepping into these waters > again ? 
> > -- Hal > > > Regards, > > Viswa > > > > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From faulkner at opengridcomputing.com Tue Jun 13 09:24:07 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Tue, 13 Jun 2006 11:24:07 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c Message-ID: <200606131124.08110.faulkner@opengridcomputing.com> This patch resolves a race condition between the receipt of a connection established event and a receive completion from the client. The server no longer goes to connected state but merely waits for the READ_ADV state to begin its looping. This keeps the server from going back to CONNECTED from the later states if the connection established event comes in after the receive completion (i.e. the loop starts). Signed-off-by: Boyd Faulkner Index: rping.c =================================================================== --- rping.c (revision 7960) +++ rping.c (working copy) @@ -182,7 +182,13 @@ case RDMA_CM_EVENT_ESTABLISHED: DEBUG_LOG("ESTABLISHED\n"); - cb->state = CONNECTED; + + /* + * Server will wake up when first RECV completes. + */ + if (!cb->server) { + cb->state = CONNECTED; + } sem_post(&cb->sem); break; @@ -197,7 +203,7 @@ break; case RDMA_CM_EVENT_DISCONNECTED: - fprintf(stderr, "DISCONNECT EVENT...\n"); + fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? 
"server" : "client"); sem_post(&cb->sem); break; @@ -225,7 +231,7 @@ DEBUG_LOG("Received rkey %x addr %" PRIx64 "len %d from peer\n", cb->remote_rkey, cb->remote_addr, cb->remote_len); - if (cb->state == CONNECTED || cb->state == RDMA_WRITE_COMPLETE) + if (cb->state <= CONNECTED || cb->state == RDMA_WRITE_COMPLETE) cb->state = RDMA_READ_ADV; else cb->state = RDMA_WRITE_ADV; -- Boyd R. Faulkner Open Grid Computing, Inc. Phone: 512-343-9196 x109 Fax: 512-343-5450 From halr at voltaire.com Tue Jun 13 09:35:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 12:35:17 -0400 Subject: [openib-general] opensm and NPTL In-Reply-To: <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> <1150193732.570.138496.camel@hal.voltaire.com> <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> Message-ID: <1150216346.570.152323.camel@hal.voltaire.com> On Tue, 2006-06-13 at 12:21, Viswanath Krishnamurthy wrote: > Yes.. I want to test waters again and see if the issues went away. Are you using the trunk or 1.0 ? -- Hal > -Viswa > > > On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock > wrote: > Hi Viswa, > > On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > > There were some issues with opensm running with > NPTL (thread > > library). Has the issues been resolved ? > > There were some fixes to the signal handling which went in > back in the > Feb/early March time frame. OpenSM should be better with NPTL > now. Is it > working for you or are you asking before stepping into these > waters > again ? 
> > -- Hal > > > Regards, > > Viswa > > > > > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Tue Jun 13 09:38:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 12:38:33 -0400 Subject: [openib-general] [PATCH] osmtest: Add test for non base LID SA PortInfoRecord request when LMC > 0 Message-ID: <1150216507.570.152411.camel@hal.voltaire.com> osmtest: Add test for non base LID SA PortInfoRecord request when LMC > 0 Signed-off-by: Hal Rosenstock Index: osmtest/osmtest.c =================================================================== --- osmtest/osmtest.c (revision 7961) +++ osmtest/osmtest.c (working copy) @@ -1613,6 +1613,7 @@ osmtest_stress_port_recs_small( IN osmte **********************************************************************/ ib_api_status_t osmtest_get_local_port_lmc( IN osmtest_t * const p_osmt, + IN ib_net16_t lid, OUT uint8_t * const p_lmc ) { osmtest_req_context_t context; @@ -1629,7 +1630,7 @@ osmtest_get_local_port_lmc( IN osmtest_t * Do a blocking query for our own PortRecord in the subnet. 
*/ status = osmtest_get_port_rec( p_osmt, - cl_ntoh16(p_osmt->local_port.lid), + cl_ntoh16( lid ), &context ); if( status != IB_SUCCESS ) @@ -3181,7 +3182,7 @@ osmtest_validate_path_data( IN osmtest_t cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ) ); } - status = osmtest_get_local_port_lmc( p_osmt, &lmc ); + status = osmtest_get_local_port_lmc( p_osmt, p_osmt->local_port.lid, &lmc ); /* HACK: Assume uniform LMC across endports in the subnet */ /* In absence of this assumption, validation of this is much more complicated */ @@ -4885,10 +4886,13 @@ static ib_api_status_t osmtest_validate_against_db( IN osmtest_t * const p_osmt ) { ib_api_status_t status = IB_SUCCESS; -#if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) +#ifdef VENDOR_RMPP_SUPPORT + uint8_t lmc; +#ifdef DUAL_SIDED_RMPP osmtest_req_context_t context; osmv_multipath_req_t request; #endif +#endif OSM_LOG_ENTER( &p_osmt->log, osmtest_validate_against_db ); @@ -4999,6 +5003,18 @@ osmtest_validate_against_db( IN osmtest_ if( status != IB_SUCCESS ) goto Exit; + /* If LMC > 0, test non base LID SA PortInfoRecord request */ + status = osmtest_get_local_port_lmc( p_osmt, p_osmt->local_port.lid, &lmc ); + if ( status != IB_SUCCESS ) + goto Exit; + + if (lmc != 0) + { + status = osmtest_get_local_port_lmc( p_osmt, p_osmt->local_port.lid + 1, NULL); + if ( status != IB_SUCCESS ) + goto Exit; + } + if (! 
p_osmt->opt.ignore_path_records) { status = osmtest_validate_all_path_recs( p_osmt ); From halr at voltaire.com Tue Jun 13 09:42:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 12:42:19 -0400 Subject: [openib-general] [PATCH] OpenSM/SA: Properly handle non base LID requests to some SA records Message-ID: <1150216933.570.152671.camel@hal.voltaire.com> OpenSM/SA: Properly handle non base LID requests to some SA records In osm_sa_node_record.c and osm_sa_portinfo_record.c, properly handle non base LID requests per C15-0.1.11: Query responses shall contain a port's base LID in any LID component of a RID. So when LMC is non 0, the only records that appear are those with the base LID and not with any masked LIDs. Furthermore, if a query comes in on a non base LID, the LID in the RID returned is only with the base LID. Also, fixed some endian issues in osm_log messages. Note: Similar patch for other affected SA records will follow. Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_node_record.c =================================================================== --- opensm/osm_sa_node_record.c (revision 7961) +++ opensm/osm_sa_node_record.c (working copy) @@ -200,12 +200,11 @@ __osm_nr_rcv_create_nr( uint8_t port_num; uint8_t num_ports; uint16_t match_lid_ho; - uint16_t lid_ho; + ib_net16_t base_lid; ib_net16_t base_lid_ho; ib_net16_t max_lid_ho; uint8_t lmc; ib_net64_t port_guid; - ib_api_status_t status; OSM_LOG_ENTER( p_rcv->p_log, __osm_nr_rcv_create_nr ); @@ -245,7 +244,8 @@ __osm_nr_rcv_create_nr( if( match_port_guid && ( port_guid != match_port_guid ) ) continue; - base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); + base_lid = osm_physp_get_base_lid( p_physp ); + base_lid_ho = cl_ntoh16( base_lid ); lmc = osm_physp_get_lmc( p_physp ); max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); match_lid_ho = cl_ntoh16( match_lid ); @@ -260,29 +260,18 @@ __osm_nr_rcv_create_nr( osm_log( p_rcv->p_log, OSM_LOG_DEBUG, 
"__osm_nr_rcv_create_nr: " "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", - cl_ntoh16( base_lid_ho ), - cl_ntoh16( match_lid_ho ), - cl_ntoh16( max_lid_ho ) + base_lid_ho, match_lid_ho, max_lid_ho ); } if( (match_lid_ho <= max_lid_ho) && (match_lid_ho >= base_lid_ho) ) { - __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, match_lid ); + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); } } else { - /* - For every lid value create a Node Record. - */ - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) - { - status = __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, - port_guid, cl_hton16( lid_ho ) ); - if( status != IB_SUCCESS ) - break; - } + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); } } Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 7961) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -194,9 +194,9 @@ __osm_sa_pir_create( IN osm_pir_search_ctxt_t* const p_ctxt ) { uint8_t lmc; - uint16_t lid_ho; uint16_t max_lid_ho; uint16_t base_lid_ho; + uint16_t match_lid_ho; OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_pir_create ); @@ -218,17 +218,28 @@ __osm_sa_pir_create( if( p_ctxt->comp_mask & IB_PIR_COMPMASK_LID ) { - __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, - p_ctxt->p_rcvd_rec->lid ); - } - else - { - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) + match_lid_ho = cl_ntoh16( p_ctxt->p_rcvd_rec->lid ); + + /* + We validate that the lid belongs to this node. 
+ */ + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, - cl_hton16( lid_ho ) ); + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sa_pir_create: " + "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", + base_lid_ho, match_lid_ho, max_lid_ho + ); } + + if ( match_lid_ho < base_lid_ho || match_lid_ho > max_lid_ho ) + goto Exit; } + + __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, + cl_hton16( base_lid_ho ) ); + + Exit: OSM_LOG_EXIT( p_rcv->p_log ); } From viswa.krish at gmail.com Tue Jun 13 09:56:08 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Tue, 13 Jun 2006 09:56:08 -0700 Subject: [openib-general] opensm and NPTL In-Reply-To: <1150216346.570.152323.camel@hal.voltaire.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> <1150193732.570.138496.camel@hal.voltaire.com> <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> <1150216346.570.152323.camel@hal.voltaire.com> Message-ID: <4df28be40606130956v4f945921ncbd13f2b6d0ff517@mail.gmail.com> I am using the trunk. Should I be using 1.0 ? -Viswa On 13 Jun 2006 12:35:17 -0400, Hal Rosenstock wrote: > > On Tue, 2006-06-13 at 12:21, Viswanath Krishnamurthy wrote: > > Yes.. I want to test waters again and see if the issues went away. > > Are you using the trunk or 1.0 ? > > -- Hal > > > -Viswa > > > > > > On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock > > wrote: > > Hi Viswa, > > > > On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > > > There were some issues with opensm running with > > NPTL (thread > > > library). Has the issues been resolved ? > > > > There were some fixes to the signal handling which went in > > back in the > > Feb/early March time frame. OpenSM should be better with NPTL > > now. Is it > > working for you or are you asking before stepping into these > > waters > > again ? 
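Stepping back to the LMC arithmetic used in both the osmtest and SA patches above: with LMC = l, a port answers to 2^l consecutive LIDs starting at its base LID, and per C15-0.1.11 any RID in a query response must carry only the base LID. A small sketch of the range check (standalone helper names, host byte order throughout; not OpenSM's actual functions):

```c
#include <assert.h>
#include <stdint.h>

/* max_lid_ho = base_lid_ho + (1 << lmc) - 1, the same expression the
 * patches above compute. */
static uint16_t max_lid_ho(uint16_t base_lid_ho, uint8_t lmc)
{
    return (uint16_t)(base_lid_ho + (1u << lmc) - 1);
}

/* A requested LID matches a port iff it falls in [base, max]; the
 * record returned still carries only the base LID. */
static int lid_matches(uint16_t match_lid_ho, uint16_t base_lid_ho,
                       uint8_t lmc)
{
    return match_lid_ho >= base_lid_ho &&
           match_lid_ho <= max_lid_ho(base_lid_ho, lmc);
}
```

This is also why the old per-LID loops could be dropped: instead of emitting one record per masked LID, a single record with the base LID is emitted whenever the requested LID lands anywhere in the range.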
> > > > -- Hal > > > > > Regards, > > > Viswa > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gjohnson at lanl.gov Tue Jun 13 10:02:46 2006 From: gjohnson at lanl.gov (Greg Johnson) Date: Tue, 13 Jun 2006 11:02:46 -0600 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060613170246.GH23320@durango.c3.lanl.gov> On Sun, Jun 11, 2006 at 03:27:58AM +0300, Sasha Khapyorsky wrote: > Hi, > > There are couple of unicast routing related patches for OpenSM. > > Basically it implements routing module which provides possibility to load > switch forwarding tables from pre-created dump file. Currently unicast > tables loading is only supported, multicast may be added in a future. > > Short patch descriptions (more details may be found in emails with > patches): > > 1. Ucast dump file simplification. > 2. Modular routing - preliminary implements generic model to plug new > routing engine to OpenSM. > 3. New simple unicast routing engine which allows to load LFTs from > pre-created dump file. > 4. Example of ucast dump generation script. > > Please comment and test. Thanks. We tried this on our 256-node cluster with a single chassis Voltaire 288-port switch. It seems to load the routes generated by the dump script, but afterward it is not possible to dump the routes again. I would like to re-dump the routes after loading to ensure that they were loaded correctly. 
After loading routes with "opensm -R file -U dump_file", dump_lfts.sh gives: nodeinfo 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ibroute: iberror: dump tables failed: node info failed: valid addr? for each switch. Also, I had to delete a space in the sed script on line 17 of dump_lfts.sh: sed -ne 's/^.* lid \([1-9a-f]*\) .*$/\1/p' became sed -ne 's/^.* lid \([1-9a-f]*\).*$/\1/p' Thanks for the work! Greg From halr at voltaire.com Tue Jun 13 10:06:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 13:06:43 -0400 Subject: [openib-general] opensm and NPTL In-Reply-To: <4df28be40606130956v4f945921ncbd13f2b6d0ff517@mail.gmail.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> <1150193732.570.138496.camel@hal.voltaire.com> <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> <1150216346.570.152323.camel@hal.voltaire.com> <4df28be40606130956v4f945921ncbd13f2b6d0ff517@mail.gmail.com> Message-ID: <1150218085.570.153354.camel@hal.voltaire.com> On Tue, 2006-06-13 at 12:56, Viswanath Krishnamurthy wrote: > I am using the trunk. Should I be using 1.0 ? No; I didn't check but if my memory serves me correctly, the trunk may have some fixes 1.0 doesn't towards this but I'm not 100% sure right now and since you are using the trunk, I'm not going to do my homework on whether that is really the case or my memory is just fuzzy on this. -- Hal > > -Viswa > > > On 13 Jun 2006 12:35:17 -0400, Hal Rosenstock > wrote: > On Tue, 2006-06-13 at 12:21, Viswanath Krishnamurthy wrote: > > Yes.. I want to test waters again and see if the issues went > away. > > Are you using the trunk or 1.0 ? 
> > -- Hal > > > -Viswa > > > > > > On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock > > > wrote: > > Hi Viswa, > > > > On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy > wrote: > > > There were some issues with opensm running with > > NPTL (thread > > > library). Has the issues been resolved ? > > > > There were some fixes to the signal handling which > went in > > back in the > > Feb/early March time frame. OpenSM should be better > with NPTL > > now. Is it > > working for you or are you asking before stepping > into these > > waters > > again ? > > > > -- Hal > > > > > Regards, > > > Viswa > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > From swise at opengridcomputing.com Tue Jun 13 10:25:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 12:25:52 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: <200606131124.08110.faulkner@opengridcomputing.com> References: <200606131124.08110.faulkner@opengridcomputing.com> Message-ID: <1150219552.17394.23.camel@stevo-desktop> Thanks, applied. iwarp branch: r7964 trunk: r7966 On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > This patch resolves a race condition between the receipt of > a connection established event and a receive completion from > the client. The server no longer goes to connected state but > merely waits for the READ_ADV state to begin its looping. This > keeps the server from going back to CONNECTED from the later > states if the connection established event comes in after the > receive completion (i.e. the loop starts). 
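The `cb->state <= CONNECTED` comparison in Boyd's rping patch works because the state constants are declared in connection-progress order, so any state at or before CONNECTED means the first RECV completion may legitimately advance the server to READ_ADV, whichever of the two racing events lands first. A hedged sketch (the enum names follow the patch context, but the exact set and values here are illustrative, not rping's real enum):

```c
#include <assert.h>

/* States in connection-progress order; only the relative order matters
 * for the "<=" test. Illustrative, not rping's exact declaration. */
enum rping_state {
    IDLE = 1,
    CONNECT_REQUEST,
    ADDR_RESOLVED,
    ROUTE_RESOLVED,
    CONNECTED,
    RDMA_READ_ADV,
    RDMA_WRITE_ADV,
    RDMA_WRITE_COMPLETE,
};

/* Analogue of the fixed receive-completion handler: whether the
 * ESTABLISHED event has already arrived (state == CONNECTED) or is
 * still pending (server: state below CONNECTED), the first RECV
 * moves the connection to RDMA_READ_ADV. */
static enum rping_state on_recv_complete(enum rping_state s)
{
    if (s <= CONNECTED || s == RDMA_WRITE_COMPLETE)
        return RDMA_READ_ADV;
    return RDMA_WRITE_ADV;
}
```

With the old `== CONNECTED` test, a RECV arriving before the ESTABLISHED event found the server in an earlier state and mis-routed it to RDMA_WRITE_ADV, which is the race the patch closes.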
> > Signed-off-by: Boyd Faulkner From swise at opengridcomputing.com Tue Jun 13 10:31:10 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 12:31:10 -0500 Subject: [openib-general] [PATCH] rping: Erroneous check for minumum ping buffer size In-Reply-To: <20060610173417.GA14280@harry-potter.ibm.com> References: <20060610173417.GA14280@harry-potter.ibm.com> Message-ID: <1150219870.17394.26.camel@stevo-desktop> Thanks. Committed under revisions: trunk: r7968 iwarp branch: r7969 Steve. On Sat, 2006-06-10 at 23:04 +0530, Pradipta Kumar Banerjee wrote: > This includes the changes suggested by Tom. > > Signed-off-by: Pradipta Kumar Banerjee > --- > > Index: rping.c > ================================================================= > --- rping.org 2006-06-09 10:57:43.000000000 +0530 > +++ rping.c.new 2006-06-10 22:48:53.000000000 +0530 > @@ -96,6 +96,15 @@ struct rping_rdma_info { > #define RPING_BUFSIZE 64*1024 > #define RPING_SQ_DEPTH 16 > > +/* Default string for print data and > + * minimum buffer size > + */ > +#define _stringify( _x ) # _x > +#define stringify( _x ) _stringify( _x ) > + > +#define RPING_MSG_FMT "rdma-ping-%d: " > +#define RPING_MIN_BUFSIZE sizeof(stringify(INT_MAX)) + sizeof(RPING_MSG_FMT) > + > /* > * Control block struct. > */ > @@ -774,7 +783,7 @@ static void rping_test_client(struct rpi > cb->state = RDMA_READ_ADV; > > /* Put some ascii text in the buffer. 
*/ > - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); > for (i = cc, c = start; i < cb->size; i++) { > cb->start_buf[i] = c; > c++; > @@ -977,11 +986,11 @@ int main(int argc, char *argv[]) > break; > case 'S': > cb->size = atoi(optarg); > - if ((cb->size < 1) || > + if ((cb->size < RPING_MIN_BUFSIZE) || > (cb->size > (RPING_BUFSIZE - 1))) { > fprintf(stderr, "Invalid size %d " > - "(valid range is 1 to %d)\n", > - cb->size, RPING_BUFSIZE); > + "(valid range is %d to %d)\n", > + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); > ret = EINVAL; > } else > DEBUG_LOG("size %d\n", (int) atoi(optarg)); > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Tue Jun 13 10:26:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 13:26:49 -0400 Subject: [openib-general] [PATCH 4/4] diags: ucast routing dump file generator example - dump_lfts.sh In-Reply-To: <20060611003245.22430.93904.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003245.22430.93904.stgit@sashak.voltaire.com> Message-ID: <1150219599.570.154302.camel@hal.voltaire.com> On Sat, 2006-06-10 at 20:32, Sasha Khapyorsky wrote: > New simple script - dump_lfts.sh, may be used for ucast dump file > generation. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From rdreier at cisco.com Tue Jun 13 10:55:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jun 2006 10:55:57 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: <20060613051149.GE4621@mellanox.co.il> (Michael S. 
Tsirkin's message of "Tue, 13 Jun 2006 08:11:49 +0300") References: <20060613051149.GE4621@mellanox.co.il> Message-ID: Michael> Won't this let the user issue multiple modify QP commands Michael> in parallel on the same QP? mthca at least does not Michael> protect against such attempts, and doing this will Michael> confuse the hardware. Hmm, that's a good point. But I did write the following in Documentation/infiniband/core_locking.txt: All of the methods in struct ib_device exported by a low-level driver must be fully reentrant. The low-level driver is required to perform all synchronization necessary to maintain consistency, even if multiple function calls using the same object are run simultaneously. The IB midlayer does not perform any serialization of function calls. So I guess this is a bug in mthca. I think modify_srq at least has the same problem. I'll audit this and fix it up in mthca. - R. From bpradip at in.ibm.com Tue Jun 13 10:55:07 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 13 Jun 2006 23:25:07 +0530 Subject: [openib-general] [PATCH resend] libamso: fix erroneous return and memory leak in verbs.c Message-ID: <20060613175457.GA8976@harry-potter.ibm.com> Forgot to add the 'Signed-off-by' This patch fixes an erroneous return in func amso_create_cq() and a memory leak in amso_create_qp(). 
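The create_qp leak described above is the classic argument for the goto-cleanup idiom the fix applies: once an allocation has happened, every failure exit must pass through a label that frees it, while failures before the allocation can return directly. A minimal sketch with dummy types (not libamso's real structures):

```c
#include <assert.h>
#include <stdlib.h>

struct dummy_qp { int handle; };

/* Stands in for ibv_cmd_create_qp(); nonzero return models failure. */
static int create_cmd_stub(struct dummy_qp *qp, int fail)
{
    (void)qp;
    return fail;
}

static struct dummy_qp *create_qp_sketch(int fail)
{
    struct dummy_qp *qp = malloc(sizeof *qp);
    if (!qp)
        return NULL;        /* nothing allocated yet: plain return is fine */

    if (create_cmd_stub(qp, fail))
        goto err;           /* qp already allocated: must unwind */

    qp->handle = 0;
    return qp;
err:
    free(qp);
    return NULL;
}
```

The companion create_cq fix is the mirror image: there the `goto err` was wrong because nothing had been allocated yet, so a plain `return NULL` is correct.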
Signed-off-by: Pradipta Kumar Banerjee --- Index = libamso/verbs.c ============================================================================ --- verbs.org 2006-06-13 18:56:50.000000000 +0530 +++ verbs.c 2006-06-13 19:02:03.000000000 +0530 @@ -154,9 +154,8 @@ struct ibv_cq *amso_create_cq(struct ibv int ret; cq = malloc(sizeof *cq); - if (!cq) { - goto err; - } + if (!cq) + return NULL; ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, @@ -248,14 +247,15 @@ struct ibv_qp *amso_create_qp(struct ibv ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, &resp.ibv_resp, sizeof resp); if (ret) - return NULL; + goto err; #if 0 /* A reminder for bypass functionality */ qp->physaddr = resp.physaddr; #endif return &qp->ibv_qp; - +err: + free(qp); return NULL; } From swise at opengridcomputing.com Tue Jun 13 11:03:57 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 13:03:57 -0500 Subject: [openib-general] [PATCH resend] libamso: fix erroneous return and memory leak in verbs.c In-Reply-To: <20060613175457.GA8976@harry-potter.ibm.com> References: <20060613175457.GA8976@harry-potter.ibm.com> Message-ID: <1150221837.17394.46.camel@stevo-desktop> On Tue, 2006-06-13 at 23:25 +0530, Pradipta Kumar Banerjee wrote: > Forgot to add the 'Signed-off-by' > > This patch fixes an erroneous return in func amso_create_cq() and a memory > leak in amso_create_qp(). > > Signed-off-by: Pradipta Kumar Banerjee > Committed revision 7971. Thanks, Steve. From sean.hefty at intel.com Tue Jun 13 11:05:23 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Jun 2006 11:05:23 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150193430.570.138279.camel@hal.voltaire.com> Message-ID: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> >There are architected ways to do that. There's busy for MADs which could >be used for some MADs. 
For RMPP, would the transfer be ABORTed ? I don't >think you can switch to BUSY in the middle (but I'm not 100% sure). I >don't know how this limit is being used exactly, but it might be best if >the RMPP receive were treated as 1 MAD regardless of of how many >segments it was. Maybe I should back-up some here. There are a couple problems that I'm trying to solve, but the main goal is to prevent sending duplicate responses. I'd like to do this by detecting and dropping duplicate requests. To detect a duplicate request, my proposal is to move completed MADs to a "done_list". Newly received MADs would also check the done_list to determine if the MAD is a duplicate. When a user sends a response MAD, a check would be made against the done_list for a matching request that has not generated a response yet. If one is not found, then the send would be failed. Received MADs would be removed from the done_list when they are freed. My guess is that for kernel clients, the changes would probably be minimal. For usermode clients, the problem is more difficult, since we cannot trust usermode clients to generate responses correctly, and there's no free_mad call that maps to the kernel. One of the ideas then, is for the kernel umad module to learn which MADs generate responses. It would do this by updating an entry to a table whenever a response MAD is generated. A received MAD would check against the table to see if a response is supposed to be generated. If not, then the MAD would be freed after userspace claims it. If a response is expected, then the MAD would not be freed until the response was generated. Assuming minimal hard-coding of which methods are requests, a client would drop only about 1 MAD per method during start-up. Considering most requests are not sent reliably, this shouldn't be a big issue. (In fact, outside of a MultiPathRecord query, I don't believe any requests are sent reliably.) 
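The done_list idea above can be sketched as a small duplicate filter keyed on transaction ID and management class: a completed request is parked on the list, and a newly received request that matches an entry is dropped instead of processed. Everything here (the fixed-size table, the key fields, the absence of aging) is an illustrative assumption, not the eventual ib_mad implementation:

```c
#include <assert.h>
#include <stdint.h>

#define DONE_LIST_LEN 8

/* Hypothetical key; a real MAD match would also consider the source. */
struct mad_key { uint64_t tid; uint8_t mgmt_class; };

static struct mad_key done_list[DONE_LIST_LEN];
static unsigned int done_count;

static int is_duplicate(struct mad_key k)
{
    unsigned int i;

    for (i = 0; i < done_count; i++)
        if (done_list[i].tid == k.tid &&
            done_list[i].mgmt_class == k.mgmt_class)
            return 1;
    return 0;
}

/* Returns 1 if the request should be processed, 0 if dropped as a
 * duplicate. Real code would age entries out (the "older than ~20s"
 * eviction described above) instead of just capping the table. */
static int receive_request(struct mad_key k)
{
    if (is_duplicate(k))
        return 0;
    if (done_count < DONE_LIST_LEN)
        done_list[done_count++] = k;
    return 1;
}
```

The same table is what a send path would consult before emitting a response: a response whose key has no pending entry on the list would be failed, which is the "no duplicate responses" half of the proposal.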
And I would argue that even if a request has been acknowledged, the sender of the request would still need to deal with the case that no response is ever generated. If this approach were taken, then, it brings up the issue that MADs are being stored in the kernel waiting for a response. But what if a response is never generated? This problem is somewhat related to MADs being queued in the kernel, but the userspace app doesn't call down to receive them. Ideally, we could come up with a single solution to both problems, but that may not be possible. My current thoughts on how to handle requests are to time when each request MAD is received, and queue it. Once the queue is full, if another request is received, it would check the MAD at the head of the queue. If the MAD at the head was older than some selected value (say 20 seconds), it would be bumped from the queue, and the new request would be added to the tail. - Sean From rdreier at cisco.com Tue Jun 13 11:05:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jun 2006 11:05:55 -0700 Subject: [openib-general] [PATCH] mthca: restore missing registers In-Reply-To: <20060612135751.GB19518@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 12 Jun 2006 16:57:51 +0300") References: <20060612135751.GB19518@mellanox.co.il> Message-ID: Thanks, applied for 2.6.17 From rdreier at cisco.com Tue Jun 13 11:08:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jun 2006 11:08:22 -0700 Subject: [openib-general] [PATCH updated] mthca: memfree completion with error workaround In-Reply-To: <20060613141942.GW19518@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 13 Jun 2006 17:19:42 +0300") References: <20060612121635.GX7359@mellanox.co.il> <20060613141942.GW19518@mellanox.co.il> Message-ID: Yeah, I like this much more. It doesn't seem that likely that there will be another firmware bug with the same symptoms, and we have to trust some of what the hardware tells us... - R. 
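The mthca fix Roland describes comes down to the driver serializing calls on the same object itself, since core_locking.txt says the midlayer will not. A toy sketch of per-QP serialization, with a userspace pthread mutex standing in for the kernel lock (types are illustrative, not mthca's):

```c
#include <assert.h>
#include <pthread.h>

struct sketch_qp {
    pthread_mutex_t mutex;   /* per-QP: concurrent modify calls serialize here */
    int state;
};

/* Analogue of a reentrant modify_qp: two callers racing on the same QP
 * now take turns instead of issuing overlapping firmware commands. */
static int modify_qp_sketch(struct sketch_qp *qp, int new_state)
{
    pthread_mutex_lock(&qp->mutex);
    qp->state = new_state;   /* real code would post the firmware command */
    pthread_mutex_unlock(&qp->mutex);
    return 0;
}
```

A per-object lock keeps the common case cheap (different QPs never contend), which is presumably why the audit targets modify_qp and modify_srq rather than adding one device-wide lock.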
From eitan at mellanox.co.il Tue Jun 13 11:21:11 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 21:21:11 +0300 Subject: [openib-general] [PATCH] OpenSM/SA: Properly handle non base LID requests to someSA records Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302368841@mtlexch01.mtl.com> Sure. Looks good to me Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, June 13, 2006 7:42 PM > To: openib-general at openib.org > Cc: Eitan Zahavi > Subject: [PATCH] OpenSM/SA: Properly handle non base LID requests to someSA > records > > OpenSM/SA: Properly handle non base LID requests to some SA records > > In osm_sa_node_record.c and osm_sa_portinfo_record.c, properly handle > non base LID requests per C15-0.1.11: Query responses shall contain a > port's base LID in any LID component of a RID. So when LMC is non 0, > the only records that appear are those with the base LID and not with > any masked LIDs. Furthermore, if a query comes in on a non base LID, the > LID in the RID returned is only with the base LID. > > Also, fixed some endian issues in osm_log messages. > > Note: Similar patch for other affected SA records will follow. 
> > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_sa_node_record.c > =================================================================== > --- opensm/osm_sa_node_record.c (revision 7961) > +++ opensm/osm_sa_node_record.c (working copy) > @@ -200,12 +200,11 @@ __osm_nr_rcv_create_nr( > uint8_t port_num; > uint8_t num_ports; > uint16_t match_lid_ho; > - uint16_t lid_ho; > + ib_net16_t base_lid; > ib_net16_t base_lid_ho; > ib_net16_t max_lid_ho; > uint8_t lmc; > ib_net64_t port_guid; > - ib_api_status_t status; > > OSM_LOG_ENTER( p_rcv->p_log, __osm_nr_rcv_create_nr ); > > @@ -245,7 +244,8 @@ __osm_nr_rcv_create_nr( > if( match_port_guid && ( port_guid != match_port_guid ) ) > continue; > > - base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); > + base_lid = osm_physp_get_base_lid( p_physp ); > + base_lid_ho = cl_ntoh16( base_lid ); > lmc = osm_physp_get_lmc( p_physp ); > max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); > match_lid_ho = cl_ntoh16( match_lid ); > @@ -260,29 +260,18 @@ __osm_nr_rcv_create_nr( > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_create_nr: " > "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", > - cl_ntoh16( base_lid_ho ), > - cl_ntoh16( match_lid_ho ), > - cl_ntoh16( max_lid_ho ) > + base_lid_ho, match_lid_ho, max_lid_ho > ); > } > > if( (match_lid_ho <= max_lid_ho) && (match_lid_ho >= base_lid_ho) ) > { > - __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, match_lid ); > + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); > } > } > else > { > - /* > - For every lid value create a Node Record. 
> - */ > - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) > - { > - status = __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, > - port_guid, cl_hton16( lid_ho ) ); > - if( status != IB_SUCCESS ) > - break; > - } > + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); > } > } > > Index: opensm/osm_sa_portinfo_record.c > =================================================================== > --- opensm/osm_sa_portinfo_record.c (revision 7961) > +++ opensm/osm_sa_portinfo_record.c (working copy) > @@ -194,9 +194,9 @@ __osm_sa_pir_create( > IN osm_pir_search_ctxt_t* const p_ctxt ) > { > uint8_t lmc; > - uint16_t lid_ho; > uint16_t max_lid_ho; > uint16_t base_lid_ho; > + uint16_t match_lid_ho; > > OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_pir_create ); > > @@ -218,17 +218,28 @@ __osm_sa_pir_create( > > if( p_ctxt->comp_mask & IB_PIR_COMPMASK_LID ) > { > - __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, > - p_ctxt->p_rcvd_rec->lid ); > - } > - else > - { > - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) > + match_lid_ho = cl_ntoh16( p_ctxt->p_rcvd_rec->lid ); > + > + /* > + We validate that the lid belongs to this node. 
> + */ > + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, > - cl_hton16( lid_ho ) ); > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_sa_pir_create: " > + "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", > + base_lid_ho, match_lid_ho, max_lid_ho > + ); > } > + > + if ( match_lid_ho < base_lid_ho || match_lid_ho > max_lid_ho ) > + goto Exit; > } > + > + __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, > + cl_hton16( base_lid_ho ) ); > + > + Exit: > OSM_LOG_EXIT( p_rcv->p_log ); > } > > From rdreier at cisco.com Tue Jun 13 11:19:13 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jun 2006 11:19:13 -0700 Subject: [openib-general] [git pull] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This has a couple of mthca driver bug fixes: Michael S. 
Tsirkin: IB/mthca: restore missing PCI registers after reset IB/mthca: memfree completion with error FW bug workaround drivers/infiniband/hw/mthca/mthca_cq.c | 11 +++++ drivers/infiniband/hw/mthca/mthca_reset.c | 59 +++++++++++++++++++++++++++++ 2 files changed, 69 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 205854e..87a8f11 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -540,8 +540,17 @@ static inline int mthca_poll_one(struct entry->wr_id = srq->wrid[wqe_index]; mthca_free_srq_wqe(srq, wqe); } else { + s32 wqe; wq = &(*cur_qp)->rq; - wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + wqe = be32_to_cpu(cqe->wqe); + wqe_index = wqe >> wq->wqe_shift; + /* + * WQE addr == base - 1 might be reported in receive completion + * with error instead of (rq size - 1) by Sinai FW 1.0.800 and + * Arbel FW 5.1.400. This bug should be fixed in later FW revs. + */ + if (unlikely(wqe_index < 0)) + wqe_index = wq->max - 1; entry->wr_id = (*cur_qp)->wrid[wqe_index]; } diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c index df5e494..f4fddd5 100644 --- a/drivers/infiniband/hw/mthca/mthca_reset.c +++ b/drivers/infiniband/hw/mthca/mthca_reset.c @@ -49,6 +49,12 @@ int mthca_reset(struct mthca_dev *mdev) u32 *hca_header = NULL; u32 *bridge_header = NULL; struct pci_dev *bridge = NULL; + int bridge_pcix_cap = 0; + int hca_pcie_cap = 0; + int hca_pcix_cap = 0; + + u16 devctl; + u16 linkctl; #define MTHCA_RESET_OFFSET 0xf0010 #define MTHCA_RESET_VALUE swab32(1) @@ -110,6 +116,9 @@ #define MTHCA_RESET_VALUE swab32(1) } } + hca_pcix_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + hca_pcie_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (bridge) { bridge_header = kmalloc(256, GFP_KERNEL); if (!bridge_header) { @@ -129,6 +138,13 @@ #define MTHCA_RESET_VALUE swab32(1) goto out; } } + 
bridge_pcix_cap = pci_find_capability(bridge, PCI_CAP_ID_PCIX); + if (!bridge_pcix_cap) { + err = -ENODEV; + mthca_err(mdev, "Couldn't locate HCA bridge " + "PCI-X capability, aborting.\n"); + goto out; + } } /* actually hit reset */ @@ -178,6 +194,20 @@ #define MTHCA_RESET_VALUE swab32(1) good: /* Now restore the PCI headers */ if (bridge) { + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0x8, + bridge_header[(bridge_pcix_cap + 0x8) / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Upstream " + "split transaction control, aborting.\n"); + goto out; + } + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0xc, + bridge_header[(bridge_pcix_cap + 0xc) / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Downstream " + "split transaction control, aborting.\n"); + goto out; + } /* * Bridge control register is at 0x3e, so we'll * naturally restore it last in this loop. @@ -203,6 +233,35 @@ good: } } + if (hca_pcix_cap) { + if (pci_write_config_dword(mdev->pdev, hca_pcix_cap, + hca_header[hca_pcix_cap / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-X " + "command register, aborting.\n"); + goto out; + } + } + + if (hca_pcie_cap) { + devctl = hca_header[(hca_pcie_cap + PCI_EXP_DEVCTL) / 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_DEVCTL, + devctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI Express " + "Device Control register, aborting.\n"); + goto out; + } + linkctl = hca_header[(hca_pcie_cap + PCI_EXP_LNKCTL) / 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_LNKCTL, + linkctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI Express " + "Link control register, aborting.\n"); + goto out; + } + } + for (i = 0; i < 16; ++i) { if (i * 4 == PCI_COMMAND) continue; From rjwalsh at pathscale.com Tue Jun 13 11:25:39 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 13 Jun 2006 11:25:39 -0700 Subject: [openib-general] 
[RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: <20060613051149.GE4621@mellanox.co.il> Message-ID: <1150223140.11881.2.camel@hematite.internal.keyresearch.com> On Tue, 2006-06-13 at 10:55 -0700, Roland Dreier wrote: > Michael> Won't this let the user issue multiple modify QP commands > Michael> in parallel on the same QP? mthca at least does not > Michael> protect against such attempts, and doing this will > Michael> confuse the hardware. > > Hmm, that's a good point. But I did write the following in > Documentation/infiniband/core_locking.txt: > > All of the methods in struct ib_device exported by a low-level > driver must be fully reentrant. The low-level driver is required to > perform all synchronization necessary to maintain consistency, even > if multiple function calls using the same object are run > simultaneously. > > The IB midlayer does not perform any serialization of function calls. > > So I guess this is a bug in mthca. We have a similar problem in resource checking - we were relying on the idr lock to keep us safe. I'll fix that up, too. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Tue Jun 13 11:32:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 14:32:54 -0400 Subject: [openib-general] [PATCH 1/4] Simplification of the ucast fdb dumps. 
In-Reply-To: <20060611003238.22430.62423.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003238.22430.62423.stgit@sashak.voltaire.com> Message-ID: <1150223563.570.156637.camel@hal.voltaire.com> On Sat, 2006-06-10 at 20:32, Sasha Khapyorsky wrote: > This separates the dump procedure from rest of the flow and prevents > multiple fopen()/fclose() (one pair per switch) - one fopen() and one > fclose() instead. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (with some cosmetic changes). -- Hal From sashak at voltaire.com Tue Jun 13 13:00:35 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 13 Jun 2006 23:00:35 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060613170246.GH23320@durango.c3.lanl.gov> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> Message-ID: <20060613200035.GG10482@sashak.voltaire.com> Hi Greg, On 11:02 Tue 13 Jun , Greg Johnson wrote: > On Sun, Jun 11, 2006 at 03:27:58AM +0300, Sasha Khapyorsky wrote: > > Hi, > > > > There are couple of unicast routing related patches for OpenSM. > > > > Basically it implements routing module which provides possibility to load > > switch forwarding tables from pre-created dump file. Currently unicast > > tables loading is only supported, multicast may be added in a future. > > > > Short patch descriptions (more details may be found in emails with > > patches): > > > > 1. Ucast dump file simplification. > > 2. Modular routing - preliminary implements generic model to plug new > > routing engine to OpenSM. > > 3. New simple unicast routing engine which allows to load LFTs from > > pre-created dump file. > > 4. Example of ucast dump generation script. > > > > Please comment and test. Thanks. > > We tried this on our 256-node cluster with a single chassis Voltaire > 288-port switch. Thanks. 
> It seems to load the routes generated by the dump > script, but afterward it is not possible to dump the routes again. This means you have broken LFTs now. Probably I know what is going on here - new LFTs don't have " 0" entries, and switches are not accessible by LIDs anymore. Please update 'ibroute' utility (diags/) from the trunk and recreate the dump file - this should fix the problem. (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). > I > would like to re-dump the routes after loading to ensure that they were > loaded correctly. > > After loading routes with "opensm -R file -U dump_file", dump_lfts.sh > gives: > > nodeinfo > 0000 0000 0000 0000 0000 0000 0000 0000 > 0000 0000 0000 0000 0000 0000 0000 0000 > 0000 0000 0000 0000 0000 0000 0000 0000 > 0000 0000 0000 0000 0000 0000 0000 0000 > ibroute: iberror: dump tables failed: node info failed: valid addr? > > for each switch. > > Also, I had to delete a space in the sed script on line 17 of > dump_lfts.sh: > > sed -ne 's/^.* lid \([1-9a-f]*\) .*$/\1/p' > > became > > sed -ne 's/^.* lid \([1-9a-f]*\).*$/\1/p' I see. I've used ibswitches/ibnetdiscover from the trunk, there is some minor difference in the output (' lmc N' was added). I think with your change the script will work with both old and new outputs. Thanks for the fix. > Thanks for the work! Thanks for trying this. Sasha From swise at opengridcomputing.com Tue Jun 13 13:34:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 15:34:31 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. 
In-Reply-To: <1150127698.22704.9.camel@trinity.ogc.int> References: <20060607200600.9003.56328.stgit@stevo-desktop> <20060607200605.9003.25830.stgit@stevo-desktop> <20060608005452.087b34db.akpm@osdl.org> <1150127698.22704.9.camel@trinity.ogc.int> Message-ID: <1150230871.17394.68.camel@stevo-desktop> > > > +static void cm_event_handler(struct iw_cm_id *cm_id, > > > + struct iw_cm_event *iw_event) > > > +{ > > > + struct iwcm_work *work; > > > + struct iwcm_id_private *cm_id_priv; > > > + unsigned long flags; > > > + > > > + work = kmalloc(sizeof(*work), GFP_ATOMIC); > > > + if (!work) > > > + return; > > > > This allocation _will_ fail sometimes. The driver must recover from it. > > Will it do so? > > Er...no. It will lose this event. Depending on the event...the carnage > varies. We'll take a look at this. > This behavior is consistent with the Infiniband CM (see drivers/infiniband/core/cm.c function cm_recv_handler()). But I think we should at least log an error because a lost event will usually stall the rdma connection. > > > > > +EXPORT_SYMBOL(iw_cm_init_qp_attr); > > > > This file exports a ton of symbols. It's usual to provide some justifying > > commentary in the changelog when this happens. > > This module is a logical instance of the xx_cm where xx is the transport > type. I think there is some discussion warranted on whether or not these > should all be built into and exported by rdma_cm. One rationale would be > that the rdma_cm is the only client for many of these functions (this > being a particularly good example) and doing so would reduce the export > count. Others would be reasonably needed for any application (connect, > etc...) > Transport-dependent ULPs, in theory, are able to use the transport-specific CM directly if they don't wish to use the RDMA CM. I think that's the rationale for have the xx_cm modules seperate from the rdma_cm module and exporting the various functions. 
> All that said, we'll be sure to document the exported symbols in a > follow-up patch. > I'll add commentary explaining this. Steve. From sashak at voltaire.com Tue Jun 13 13:39:58 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 13 Jun 2006 23:39:58 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881F@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881F@mtlexch01.mtl.com> Message-ID: <20060613203958.GI10482@sashak.voltaire.com> Hi Eitan, On 09:36 Sun 11 Jun , Eitan Zahavi wrote: > Hi Sasha, > > General comments: > 1. I hope the change in osm.fdbs is not going to break the parser in > ibdm:Fabric.cpp - The file format was not changed, I don't expect brokenness. > was it really necessary change? Yes, in order to create unified osm.fdbs with any routing engine. > or just nice to have ? This is the nice side effect. > 2. The modular routing is a great idea. From my first glance it seems > that it assumes calculation of min-hop-tables is common to all routing > engines. Yes and no. Currently the min-hop-tables are used with multicast, so it is common code. But I expect this will be different in the future (for instance extend this loader to handle multicast tables too). > I think it should be a callback provided by the engine too. Yes, when it will be useful. > Please note that the Min-Hop engine takes most of the routing time so in > the future if we could avoid that stage it would be even better. Agree. Thanks for the comments. Sasha > [EZ] We should start thinking about testing of this new feature too. > > Further comment on the patches themselves. > > > There are couple of unicast routing related patches for OpenSM. > > > > Basically it implements routing module which provides possibility to > load > > switch forwarding tables from pre-created dump file. 
Currently unicast > > tables loading is only supported, multicast may be added in a future. > > > > Short patch descriptions (more details may be found in emails with > > patches): > > > > 1. Ucast dump file simplification. > > 2. Modular routing - preliminary implements generic model to plug new > > routing engine to OpenSM. > > 3. New simple unicast routing engine which allows to load LFTs from > > pre-created dump file. > > 4. Example of ucast dump generation script. > > > > Please comment and test. Thanks. > > > > Sasha From betsy at pathscale.com Tue Jun 13 13:44:16 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Tue, 13 Jun 2006 13:44:16 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71E5@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71E5@mtlexch01.mtl.com> Message-ID: <1150231456.3034.219.camel@sarium.pathscale.com> Tziporet - this plan makes sense. We'll let you know how the testing goes. BTW, for some reason, if you click on the URL you sent out, it just hangs but if you type it in, it works. Not sure why. Thanks, Betsy On Tue, 2006-06-13 at 16:07 +0300, Tziporet Koren wrote: > Hi All, > > > > After reading the mail thread regarding OFED release I have decided > this: > > > > We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > > > > We checked that all modules compile and loaded on this build > (including ipath and uDAPL) > > The only missing parts of this release from the final release are the > documents, and the scripts rpm that Scott requested. > > > > I think testing this version 3 days (Tuesday, Wednesday and Thursday) > should be enough as Scott wrote. > > So – we can do the official OFED 1.0 release on Friday 16-June. > > > > Matt – please check with Novel if this date is acceptable by them. > > > > If not then the earliest we can do the release if Thursday 15-June. 
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380

From halr at voltaire.com Tue Jun 13 13:39:01 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jun 2006 16:39:01 -0400
Subject: [openib-general] [PATCH] OpenSM/modular-routing.txt: Add description of modular routing
Message-ID: <1150231129.570.161009.camel@hal.voltaire.com>

OpenSM/doc/modular_routing.txt: Add description of modular routing

Signed-off-by: Hal Rosenstock

Index: osm/doc/modular-routing.txt
===================================================================
--- osm/doc/modular-routing.txt (revision 0)
+++ osm/doc/modular-routing.txt (revision 0)
@@ -0,0 +1,53 @@
+Modular routing engine structure has been added to allow
+for ease of "plugging" in a new routing module.
+
+Currently, only unicast callbacks are supported. Multicast
+can be added later.
+
+An existing routing module is up-down "updn", which may be
+activated with the '-R updn' option (instead of the old '-u').
+
+General usage is:
+$ opensm -R 'module-name'
+
+There is also a trivial routing module which is able
+to load LFT tables from a dump file.
+
+Main features:
+
+- support for unicast LFTs only, support for multicast can be added later
+- this will run after min hop matrix calculation
+- this will load switch LFTs according to the path entries introduced in
+  the dump file
+- no additional checks will be performed (like is the port connected, etc.)
+- in case fabric LIDs were changed, this will try to reconstruct LFTs
+  correctly if endport GUIDs are represented in the dump file (in order
+  to disable this, GUIDs may be removed from the dump file or zeroed)
+
+The dump file format is compatible with the output of the 'ibroute' util, and
+for the whole fabric may be generated with a script like this:
+
+  for sw_lid in `ibswitches | awk '{print $NF}'` ; do
+    ibroute $sw_lid
+  done > /path/to/dump_file
+
+, or using DR paths:
+
+  for sw_dr in `ibnetdiscover -v \
+      | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \
+      | sed -e 's/\]\[/,/g' \
+      | sort -u` ; do
+    ibroute -D ${sw_dr}
+  done > /path/to/dump_file
+
+In order to activate the new module use:
+
+  opensm -R file -U /path/to/dump_file
+
+NOTE: ibroute has been updated to support this (for switch management ports).
+Also, lmc was added to switch management ports. ibroute needs to be 7855 or
+later from the trunk.

From halr at voltaire.com Tue Jun 13 14:15:44 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jun 2006 17:15:44 -0400
Subject: [openib-general] RFC: detecting duplicate MAD requests
In-Reply-To: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com>
References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com>
Message-ID: <1150233334.570.162326.camel@hal.voltaire.com>

On Tue, 2006-06-13 at 14:05, Sean Hefty wrote:
> >There are architected ways to do that. There's busy for MADs which could
> >be used for some MADs. For RMPP, would the transfer be ABORTed ? I don't
> >think you can switch to BUSY in the middle (but I'm not 100% sure).
I > >don't know how this limit is being used exactly, but it might be best if > >the RMPP receive were treated as 1 MAD regardless of of how many > >segments it was. > > Maybe I should back-up some here. There are a couple problems that I'm trying > to solve, but the main goal is to prevent sending duplicate responses. I'd like > to do this by detecting and dropping duplicate requests. > > To detect a duplicate request, my proposal is to move completed MADs to a > "done_list". Newly received MADs would also check the done_list to determine if > the MAD is a duplicate. When a user sends a response MAD, a check would be made > against the done_list for a matching request that has not generated a response > yet. If one is not found, then the send would be failed. > > Received MADs would be removed from the done_list when they are freed. My guess > is that for kernel clients, the changes would probably be minimal. For usermode > clients, the problem is more difficult, since we cannot trust usermode clients > to generate responses correctly, and there's no free_mad call that maps to the > kernel. > > One of the ideas then, is for the kernel umad module to learn which MADs > generate responses. It would do this by updating an entry to a table whenever a > response MAD is generated. A received MAD would check against the table to see > if a response is supposed to be generated. If not, then the MAD would be freed > after userspace claims it. If a response is expected, then the MAD would not be > freed until the response was generated. > > Assuming minimal hard-coding of which methods are requests, a client would drop > only about 1 MAD per method during start-up. Is this only the new methods which are not hard coded ? Would this invoke a timeout (and hopefully retry) ? > Considering most requests are not > sent reliably, this shouldn't be a big issue. (In fact, outside of a > MultiPathRecord query, I don't believe any requests are sent reliably.) 
If you mean sent via RMPP, then yes, only GetMulti is sent this way. > And I > would argue that even if a request has been acknowledged, the sender of the > request would still need to deal with the case that no response is ever > generated. Are you referring to a request being acknowledged but the response is not sent (yet) ? > If this approach were taken, then, it brings up the issue that MADs are being > stored in the kernel waiting for a response. But what if a response is never > generated? This problem is somewhat related to MADs being queued in the kernel, > but the userspace app doesn't call down to receive them. Ideally, we could come > up with a single solution to both problems, but that may not be possible. > > My current thoughts on how to handle requests are to time when each request MAD > is received, and queue it. Once the queue is full, if another request is > received, it would check the MAD at the head of the queue. If the MAD at the > head was older than some selected value (say 20 seconds), it would be bumped > from the queue, and the new request would be added to the tail. For RMPP, this time should start when the last segment is received. Is that how you would envision it working ? I'm also not sure what the right timeout value would be for this. Where did 20 seconds come from ? -- Hal > - Sean From sashak at voltaire.com Tue Jun 13 14:36:06 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 00:36:06 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. 
In-Reply-To: <448EA1DD.7090204@mellanox.co.il> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003243.22430.56582.stgit@sashak.voltaire.com> <448EA1DD.7090204@mellanox.co.il> Message-ID: <20060613213606.GJ10482@sashak.voltaire.com> Hi Eitan, On 14:30 Tue 13 Jun , Eitan Zahavi wrote: > Hi Sasha, > > Please see my comments inside > > Sasha Khapyorsky wrote: > >This patch implements trivial routing module which able to load LFT > >tables from dump file. Main features: > >- support for unicast LFTs only, support for multicast can be added later > >- this will run after min hop matrix calculation > >- this will load switch LFTs according to the path entries introduced in > > the dump file > >- no additional checks will be performed (like is port connected, etc) > >- in case when fabric LIDs were changed this will try to reconstruct LFTs > > correctly if endport GUIDs are represented in the dump file (in order > > to disable this GUIDs may be removed from the dump file or zeroed) > I think you cold use the concept of directed routes for storing the LIDs > too. Maybe. But there is one disadvantage - such dump file will be node dependent, we will not be able to generate it on one node and load on another. Anyway the goal of LID/GUID checking is to provide minimal fixing for trivial case and not to limit the subnet administrator in what He/She wants to do. > So in case of new LID assignments you can extract the old -> new mapping by > scanning the LIDs of end ports by their DR path. I do it with GUID. > Anyway, I think it is required that you also perform topology matching such > that > if someone changed the topology you are able to figure it out and stop. > THIS IS A SERIOUS LIMITATION OF YOUR PROPOSAL. I think this is limitation of the subnet administrator's choice - one may want to create LFT with entries for yet not connected nodes. 
If you are about more "safe" dump loader, this may be done (and the code may be reused), but I think this should be different routing method. > >The dump file format is compatible with output of 'ibroute' util and for > >whole fabric may be generated with script like this: > > > > for sw_lid in `ibswitches | awk '{print $NF}'` ; do > > ibroute $sw_lid > > done > /path/to/dump_file > > > >, or using DR paths: > > > > > > for sw_dr in `ibnetdiscover -v \ > > | sed -ne '/^DR path .* switch /s/^DR path > > \[\(.*\)\].*$/\1/p' \ > > | sed -e 's/\]\[/,/g' \ > > | sort -u` ; do > > ibroute -D ${sw_dr} > > done > /path/to/dump_file > WE SHOULD ALSO PROVIDE A DUMP FILE VIA: > 1. OpenSM should dump its routes using this format (like it does today > using osm.fdbs) In this way you may generate dump with LFTs created only by OpenSM (and not by other SMs). This is unnecessary limitation for primary method. However I agree that as additional method this may be good and useful. Please feel free to provide the path for this. > 2. ibdiagnet Ditto > > > > > > > >diff --git a/osm/include/opensm/osm_subnet.h > >b/osm/include/opensm/osm_subnet.h > >index a637367..ec1d056 100644 > >--- a/osm/include/opensm/osm_subnet.h > >+++ b/osm/include/opensm/osm_subnet.h > >@@ -423,6 +424,10 @@ typedef struct _osm_subn_opt > > * routing_engine_name > > * Name of used routing engine (other than default Min Hop Algorithm) > > * > >+* ucast_dump_file > >+* Name of the unicast routing dump file from where switch > >+* forwearding tables will be loaded > ^^^^^^^^^^^ > forwarding Thanks. Will fix. > >+ "cannot parse port guid " > >+ "(maybe broken dump): " > >+ "\'%s\'\n", p); > >+ port_guid = 0; > >+ } > >+ } > >+ port_guid = cl_hton64(port_guid); > >+ add_path(p_osm, p_sw, lid, port_num, port_guid); > >+ } > >+ } > >+ > >+ fclose(file); > >+ return 0; > >+} > In OpenSM we write with style: > if () { > } > else if () > { > } > else > { > } > > Not any other combination Really? 
Don't want to bother with examples, but I may see almost any "combination" in OpenSM and it is not clear for me which one is common (the coding style and identation are different even from file to file). Thanks for comments. Sasha From sean.hefty at intel.com Tue Jun 13 14:36:46 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Jun 2006 14:36:46 -0700 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <1150230871.17394.68.camel@stevo-desktop> Message-ID: <000001c68f31$78910fe0$24268686@amr.corp.intel.com> >> Er...no. It will lose this event. Depending on the event...the carnage >> varies. We'll take a look at this. >> > >This behavior is consistent with the Infiniband CM (see >drivers/infiniband/core/cm.c function cm_recv_handler()). But I think >we should at least log an error because a lost event will usually stall >the rdma connection. I believe that there's a difference here. For the Infiniband CM, an allocation error behaves the same as if the received MAD were lost or dropped. Since MADs are unreliable anyway, it's not so much that an IB CM event gets lost, as it doesn't ever occur. A remote CM should retry the send, which hopefully allows the connection to make forward progress. - Sean From swise at opengridcomputing.com Tue Jun 13 14:46:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 16:46:36 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <000001c68f31$78910fe0$24268686@amr.corp.intel.com> References: <000001c68f31$78910fe0$24268686@amr.corp.intel.com> Message-ID: <1150235196.17394.91.camel@stevo-desktop> On Tue, 2006-06-13 at 14:36 -0700, Sean Hefty wrote: > >> Er...no. It will lose this event. Depending on the event...the carnage > >> varies. We'll take a look at this. > >> > > > >This behavior is consistent with the Infiniband CM (see > >drivers/infiniband/core/cm.c function cm_recv_handler()). 
But I think > >we should at least log an error because a lost event will usually stall > >the rdma connection. > > I believe that there's a difference here. For the Infiniband CM, an allocation > error behaves the same as if the received MAD were lost or dropped. Since MADs > are unreliable anyway, it's not so much that an IB CM event gets lost, as it > doesn't ever occur. A remote CM should retry the send, which hopefully allows > the connection to make forward progress. > hmm. Ok. I see. I misunderstood the code in cm_recv_handler(). Tom and I have been talking about what we can do to not drop the event. Stay tuned. Steve. From sean.hefty at intel.com Tue Jun 13 14:58:33 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Jun 2006 14:58:33 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150233334.570.162326.camel@hal.voltaire.com> Message-ID: <000201c68f34$83c2c950$24268686@amr.corp.intel.com> >> Assuming minimal hard-coding of which methods are requests, a client would >drop >> only about 1 MAD per method during start-up. > >Is this only the new methods which are not hard coded ? Would this >invoke a timeout (and hopefully retry) ? We can hard-code existing methods to avoid this problem. So only unknown methods would be affected, which would affect user-defined classes more than the existing classes. In most cases, I would expect the sender to timeout and retry the request, which hopefully comes after the request table has been updated. >> And I >> would argue that even if a request has been acknowledged, the sender of the >> request would still need to deal with the case that no response is ever >> generated. > >Are you referring to a request being acknowledged but the response is >not sent (yet) ? Yes. >> My current thoughts on how to handle requests are to time when each request >MAD >> is received, and queue it. 
Once the queue is full, if another request is >> received, it would check the MAD at the head of the queue. If the MAD at the >> head was older than some selected value (say 20 seconds), it would be bumped >> from the queue, and the new request would be added to the tail. > >For RMPP, this time should start when the last segment is received. Is >that how you would envision it working ? Correct. Part of the motivation here is if a client cannot or will not generate a response for some reason, we don't want to keep the MAD hanging around forever. >I'm also not sure what the right timeout value would be for this. Where >did 20 seconds come from ? I just made that up. Something like this would probably have to be adaptable, and would likely depend on the size of the fabric. In most cases, I would guess that a timeout indicates some sort of error in the client, so I would tend towards a larger timeout. - Sean From halr at voltaire.com Tue Jun 13 15:26:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 18:26:34 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000201c68f34$83c2c950$24268686@amr.corp.intel.com> References: <000201c68f34$83c2c950$24268686@amr.corp.intel.com> Message-ID: <1150237590.570.164947.camel@hal.voltaire.com> On Tue, 2006-06-13 at 17:58, Sean Hefty wrote: > >> Assuming minimal hard-coding of which methods are requests, a client would > >drop > >> only about 1 MAD per method during start-up. > > > >Is this only the new methods which are not hard coded ? Would this > >invoke a timeout (and hopefully retry) ? > > We can hard-code existing methods to avoid this problem. So only unknown > methods would be affected, which would affect user-defined classes more than the > existing classes. I would expect vendor classes to follow the standard methods unless they need something different. 
> In most cases, I would expect the sender to timeout and retry the request, which > hopefully comes after the request table has been updated. > > >> And I > >> would argue that even if a request has been acknowledged, the sender of the > >> request would still need to deal with the case that no response is ever > >> generated. > > > >Are you referring to a request being acknowledged but the response is > >not sent (yet) ? > > Yes. > > >> My current thoughts on how to handle requests are to time when each request > >MAD > >> is received, and queue it. Once the queue is full, if another request is > >> received, it would check the MAD at the head of the queue. If the MAD at the > >> head was older than some selected value (say 20 seconds), it would be bumped > >> from the queue, and the new request would be added to the tail. > > > >For RMPP, this time should start when the last segment is received. Is > >that how you would envision it working ? > > Correct. Part of the motivation here is if a client cannot or will not generate > a response for some reason, we don't want to keep the MAD hanging around > forever. > > >I'm also not sure what the right timeout value would be for this. Where > >did 20 seconds come from ? > > I just made that up. Something like this would probably have to be adaptable, > and would likely depend on the size of the fabric. In most cases, I would guess > that a timeout indicates some sort of error in the client, so I would tend > towards a larger timeout. Is the only downside of a larger timeout that potentially more memory accumulates (until the timeout occurs) before it is freed ? 
-- Hal > - Sean From robert.j.woodruff at intel.com Tue Jun 13 16:02:58 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 13 Jun 2006 16:02:58 -0700 Subject: [openib-general] OFED 1.0 release schedule Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F7377F@orsmsx408> Tziporet wrote, >We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > I tried the new tar ball and the pathscale driver now compiles (on Redhat EL4 - U3) and IPoIB and OpenSM appear to work OK, but Intel MPI/uDAPL and NetPipe/uDAPL are broken. It appears to be a problem with rdma operations. I also tried SDP/pathscale and it does not work either. Finally, the rdma_cm is missing the changes that match the uDAPL fix that was put in for the new setops for the CM timeouts. Arlin will provide specifics. We'd really like the rdma_cm fix in the release. woody -----Original Message----- From: Betsy Zeller [mailto:betsy at pathscale.com] Sent: Tuesday, June 13, 2006 1:44 PM To: Tziporet Koren Cc: Matt L. Leininger; Scott Weitzenkamp (sweitzen); Matters, Todd; Moni Levy; Woodruff, Robert J; openib; OpenFabricsEWG Subject: Re: OFED 1.0 release schedule Tziporet - this plan makes sense. We'll let you know how the testing goes. BTW, for some reason, if you click on the URL you sent out, it just hangs, but if you type it in, it works. Not sure why. Thanks, Betsy On Tue, 2006-06-13 at 16:07 +0300, Tziporet Koren wrote: > Hi All, > > > > After reading the mail thread regarding the OFED release I have decided > this: > > > > We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > > > > We checked that all modules compile and load on this build > (including ipath and uDAPL) > > The only missing parts of this release from the final release are the > documents, and the scripts rpm that Scott requested. > > > > I think testing this version for 3 days (Tuesday, Wednesday and Thursday) > should be enough, as Scott wrote.
> > So - we can do the official OFED 1.0 release on Friday 16-June. > > > > Matt - please check with Novell if this date is acceptable by them. > > > > If not, then the earliest we can do the release is Thursday 15-June. > > > > > > Tziporet Koren > > Software Director > > Mellanox Technologies > > mailto: tziporet at mellanox.co.il > Tel +972-4-9097200, ext 380 > > > > From ardavis at ichips.intel.com Tue Jun 13 16:07:26 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 13 Jun 2006 16:07:26 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA719C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA719C@mtlexch01.mtl.com> Message-ID: <448F452E.3090606@ichips.intel.com> Tziporet Koren wrote: >Jack put the bug fix into OFED 1.0. > >Tziporet > > Great. Did the CMA module (SVN 7742) changes also get in? If not, uDAPL is out of sync with CMA and will not work. -arlin From sashak at voltaire.com Tue Jun 13 16:20:47 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 02:20:47 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). In-Reply-To: <448EA7A1.8060206@mellanox.co.il> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003240.22430.88414.stgit@sashak.voltaire.com> <448EA7A1.8060206@mellanox.co.il> Message-ID: <20060613232047.GO10482@sashak.voltaire.com> Hi Eitan, On 14:55 Tue 13 Jun , Eitan Zahavi wrote: > > As provided in my previous patch 1/4 comments > I think the callbacks should also have an entry for the MinHop stage (maybe > this is the ucast_build_fwd_tables?) I have some algorithms in mind that > will > skip that stage altogether. We may add a new callback when it becomes useful. > Also it might make sense for each routing engine to provide its own "dump" > routine such that each could support a different file format if needed.
Why would we want a dump format per routing engine? Even if we do, you could put it into routing-engine-specific code. > > Rest of the comments are inline > > EZ > > Sasha Khapyorsky wrote: > > > >diff --git a/osm/include/opensm/osm_opensm.h > >b/osm/include/opensm/osm_opensm.h > >index 3235ad4..3e6e120 100644 > >--- a/osm/include/opensm/osm_opensm.h > >+++ b/osm/include/opensm/osm_opensm.h > >@@ -92,6 +92,18 @@ BEGIN_C_DECLS > > * > > *********/ > > > >+/* > >+ * routing engine structure - yet limited by ucast_fdb_assign and > >+ * ucast_build_fwd_tables (multicast callbacks may be added later) > >+ */ > >+struct osm_routing_engine { > >+ const char *name; > >+ void *context; > >+ int (*ucast_build_fwd_tables)(void *context); > >+ int (*ucast_fdb_assign)(void *context); > >+ void (*delete)(void *context); > >+}; > It would be nice if you added a standard header to this struct. > It is not clear to me what ucast_build_fwd_tables and > ucast_fdb_assign are mapping to. Ok, will add. BTW, it seems the OpenSM declarations were used for generating manuals or other docs. Do you know whether those /****h* /****s* /****f* markers are still in use? And with what tool? > Please see the next section as an example for a struct header. > >+ > > /****s* OpenSM: OpenSM/osm_opensm_t > > * NAME > > * osm_opensm_t > >@@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( > > i > > ); > > > >+ if (p_routing_eng->ucast_build_fwd_tables && > >+ p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == > >0) > >+ { > >+ cl_qmap_apply_func( p_sw_guid_tbl, > >+ __osm_ucast_mgr_set_table_cb, p_mgr ); > >+ } /* fallback on the regular path in case of failures */ > >+ else > >+ { > Please explain why this step is needed and why if the routing engine > function is > returning 0 you still invoke the standard __osm_ucast_mgr_set_table_cb. ->ucast_build_fwd_tables() creates the fwd tables and __osm_ucast_mgr_set_table_cb() uploads them to the switches.
If ->ucast_build_fwd_tables() fails fatally (return status != 0), the table upload is skipped and the flow falls back to the default routing code. Thanks for the comments. Sasha From ardavis at ichips.intel.com Tue Jun 13 16:20:53 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 13 Jun 2006 16:20:53 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F7377F@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F7377F@orsmsx408> Message-ID: <448F4855.7060204@ichips.intel.com> Woodruff, Robert J wrote: >Tziporet wrote, > > >>We upload OFED-1.0-pre1.tgz to >>https://openib.org/svn/gen2/branches/1.0/ofed/releases/ >> >> >> > >I tried the new tar ball and the pathscale driver now >compiles (on Redhat EL4 - U3) and IPoIB and OpenSM appear to work OK, >but Intel MPI/uDAPL and NetPipe/uDAPL are broken. It appears to >be a problem with rdma operations. I also tried SDP/pathscale and >it does not work either. >Finally, the rdma_cm is missing the changes that match the uDAPL fix >that >was put in for the new setops for the CM timeouts. >Arlin will provide specifics. We'd really like the rdma_cm fix in the >release. > > > Here is a pointer to Sean's email/patches with the details: http://openib.org/pipermail/openib-general/2006-June/022654.html http://openib.org/pipermail/openib-general/2006-June/022655.html -arlin >woody > > >-----Original Message----- >From: Betsy Zeller [mailto:betsy at pathscale.com] >Sent: Tuesday, June 13, 2006 1:44 PM >To: Tziporet Koren >Cc: Matt L. Leininger; Scott Weitzenkamp (sweitzen); Matters, Todd; Moni >Levy; Woodruff, Robert J; openib; OpenFabricsEWG >Subject: Re: OFED 1.0 release schedule > >Tziporet - this plan makes sense. We'll let you know how the testing >goes. BTW, for some reason, if you click on the URL you sent out, it >just hangs but if you type it in, it works. Not sure why.
> >Thanks, Betsy > >On Tue, 2006-06-13 at 16:07 +0300, Tziporet Koren wrote: > > >>Hi All, >> >> >> >>After reading the mail thread regarding the OFED release I have decided >>this: >> >> >> >>We upload OFED-1.0-pre1.tgz to >>https://openib.org/svn/gen2/branches/1.0/ofed/releases/ >> >> >> >>We checked that all modules compile and load on this build >>(including ipath and uDAPL) >> >>The only missing parts of this release from the final release are the >>documents, and the scripts rpm that Scott requested. >> >> >> >>I think testing this version for 3 days (Tuesday, Wednesday and Thursday) >>should be enough, as Scott wrote. >> >>So - we can do the official OFED 1.0 release on Friday 16-June. >> >> >> >>Matt - please check with Novell if this date is acceptable by them. >> >> >> >>If not, then the earliest we can do the release is Thursday 15-June. >> >> >> >> >> >>Tziporet Koren >> >>Software Director >> >>Mellanox Technologies >> >>mailto: tziporet at mellanox.co.il >>Tel +972-4-9097200, ext 380 >> >> >> >> >> >> > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From sashak at voltaire.com Tue Jun 13 16:31:36 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 02:31:36 +0300 Subject: [openib-general] [PATCH 2/4 v2] Modular routing engine (unicast only yet). In-Reply-To: <20060611003240.22430.88414.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003240.22430.88414.stgit@sashak.voltaire.com> Message-ID: <20060613233136.GA12137@sashak.voltaire.com> Hi, The same patch, but with an added comment describing the osm_routing_engine structure. Sasha. This patch introduces a routing_engine structure which may be used for "plugging in" a new routing module.
Currently only unicast callbacks are supported (multicast can be added later). The existing up-down routing module, 'updn', may be activated with the '-R updn' option (instead of the old '-u'). General usage is: $ opensm -R 'module-name' Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_opensm.h | 45 ++++++++++++++++++++++- osm/include/opensm/osm_subnet.h | 16 ++------ osm/include/opensm/osm_ucast_updn.h | 26 ------------- osm/opensm/main.c | 26 +++++-------- osm/opensm/osm_opensm.c | 41 ++++++++++++++++++--- osm/opensm/osm_subnet.c | 23 ++++++------ osm/opensm/osm_ucast_mgr.c | 69 ++++++++++++++++++++++++----------- osm/opensm/osm_ucast_updn.c | 69 ++++++++++++++++++----------------- 8 files changed, 184 insertions(+), 131 deletions(-) diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h index 3235ad4..77d2a86 100644 --- a/osm/include/opensm/osm_opensm.h +++ b/osm/include/opensm/osm_opensm.h @@ -92,6 +92,46 @@ BEGIN_C_DECLS * *********/ +/****s* OpenSM: OpenSM/osm_routing_engine +* NAME +* struct osm_routing_engine +* +* DESCRIPTION +* OpenSM routing engine module definition. +* NOTES +* routing engine structure - yet limited by ucast_fdb_assign and +* ucast_build_fwd_tables (multicast callbacks may be added later) +*/ +struct osm_routing_engine { + const char *name; + void *context; + int (*ucast_build_fwd_tables)(void *context); + int (*ucast_fdb_assign)(void *context); + void (*delete)(void *context); +}; +/* +* FIELDS +* name +* The routing engine name (will be used in logs). +* +* context +* The routing engine context. Will be passed as parameter +* to the callback functions. +* +* ucast_build_fwd_tables +* The callback for unicast forwarding table generation. +* +* ucast_fdb_assign +* The same as above, but pretty integrated with default +* routing flow. Look at osm_ucast_mgr_process() and +* osm_ucast_updn.c for details. In future may be merged +* with ucast_build_fwd_tables() callback.
+* +* delete +* The delete method, may be used for routing engine +* internals cleanup. +*/ + /****s* OpenSM: OpenSM/osm_opensm_t * NAME * osm_opensm_t @@ -116,7 +156,7 @@ typedef struct _osm_opensm_t osm_log_t log; cl_dispatcher_t disp; cl_plock_t lock; - updn_t *p_updn_ucast_routing; + struct osm_routing_engine routing_engine; osm_stats_t stats; } osm_opensm_t; /* @@ -153,6 +193,9 @@ typedef struct _osm_opensm_t * lock * Shared lock guarding most OpenSM structures. * +* routing_engine +* Routing engine, will be initialized then used +* * stats * Open SM statistics block * diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index 4db449d..a637367 100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -272,13 +272,11 @@ typedef struct _osm_subn_opt uint32_t max_port_profile; osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; void * ui_pre_lid_assign_ctx; - osm_pfn_ui_extension_t pfn_ui_ucast_fdb_assign; - void * ui_ucast_fdb_assign_ctx; osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; void * ui_mcast_fdb_assign_ctx; boolean_t sweep_on_trap; osm_testability_modes_t testability_mode; - boolean_t updn_activate; + char * routing_engine_name; char * updn_guid_file; boolean_t exit_on_fatal; boolean_t honor_guid2lid_file; @@ -407,13 +405,6 @@ typedef struct _osm_subn_opt * ui_pre_lid_assign_ctx * A UI context (void *) to be provided to the pfn_ui_pre_lid_assign * -* pfn_ui_ucast_fdb_assign -* A UI function to be called instead of the ucast manager FDB -* configuration. -* -* ui_ucast_fdb_assign_ctx -* A UI context (void *) to be provided to the pfn_ui_ucast_fdb_assign -* * pfn_ui_mcast_fdb_assign * A UI function to be called inside the mcast manager instead of the * call for the build spanning tree. This will be called on every @@ -429,9 +420,8 @@ typedef struct _osm_subn_opt * testability_mode * Object that indicates if we are running in a special testability mode. 
* -* updn_activate -* Object that indicates if we are running the UPDN algorithm (TRUE) or -* Min Hop Algorithm (FALSE) +* routing_engine_name +* Name of used routing engine (other than default Min Hop Algorithm) * * updn_guid_file * Pointer to name of the UPDN guid file given by User diff --git a/osm/include/opensm/osm_ucast_updn.h b/osm/include/opensm/osm_ucast_updn.h index 027056c..fbf8782 100644 --- a/osm/include/opensm/osm_ucast_updn.h +++ b/osm/include/opensm/osm_ucast_updn.h @@ -421,32 +421,6 @@ osm_subn_calc_up_down_min_hop_table( * This function returns 0 when rankning has succeded , otherwise 1. ******/ -/****f* OpenSM: OpenSM/osm_updn_reg_calc_min_hop_table -* NAME -* osm_updn_reg_calc_min_hop_table -* -* DESCRIPTION -* Registration function to ucast routing manager (instead of -* Min Hop Algorithm) -* -* SYNOPSIS -*/ -int -osm_updn_reg_calc_min_hop_table( - IN updn_t * p_updn, - IN osm_subn_opt_t* p_opt ); -/* -* PARAMETERS -* -* RETURN VALUES -* 0 - on success , 1 - on failure -* -* NOTES -* -* SEE ALSO -* osm_subn_calc_up_down_min_hop_table -*********/ - /****** Osmsh: UpDown/osm_updn_find_root_nodes_by_min_hop * NAME * osm_updn_find_root_nodes_by_min_hop diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 22591eb..c888ed4 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -60,7 +60,6 @@ #include #include #include #include -#include #include /******************************************************************** @@ -174,10 +173,10 @@ show_usage(void) " may disrupt subnet traffic.\n" " Without -r, OpenSM attempts to preserve existing\n" " LID assignments resolving multiple use of same LID.\n\n"); - printf( "-u\n" - "--updn\n" - " This option activate UPDN algorithm instead of Min Hop\n" - " algorithm (default).\n"); + printf( "-R\n" + "--routing_engine \n" + " This option choose routing engine instead of Min Hop\n" + " algorithm (default). 
Supported engines: updn\n"); printf ("-a\n" "--add_guid_file \n" " Set the root nodes for the Up/Down routing algorithm\n" @@ -524,7 +523,7 @@ #endif boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:P:NQuvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:s:t:a:R:P:NQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -556,7 +555,7 @@ #endif { "reassign_lids", 0, NULL, 'r'}, { "priority", 1, NULL, 'p'}, { "smkey", 1, NULL, 'k'}, - { "updn", 0, NULL, 'u'}, + { "routing_engine",1, NULL, 'R'}, { "add_guid_file", 1, NULL, 'a'}, { "cache-options", 0, NULL, 'c'}, { "stay_on_fatal", 0, NULL, 'y'}, @@ -776,9 +775,9 @@ #endif opt.sm_key = sm_key; break; - case 'u': - opt.updn_activate = TRUE; - printf(" Activate UPDN algorithm\n"); + case 'R': + opt.routing_engine_name = optarg; + printf(" Activate \'%s\' routing engine\n", optarg); break; case 'a': @@ -885,13 +884,6 @@ #endif setup_signals(); osm_opensm_sweep( &osm ); - /* since osm_opensm_init get opt as RO we'll set the opt value with UI pfn here */ - /* Now do the registration */ - if (opt.updn_activate) - if (osm_updn_reg_calc_min_hop_table(osm.p_updn_ucast_routing, &(osm.subn.opt))) { - status = IB_ERROR; - goto Exit; - } if( run_once_flag == TRUE ) { diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 8c422b5..52f06da 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -68,6 +68,37 @@ #include #include #include +struct routing_engine_module { + const char *name; + int (*setup)(osm_opensm_t *p_osm); +}; + +extern int osm_ucast_updn_setup(osm_opensm_t *p_osm); + +const static struct routing_engine_module routing_modules[] = { + {"null", NULL}, + {"updn", osm_ucast_updn_setup }, + {} +}; + +static int setup_routing_engine(osm_opensm_t *p_osm, const char *name) +{ + const struct routing_engine_module *r; + for (r = routing_modules ; r->name && *r->name ; r++) { + 
if(!strcmp(r->name, name)) { + p_osm->routing_engine.name = r->name; + if (r->setup(p_osm)) + break; + osm_log (&p_osm->log, OSM_LOG_DEBUG, + "opensm: setup_routing_engine: " + "\'%s\' routing engine set up.\n", + p_osm->routing_engine.name); + return 0; + } + } + return -1; +} + /********************************************************************** **********************************************************************/ void @@ -118,7 +149,8 @@ osm_opensm_destroy( cl_disp_shutdown( &p_osm->disp ); /* do the destruction in reverse order as init */ - updn_destroy( p_osm->p_updn_ucast_routing ); + if (p_osm->routing_engine.delete) + p_osm->routing_engine.delete(p_osm->routing_engine.context); osm_sa_destroy( &p_osm->sa ); osm_sm_destroy( &p_osm->sm ); osm_db_destroy( &p_osm->db ); @@ -252,11 +284,8 @@ #endif if( status != IB_SUCCESS ) goto Exit; - /* HACK - the UpDown manager should have been a part of the osm_sm_t */ - /* Init updn struct */ - p_osm->p_updn_ucast_routing = updn_construct( ); - status = updn_init( p_osm->p_updn_ucast_routing ); - if( status != IB_SUCCESS ) + if( p_opt->routing_engine_name && + setup_routing_engine(p_osm, p_opt->routing_engine_name)) goto Exit; Exit: diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 7c08556..27f97ab 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -484,13 +484,11 @@ osm_subn_set_default_opt( p_opt->max_port_profile = 0xffffffff; p_opt->pfn_ui_pre_lid_assign = NULL; p_opt->ui_pre_lid_assign_ctx = NULL; - p_opt->pfn_ui_ucast_fdb_assign = NULL; - p_opt->ui_ucast_fdb_assign_ctx = NULL; p_opt->pfn_ui_mcast_fdb_assign = NULL; p_opt->ui_mcast_fdb_assign_ctx = NULL; p_opt->sweep_on_trap = TRUE; p_opt->testability_mode = OSM_TEST_MODE_NONE; - p_opt->updn_activate = FALSE; + p_opt->routing_engine_name = NULL; p_opt->updn_guid_file = NULL; p_opt->exit_on_fatal = TRUE; subn_set_default_qos_options(&p_opt->qos_options); @@ -911,9 +909,9 @@ osm_subn_parse_conf_file( "sweep_on_trap", p_key, 
p_val, &p_opts->sweep_on_trap); - __osm_subn_opts_unpack_boolean( - "updn_activate", - p_key, p_val, &p_opts->updn_activate); + __osm_subn_opts_unpack_charp( + "routing_engine", + p_key, p_val, &p_opts->routing_engine_name); __osm_subn_opts_unpack_charp( "log_file", p_key, p_val, &p_opts->log_file); @@ -1089,12 +1087,13 @@ osm_subn_write_conf_file( opts_file, "#\n# ROUTING OPTIONS\n#\n" "# If true do not count switches as link subscriptions\n" - "port_profile_switch_nodes %s\n\n" - "# Activate the Up/Down routing algorithm\n" - "updn_activate %s\n\n", - p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE", - p_opts->updn_activate ? "TRUE" : "FALSE" - ); + "port_profile_switch_nodes %s\n\n", + p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE"); + if (p_opts->routing_engine_name) + fprintf( opts_file, + "# Routing engine\n" + "routing_engine %s\n\n", + p_opts->routing_engine_name); if (p_opts->updn_guid_file) fprintf( opts_file, "# The file holding the Up/Down root node guids\n" diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index 301aea5..787ae02 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -62,6 +62,7 @@ #include #include #include #include +#include #define LINE_LENGTH 256 @@ -269,7 +270,7 @@ __osm_ucast_mgr_dump_ucast_routes( strcat( p_mgr->p_report_buf, "yes" ); else { - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) { + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { ui_ucast_fdb_assign_func_defined = TRUE; } else { ui_ucast_fdb_assign_func_defined = FALSE; @@ -708,7 +709,7 @@ __osm_ucast_mgr_process_port( node_guid = osm_node_get_node_guid(osm_switch_get_node_ptr( p_sw ) ); /* Flag to mark whether or not a ui ucast fdb assign function was given */ - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) ui_ucast_fdb_assign_func_defined = TRUE; else ui_ucast_fdb_assign_func_defined = FALSE; @@ -753,7 +754,7 @@ __osm_ucast_mgr_process_port( 
/* Up/Down routing can cause unreachable routes between some switches so we do not report that as an error in that case */ - if (!p_mgr->p_subn->opt.updn_activate) + if (!p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_ucast_mgr_process_port: ERR 3A08: " @@ -973,6 +974,18 @@ __osm_ucast_mgr_process_tbl( /********************************************************************** **********************************************************************/ static void +__osm_ucast_mgr_set_table_cb( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; + osm_ucast_mgr_t* const p_mgr = (osm_ucast_mgr_t*)context; + __osm_ucast_mgr_set_table( p_mgr, p_sw ); +} + +/********************************************************************** + **********************************************************************/ +static void __osm_ucast_mgr_process_neighbors( IN cl_map_item_t* const p_map_item, IN void* context ) @@ -1058,12 +1071,14 @@ osm_ucast_mgr_process( { uint32_t i; uint32_t iteration_max; + struct osm_routing_engine *p_routing_eng; osm_signal_t signal; cl_qmap_t *p_sw_guid_tbl; OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process ); p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl; + p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); @@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( i ); + if (p_routing_eng->ucast_build_fwd_tables && + p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) + { + cl_qmap_apply_func( p_sw_guid_tbl, + __osm_ucast_mgr_set_table_cb, p_mgr ); + } /* fallback on the regular path in case of failures */ + else + { /* This is the place where we can load pre-defined routes into the switches fwd_tbl structures. @@ -1136,32 +1159,34 @@ osm_ucast_mgr_process( Later code will use these values if not configured for re-assignment. 
*/ - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) - { - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + if (p_routing_eng->ucast_fdb_assign) { - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "osm_ucast_mgr_process: " - "Invoking UI function pfn_ui_ucast_fdb_assign\n"); - } - p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign(p_mgr->p_subn->opt.ui_ucast_fdb_assign_ctx); - } else { + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "osm_ucast_mgr_process: " + "Invoking \'%s\' function ucast_fdb_assign\n", + p_routing_eng->name); + } + p_routing_eng->ucast_fdb_assign(p_routing_eng->context); + } else { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "osm_ucast_mgr_process: " "UI pfn was not invoked\n"); - } + } - osm_log(p_mgr->p_log, OSM_LOG_INFO, - "osm_ucast_mgr_process: " - "Min Hop Tables configured on all switches\n"); + osm_log(p_mgr->p_log, OSM_LOG_INFO, + "osm_ucast_mgr_process: " + "Min Hop Tables configured on all switches\n"); - /* - Now that the lid matrixes have been built, we can - build and download the switch forwarding tables. - */ + /* + Now that the lid matrixes have been built, we can + build and download the switch forwarding tables. 
+ */ - cl_qmap_apply_func( p_sw_guid_tbl, - __osm_ucast_mgr_process_tbl, p_mgr ); + cl_qmap_apply_func( p_sw_guid_tbl, + __osm_ucast_mgr_process_tbl, p_mgr ); + } /* dump fdb into file: */ if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c index d80f7eb..8e36854 100644 --- a/osm/opensm/osm_ucast_updn.c +++ b/osm/opensm/osm_ucast_updn.c @@ -76,8 +76,9 @@ __updn_get_dir(IN uint8_t cur_rank, IN uint64_t cur_guid, IN uint64_t rem_guid) { - uint32_t i = 0, max_num_guids = osm.p_updn_ucast_routing->updn_ucast_reg_inputs.num_guids; - uint64_t *p_guid = osm.p_updn_ucast_routing->updn_ucast_reg_inputs.guid_list; + updn_t *p_updn = osm.routing_engine.context; + uint32_t i = 0, max_num_guids = p_updn->updn_ucast_reg_inputs.num_guids; + uint64_t *p_guid = p_updn->updn_ucast_reg_inputs.guid_list; boolean_t cur_is_root = FALSE , rem_is_root = FALSE; /* HACK: comes to solve root nodes connection, in a classic subnet root nodes does not connect @@ -540,7 +541,7 @@ updn_init( p_updn->updn_ucast_reg_inputs.guid_list = NULL; p_updn->auto_detect_root_nodes = FALSE; /* Check if updn is activated , then fetch root nodes */ - if (osm.subn.opt.updn_activate) + if (osm.routing_engine.context) { /* Check the source for root node list, if file parse it, otherwise @@ -569,7 +570,7 @@ updn_init( { p_tmp = malloc(sizeof(uint64_t)); *p_tmp = strtoull(line, NULL, 16); - cl_list_insert_tail(osm.p_updn_ucast_routing->p_root_nodes, p_tmp); + cl_list_insert_tail(p_updn->p_root_nodes, p_tmp); } } else @@ -588,8 +589,8 @@ updn_init( "osm_opensm_init: " "UPDN - Root nodes fetching by file %s\n", osm.subn.opt.updn_guid_file); - guid_iterator = cl_list_head(osm.p_updn_ucast_routing->p_root_nodes); - while( guid_iterator != cl_list_end(osm.p_updn_ucast_routing->p_root_nodes) ) + guid_iterator = cl_list_head(p_updn->p_root_nodes); + while( guid_iterator != cl_list_end(p_updn->p_root_nodes) ) { osm_log( &osm.log, OSM_LOG_DEBUG, 
"osm_opensm_init: " @@ -600,7 +601,7 @@ updn_init( } else { - osm.p_updn_ucast_routing->auto_detect_root_nodes = TRUE; + p_updn->auto_detect_root_nodes = TRUE; } /* If auto mode detection reuired - will be executed in main b4 the assignment of UI Ucast */ } @@ -985,33 +986,6 @@ void __osm_updn_convert_list2array(IN up /********************************************************************** **********************************************************************/ -/* Registration function to ucast routing manager (instead of - Min Hop Algorithm) */ -int -osm_updn_reg_calc_min_hop_table( - IN updn_t * p_updn, - IN osm_subn_opt_t* p_opt ) -{ - OSM_LOG_ENTER(&(osm.log), osm_updn_reg_calc_min_hop_table); - /* - If root nodes were supplied by the user - we need to convert into array - otherwise, will be created & converted in callback function activation - */ - if (!p_updn->auto_detect_root_nodes) - { - __osm_updn_convert_list2array(p_updn); - } - osm_log (&(osm.log), OSM_LOG_DEBUG, - "osm_updn_reg_calc_min_hop_table: " - "assigning ucast fdb UI function with updn callback\n"); - p_opt->pfn_ui_ucast_fdb_assign = __osm_updn_call; - p_opt->ui_ucast_fdb_assign_ctx = (void *)p_updn; - OSM_LOG_EXIT(&(osm.log)); - return 0; -} - -/********************************************************************** - **********************************************************************/ /* Find Root nodes automatically by Min Hop Table info */ int osm_updn_find_root_nodes_by_min_hop( OUT updn_t * p_updn ) @@ -1210,3 +1184,30 @@ osm_updn_find_root_nodes_by_min_hop( OUT OSM_LOG_EXIT(&(osm.log)); return 0; } + +/********************************************************************** + **********************************************************************/ + +static void __osm_updn_delete(void *context) +{ + updn_t *p_updn = context; + updn_destroy(p_updn); +} + +int osm_ucast_updn_setup(osm_opensm_t *p_osm) +{ + updn_t *p_updn; + p_updn = updn_construct(); + if (!p_updn) + return -1; + 
p_osm->routing_engine.context = p_updn; + p_osm->routing_engine.delete = __osm_updn_delete; + p_osm->routing_engine.ucast_fdb_assign = __osm_updn_call; + + if (updn_init(p_updn) != IB_SUCCESS) + return -1; + if (!p_updn->auto_detect_root_nodes) + __osm_updn_convert_list2array(p_updn); + + return 0; +} From sashak at voltaire.com Tue Jun 13 16:40:57 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 02:40:57 +0300 Subject: [openib-general] [PATCH 1/4] Simplification of the ucast fdb dumps. In-Reply-To: <448EA993.6010000@mellanox.co.il> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003238.22430.62423.stgit@sashak.voltaire.com> <448EA993.6010000@mellanox.co.il> Message-ID: <20060613234057.GR10482@sashak.voltaire.com> Hi Eitan, On 15:03 Tue 13 Jun , Eitan Zahavi wrote: > Hi Sasha, > > I still need to see if there are no real problematic changes in the osm.fdbs > file syntax (need to update ibdm to support those) but I like the patch and > the clean way you resolved the multiple opens of the dump file. Thanks. Sasha From arlin.r.davis at intel.com Tue Jun 13 17:03:40 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 13 Jun 2006 17:03:40 -0700 Subject: [openib-general] [PATCH] uDAPL cma provider - add missing ia_attributes for the ia_query Message-ID: James, Here are some changes to include some missing IA attributes during a query. 
-arlin Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 7935) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -444,7 +444,10 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HC ia_attr->hardware_version_major = dev_attr.hw_ver; ia_attr->max_eps = dev_attr.max_qp; ia_attr->max_dto_per_ep = dev_attr.max_qp_wr; - ia_attr->max_rdma_read_per_ep = dev_attr.max_qp_rd_atom; + ia_attr->max_rdma_read_per_ep_in = dev_attr.max_qp_rd_atom; + ia_attr->max_rdma_read_per_ep_out = dev_attr.max_qp_rd_atom; + ia_attr->max_rdma_read_per_ep_in_guaranteed = DAT_TRUE; + ia_attr->max_rdma_read_per_ep_out_guaranteed = DAT_TRUE; ia_attr->max_evds = dev_attr.max_cq; ia_attr->max_evd_qlen = dev_attr.max_cqe; ia_attr->max_iov_segments_per_dto = dev_attr.max_sge; @@ -468,10 +471,11 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HC ia_attr->max_eps, ia_attr->max_dto_per_ep, ia_attr->max_evds, ia_attr->max_evd_qlen ); dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d\n", + " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d" + " rd_io %d\n", ia_attr->max_mtu_size, ia_attr->max_rdma_size, ia_attr->max_iov_segments_per_dto, ia_attr->max_lmrs, - ia_attr->max_rmrs ); + ia_attr->max_rmrs, ia_attr->max_rdma_read_per_ep_in ); } if (ep_attr != NULL) { From gjohnson at lanl.gov Tue Jun 13 17:06:10 2006 From: gjohnson at lanl.gov (Greg Johnson) Date: Tue, 13 Jun 2006 18:06:10 -0600 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060613200035.GG10482@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> <20060613200035.GG10482@sashak.voltaire.com> Message-ID: <20060614000610.GJ23320@durango.c3.lanl.gov> On Tue, Jun 13, 2006 at 11:00:35PM +0300, Sasha Khapyorsky wrote: > Hi Greg, > > On 11:02 Tue 
13 Jun , Greg Johnson wrote: > > It seems to load the routes generated by the dump > > script, but afterward it is not possible to dump the routes again. > > This means you have broken LFTs now. Probably I know what is going on > here - new LFTs don't have " 0" entries, and switches are > not accessible by LIDs anymore. > > Please update 'ibroute' utility (diags/) from the trunk and recreate the > dump file - this should fix the problem. > > (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). Ok, that fixed it. It works fine now. Any chance of making our own lid -> guid assignments while we are at it? Thanks, Greg From robert.j.woodruff at intel.com Tue Jun 13 17:09:02 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 13 Jun 2006 17:09:02 -0700 Subject: [openib-general] OFED 1.0 release schedule Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F738B1@orsmsx408> Tziporet wrote, >We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > One other thing I noticed is that you do not enable MSI interrupt mode by default. You will get lower performance if you do not enable MSI. I think you can set it when you load the driver with a modprobe parameter. woody From weiny2 at llnl.gov Tue Jun 13 17:11:47 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 13 Jun 2006 17:11:47 -0700 Subject: [openib-general] MPI error when using a "system" call in mpi job. Message-ID: <20060613171147.35787125.weiny2@llnl.gov> A co-worker here was seeing the following MPI error from his job: [1] Abort: [ldev2:1] Got completion with error, code=1 at line 2148 in file viacheck.c After some tracking down he found that apparently if he used a "system" call [int system(const char *string)] the next MPI command will fail. I have been able to reproduce this with the attached simple "hello" program. Perhaps someone has seen this type of error? 
Here is the output from 2 runs: weiny2 at ldev0:~/ior-test 17:04:04 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello x ldev1 [0] Abort: [ldev1:0] Got completion with error, code=1 at line 2148 in file viacheck.c ldev2 mpirun_rsh: Abort signaled from [0] done. weiny2 at ldev0:~/ior-test 17:05:23 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello now = 0.000000 now = 0.000052 now = 0.000094 now = 0.000121 now = 0.000151 now = 0.001072 now = 0.001102 now = 0.001118 now = 0.001141 now = 0.001160 done. We are running mvapich 0.9.7 and the openib trunk rev 6829. Thanks, Ira -------------- next part -------------- A non-text attachment was scrubbed... Name: hello.c Type: application/octet-stream Size: 2784 bytes Desc: not available URL: From sean.hefty at intel.com Tue Jun 13 18:46:09 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Jun 2006 18:46:09 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150237590.570.164947.camel@hal.voltaire.com> Message-ID: <000001c68f54$4f7350a0$5fc8180a@amr.corp.intel.com> >Is the only downside of a larger timeout that potentially more memory >accumulates (until the timeout occurs) before it is freed ? This is the only one that I can think of. Can anyone think of others? - Sean From betsy at pathscale.com Tue Jun 13 20:09:45 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Tue, 13 Jun 2006 20:09:45 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F738B1@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F738B1@orsmsx408> Message-ID: <1150254585.3425.6.camel@sarium.pathscale.com> Woody - you are absolutely correct for ipath - you definitely want MSI interrupts enabled. We (QLogic) need to submit this information for inclusion in the OFED 1.0 release notes. 
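[For mthca specifically, the modinfo listing later in this digest shows the knobs: "parm: msi" and "parm: msi_x". Assuming those parameters (check modinfo ib_mthca on your own build), enabling MSI-X at module load time looks roughly like this:]

```shell
# One-off: load the driver asking for MSI-X (msi=1 would request plain MSI)
modprobe ib_mthca msi_x=1

# Persistent: have modprobe pass the option on every load
echo "options ib_mthca msi_x=1" >> /etc/modprobe.conf
```

[This sketches the mechanism only; whether MSI/MSI-X should be the default is the policy question being discussed in this thread.]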
Thanks, Betsy On Tue, 2006-06-13 at 17:09 -0700, Woodruff, Robert J wrote: > Tziporet wrote, > >We upload OFED-1.0-pre1.tgz to > > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > > > > One other thing I noticed is that you do not enable MSI interrupt > mode by default. You will get lower performance if you do not > enable MSI. I think you can set it when you load the driver with a > modprobe parameter. > > woody From tziporet at mellanox.co.il Tue Jun 13 22:48:17 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 14 Jun 2006 08:48:17 +0300 Subject: [openib-general] MSI enabled (was OFED 1.0 release schedule) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71FC@mtlexch01.mtl.com> Since this is the case in the git tree too we have not changed it. Most our QA so far run in this way so I don't want to change the default now. I will add this option in mthca release notes. Tziporet -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Wednesday, June 14, 2006 3:09 AM To: Woodruff, Robert J; Betsy Zeller; Tziporet Koren; Davis, Arlin R Cc: Matt L. Leininger; OpenFabricsEWG; openib; Matters, Todd Subject: RE: [openib-general] OFED 1.0 release schedule Tziporet wrote, >We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > One other thing I noticed is that you do not enable MSI interrupt mode by default. You will get lower performance if you do not enable MSI. I think you can set it when you load the driver with a modprobe parameter. woody From eitan at mellanox.co.il Tue Jun 13 23:32:30 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 14 Jun 2006 09:32:30 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884C@mtlexch01.mtl.com> Hi Sasha, OpenSM header files were used for generating documents using RoboDoc which was slightly modified by Intel. 
I found it very useful when I was learning the code. I attach the robodoc sources and my scripts for generating the doc for all headers in a dir. EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Wednesday, June 14, 2006 2:21 AM > To: Eitan Zahavi > Cc: Hal Rosenstock; openib-general at openib.org; Greg Johnson; michael k lang; Yael > Kalka; Ofer Gigi > Subject: Re: [PATCH 2/4] Modular routing engine (unicast only yet). > > Hi Eitan, > > On 14:55 Tue 13 Jun , Eitan Zahavi wrote: > > > > As provided in my previous patch 1/4 comments > > I think the callbacks should also have an entry for the MinHop stage (maybe > > this is the ucast_build_fwd_tables?) I have some algorithms in mind that > > will > > skip that stage all-together. > > We may add new callback when it will be useful. > > > Also it might make sense for each routing engine to provide its own "dump" > > routine such that each could support difference file format if needed. > > Why we may want dump format per routing engine? Even if we are, you may > put it into routing engine specific code. 
> > > > > Rest of the comments are inline > > > > EZ > > > > Sasha Khapyorsky wrote: > > > > > >diff --git a/osm/include/opensm/osm_opensm.h > > >b/osm/include/opensm/osm_opensm.h > > >index 3235ad4..3e6e120 100644 > > >--- a/osm/include/opensm/osm_opensm.h > > >+++ b/osm/include/opensm/osm_opensm.h > > >@@ -92,6 +92,18 @@ BEGIN_C_DECLS > > > * > > > *********/ > > > > > >+/* > > >+ * routing engine structure - yet limited by ucast_fdb_assign and > > >+ * ucast_build_fwd_tables (multicast callbacks may be added later) > > >+ */ > > >+struct osm_routing_engine { > > >+ const char *name; > > >+ void *context; > > >+ int (*ucast_build_fwd_tables)(void *context); > > >+ int (*ucast_fdb_assign)(void *context); > > >+ void (*delete)(void *context); > > >+}; > > It would be nice if you added a standard header to this struct. > > It is not clear to me what ucast_build_fwd_tables and > > ucast_fdb_assign are mapping to. > > Ok, will add. > > BTW, seems OpenSM declarations were used for generation manuals or other > docs. Do you know are those > > /****h* > /****s* > /****f* > > in use anymore? And with what is the tool? > > > Please see the next section as an example for a struct header. > > >+ > > > /****s* OpenSM: OpenSM/osm_opensm_t > > > * NAME > > > * osm_opensm_t > > > >@@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( > > > i > > > ); > > > > > >+ if (p_routing_eng->ucast_build_fwd_tables && > > >+ p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == > > >0) > > >+ { > > >+ cl_qmap_apply_func( p_sw_guid_tbl, > > >+ __osm_ucast_mgr_set_table_cb, p_mgr ); > > >+ } /* fallback on the regular path in case of failures */ > > >+ else > > >+ { > > Please explain why this step is needed and why if the routing engine > > function is > > returning 0 you still invoke the standard __osm_ucast_mgr_set_table_cb. > > ->ucast_build_fwd_tables() creates fwd tables and > __osm_ucast_mgr_set_table_cb() upload them on the switches. 
In case of > ->ucast_build_fwd_tables() fatal failure (when return status is != 0), > tables uploading will be skipped and flow will continue with default > routing code. > > Thanks for the comments. > Sasha -------------- next part -------------- A non-text attachment was scrubbed... Name: robodoc-3.2.3.tar.gz Type: application/x-gzip Size: 112042 bytes Desc: robodoc-3.2.3.tar.gz URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: roboDocDir Type: application/octet-stream Size: 1322 bytes Desc: roboDocDir URL: From eitan at mellanox.co.il Tue Jun 13 23:48:15 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 14 Jun 2006 09:48:15 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884D@mtlexch01.mtl.com> Hi Hal, Sasha, Regarding OpenSM coding style: Sasha wrote: > > Really? Don't want to bother with examples, but I may see almost any > "combination" in OpenSM and it is not clear for me which one is common > (the coding style and indentation are different even from file to file). [EZ] This bothers me as I think we should use a consistent coding style. You might also remember we had put in place both a script to do automatic indentation and coding style rule fixes (osm_indent and osm_check_n_fix). I did check for all "else" statements: osm/opensm>grep else *.c | wc -l 397 osm/opensm>grep else *.c | grep -v "{" | grep -v "}" | wc -l 361 So you can see only <10% (36 out of 397) "else" statements are not coding-style consistent.
Checking what code is "non standard": osm/opensm>grep else *.c | grep "{" | awk '{print $1}' | sort | uniq -c | sort -rn 7 osm_console.c: 6 osm_prtn_config.c: 3 st.c: 3 osm_sa_multipath_record.c: 2 osm_ucast_mgr.c: 2 osm_sa_path_record.c: 1 osm_sa_mcmember_record.c: 1 osm_sa_informinfo.c: 1 osm_sa_class_port_info.c: 1 osm_multicast.c: You can see the majority of these mismatches are in code introduced by Hal and yourself. I think OpenSM should use a single coding style. My proposal is that we update our osm_indent script with a set of rules we agree on and apply it to the entire tree. From jackm at mellanox.co.il Wed Jun 14 00:55:42 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 14 Jun 2006 10:55:42 +0300 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <448F4855.7060204@ichips.intel.com> References: <1AC79F16F5C5284499BB9591B33D6F0007F7377F@orsmsx408> <448F4855.7060204@ichips.intel.com> Message-ID: <200606141055.42449.jackm@mellanox.co.il> On Wednesday 14 June 2006 02:20, Arlin Davis wrote: > Woodruff, Robert J wrote: > >>We upload OFED-1.0-pre1.tgz to > >>https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > > > >I tried the new tar ball and the pathscale driver now > >compiles (on Redhat EL4 - U3) and IPoIB and OpenSM appear to work OK, > >but Intel MPI/uDAPL and NetPipe/uDAPL are broken. It appears to > >be a problem with rdma operations. I also tried SDP/pathscale and > >it does not work either. > >Finally, the rdma_cm is missing the changes that match the uDAPL fix > >that > >was put in for the new setops for the CM timeouts. > >Arlin will provide specifics. We'd really like the rdma_cm fix in the > >release.
> > Here is a pointer to Sean's email/patches with the details: > > http://openib.org/pipermail/openib-general/2006-June/022654.html > http://openib.org/pipermail/openib-general/2006-June/022655.html > > -arlin > As I posted to ipoib-general on June 7 ( http://openib.org/pipermail/openib-general/2006-June/022725.html ) the new setops for CM timeouts will not be available in OFED 1.0 , so please don't try to use them as yet. We tested out IntelMPI over uDapl (from OFED 1.0-pre1) using the PALLAS test suite, and it worked fine -- no problems. Evidently, you are trying to use these new (and absent/unsupported) features. We do appreciate that these features are very important for scalability, and we plan to include them in the 1.1 release which will follow shortly. From greg.lindahl at qlogic.com Wed Jun 14 01:36:42 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Wed, 14 Jun 2006 01:36:42 -0700 Subject: [openib-general] MPI error when using a "system" call in mpi job. In-Reply-To: <20060613171147.35787125.weiny2@llnl.gov> References: <20060613171147.35787125.weiny2@llnl.gov> Message-ID: <20060614083642.GG2741@greglaptop.hsd1.ca.comcast.net> On Tue, Jun 13, 2006 at 05:11:47PM -0700, Ira Weiny wrote: > After some tracking down he found that apparently if he used a "system" call > [int system(const char *string)] the next MPI command will fail. Are you sure MVAPICH supports fork()? It is not unusual for MPI implementations to not support fork(). system() uses fork(). -- greg From mst at mellanox.co.il Wed Jun 14 01:40:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 11:40:41 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> Message-ID: <20060614084041.GA19518@mellanox.co.il> Quoting r. 
Sean Hefty : > One of the ideas then, is for the kernel umad module to learn which MADs > generate responses. I thought about this a bit, this seems to add even more state for MAD processing engine, which sounds like a wrong approach: Would keeping around MADs in the done list consume significant extra memory resources? What limits this memory? Would a small client that would normally just send RMPP, get a response and exit will be slowed down significantly while the agent learns? Would a buggy application confuse the umad module, corrupting MAD processing for all other applications? The original approach by Jack of detecting, and dropping, duplicate responses instead of duplicate requests seemed much easier to me. The only disadvantage it has that I'm aware of is a slight performance hit for duplicate processing of each request. But all the done_list scans proposed seem even more CPU intensive. Can we discuss that approach once again please? The patch is here: https://openib.org/svn/trunk/contrib/mellanox/patches/mad_rmpp_requester_retry.patch -- MST From mst at mellanox.co.il Wed Jun 14 01:49:20 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 14 Jun 2006 11:49:20 +0300 Subject: [openib-general] oops on trunk Message-ID: <20060614084920.GC19518@mellanox.co.il> Here's another oops while unloading modules: Unable to handle kernel paging request at ffffffff8803d8ad RIP: [] PGD 103027 PUD 105027 PMD 11f99f067 PTE 0 Oops: 0010 [1] SMP CPU 1 Modules linked in: ib_mthca ib_umad ib_sa ib_mad ib_core Pid: 12364, comm: modprobe Not tainted 2.6.16 #1 RIP: 0010:[] [] RSP: 0000:ffff810118835d80 EFLAGS: 00010246 RAX: 0000000000000005 RBX: ffff810118835e10 RCX: ffffffff8801996e RDX: ffff81011fe11c00 RSI: 0000000000000000 RDI: 00000000fffffffc RBP: ffff81010fee5c90 R08: ffff8101199d9d00 R09: 0000000000000000 R10: ffff81011fc227c8 R11: ffff81011c70e8c0 R12: 00000000fffffffc R13: 0000000000000000 R14: 0000000000000080 R15: 0000000000000000 FS: 00002b68af872b00(0000) GS:ffff81011fc74bc0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffffffff8803d8ad CR3: 00000001094a4000 CR4: 00000000000006e0 Process modprobe (pid: 12364, threadinfo ffff810118834000, task ffff81011fe56f00) Stack: ffffffff880199ae ffff810109461040 ffff81011fe57120 0000000000000008 ffff810118835e78 ffffffff8049f880 ffff810118835e78 0000000000000080 0000000000000000 ffff810118835e10 Call Trace: {:ib_sa:ib_sa_mcmember_rec_callback+64} {:ib_sa:send_handler+74} {:ib_mad:ib_unregister_mad_agent+366} {cond_resched+76} {:ib_sa:ib_sa_remove_one+71} {:ib_core:ib_unregister_client+64} {:ib_sa:ib_sa_cleanup+13} {sys_delete_module+481} {__up_write+20} {sys_munmap+80} {system_call+126} Code: Bad RIP value. RIP [] RSP CR2: ffffffff8803d8ad -- MST From glebn at voltaire.com Wed Jun 14 02:29:28 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Wed, 14 Jun 2006 12:29:28 +0300 Subject: [openib-general] MPI error when using a "system" call in mpi job. 
In-Reply-To: <20060614083642.GG2741@greglaptop.hsd1.ca.comcast.net> References: <20060613171147.35787125.weiny2@llnl.gov> <20060614083642.GG2741@greglaptop.hsd1.ca.comcast.net> Message-ID: <20060614092928.GB17758@minantech.com> On Wed, Jun 14, 2006 at 01:36:42AM -0700, Greg Lindahl wrote: > On Tue, Jun 13, 2006 at 05:11:47PM -0700, Ira Weiny wrote: > > > After some tracking down he found that apparently if he used a "system" call > > [int system(const char *string)] the next MPI command will fail. > > Are you sure MVAPICH supports fork()? It is not unusual for MPI > implementations to not support fork(). system() uses fork(). > On kernel 2.6.12 or newer system() should works OK (in non threaded application). -- Gleb. From mst at mellanox.co.il Wed Jun 14 02:47:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 12:47:16 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884C@mtlexch01.mtl.com> Message-ID: <20060614094716.GE19518@mellanox.co.il> Quoting r. Eitan Zahavi : > Subject: Re: [PATCH 2/4] Modular routing engine (unicast only yet). > > Hi Sasha, > > OpenSM header files were used for generating documents using RoboDoc > which was slightly modified by Intel. I found it very useful when I was > learning the code. > > I attach the robodoc sources and my scripts for generating the doc for > all headers in a dir. > > EZ Put it all in svn somewhere? https://openib.org/svn/gen2/trunk/build/ -- MST From halr at voltaire.com Wed Jun 14 04:03:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 07:03:06 -0400 Subject: [openib-general] [PATCH 2/4 v2] Modular routing engine (unicast only yet). 
In-Reply-To: <20060613233136.GA12137@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003240.22430.88414.stgit@sashak.voltaire.com> <20060613233136.GA12137@sashak.voltaire.com> Message-ID: <1150282969.570.191546.camel@hal.voltaire.com> On Tue, 2006-06-13 at 19:31, Sasha Khapyorsky wrote: > Hi, > > The same patch, but with comment addition about osm_routing_engine > structure. > > Sasha. > > > This patch introduces routing_engine structure which may be used for > "plugging" new routing module. Currently only unicast callbacks are > supported (multicast can be added later). And existing routing module > is up-down 'updn', may be activated with '-R updn' option (instead of > old '-u'). General usage is: > > $ opensm -R 'module-name' > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (with some cosmetic changes). -- Hal From ogerlitz at voltaire.com Wed Jun 14 06:04:16 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 14 Jun 2006 16:04:16 +0300 (IDT) Subject: [openib-general] ib_mthca not loaded by pci hotplug on SLES10 RC2 system Message-ID: I have a SLES10 RC2 system whose infiniband drivers are the ones provided by the distro (ie not replaced by OFED). I have noticed that ib_mthca is not loaded when the system comes up, however it is loaded fine if i do it manually, and ping -f over ipoib works fine so the system is very much operative. Below are the output of modinfo and lspci and attached is /proc/config.gz Doing a diff on drivers/infiniband and include/rdma with 2.6.16 they are exactly the same as those of 2.6.16.16-1.6-smp (the sles10 kernel), the HCA FW is 4.6.2 Anyone has an idea what might be the issue? Or. 
rosemary:/usr/src # uname -a Linux rosemary 2.6.16.16-1.6-smp #1 SMP Mon May 22 14:37:02 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux rosemary:/usr/src # modinfo ib_mthca filename: /lib/modules/2.6.16.16-1.6-smp/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko author: Roland Dreier description: Mellanox InfiniBand HCA low-level driver license: Dual BSD/GPL version: 0.07 vermagic: 2.6.16.16-1.6-smp SMP gcc-4.1 depends: ib_mad,ib_core alias: pci:v000015B3d00005A44sv*sd*bc*sc*i* alias: pci:v00001867d00005A44sv*sd*bc*sc*i* alias: pci:v000015B3d00006278sv*sd*bc*sc*i* alias: pci:v00001867d00006278sv*sd*bc*sc*i* alias: pci:v000015B3d00006282sv*sd*bc*sc*i* alias: pci:v00001867d00006282sv*sd*bc*sc*i* alias: pci:v000015B3d00006274sv*sd*bc*sc*i* alias: pci:v00001867d00006274sv*sd*bc*sc*i* alias: pci:v000015B3d00005E8Csv*sd*bc*sc*i* alias: pci:v00001867d00005E8Csv*sd*bc*sc*i* srcversion: 8494F031EF8F0C77769CB89 parm: msi:attempt to use MSI if nonzero (int) parm: msi_x:attempt to use MSI-X if nonzero (int) rosemary:/usr/src # lspci 00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c) 00:00.1 Class ff00: Intel Corporation E7525/E7520 Error Reporting Registers (rev 0c) 00:01.0 System peripheral: Intel Corporation E7520 DMA Controller (rev 0c) 00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0c) 00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 0c) 00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) 00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02) 00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02) 00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02) 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2) 00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02) 00:1f.1 IDE interface: Intel 
Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02) 00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02) 00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02) 01:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09) 01:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09) 03:04.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03) 03:04.1 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03) 04:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0) 05:0c.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27) -------------- next part -------------- A non-text attachment was scrubbed... Name: config.gz Type: application/x-gzip Size: 15590 bytes Desc: URL: From surs at cse.ohio-state.edu Wed Jun 14 06:04:48 2006 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 14 Jun 2006 09:04:48 -0400 Subject: [openib-general] MPI error when using a "system" call in mpi job. In-Reply-To: <20060613171147.35787125.weiny2@llnl.gov> References: <20060613171147.35787125.weiny2@llnl.gov> Message-ID: <44900970.9050006@cse.ohio-state.edu> Hello Ira, I am running the program on 2.6.15 (EM64T machine) and 2.6.16 (IA32 machine). The program seems to be running fine. Can you tell us which kernel you are using? We are using drivers pulled out of the trunk about 3-4 weeks back. Thanks, Sayantan. Ira Weiny wrote: >A co-worker here was seeing the following MPI error from his job: > >[1] Abort: [ldev2:1] Got completion with error, code=1 > at line 2148 in file viacheck.c > >After some tracking down he found that apparently if he used a "system" call >[int system(const char *string)] the next MPI command will fail. > >I have been able to reproduce this with the attached simple "hello" program. > >Perhaps someone has seen this type of error? 
Here is the output from 2 runs: > >weiny2 at ldev0:~/ior-test >17:04:04 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello x >ldev1 >[0] Abort: [ldev1:0] Got completion with error, code=1 > at line 2148 in file viacheck.c >ldev2 >mpirun_rsh: Abort signaled from [0] >done. >weiny2 at ldev0:~/ior-test >17:05:23 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello >now = 0.000000 >now = 0.000052 >now = 0.000094 >now = 0.000121 >now = 0.000151 >now = 0.001072 >now = 0.001102 >now = 0.001118 >now = 0.001141 >now = 0.001160 >done. > >We are running mvapich 0.9.7 and the openib trunk rev 6829. > >Thanks, >Ira > > > >------------------------------------------------------------------------ > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- http://www.cse.ohio-state.edu/~surs From mst at mellanox.co.il Wed Jun 14 06:17:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 16:17:55 +0300 Subject: [openib-general] ib_mthca not loaded by pci hotplug on SLES10 RC2 system In-Reply-To: References: Message-ID: <20060614131755.GA25417@mellanox.co.il> Quoting r. Or Gerlitz : > Subject: ib_mthca not loaded by pci hotplug on SLES10 RC2 system > > I have a SLES10 RC2 system whose infiniband drivers are the > ones provided by the distro (ie not replaced by OFED). > > I have noticed that ib_mthca is not loaded when the system comes up, > however it is loaded fine if i do it manually, and ping -f over ipoib > works fine so the system is very much operative. Generally you need to look at scripts under /etc/hotplug to figure out. -- MST From trimmer at silverstorm.com Wed Jun 14 06:24:10 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Wed, 14 Jun 2006 09:24:10 -0400 Subject: [openib-general] MPI error when using a "system" call in mpi job. 
Message-ID: > -----Original Message----- > From: Ira Weiny > Sent: Tuesday, June 13, 2006 8:12 PM > A co-worker here was seeing the following MPI error from his job: > > [1] Abort: [ldev2:1] Got completion with error, code=1 > at line 2148 in file viacheck.c > > After some tracking down he found that apparently if he used a "system" > call > [int system(const char *string)] the next MPI command will fail. > > I have been able to reproduce this with the attached simple "hello" > program. I have seen this type of problem a couple years ago with our proprietary stack and it took a bit of work to correct it. Here is what it could be: This sounds like a conflict between with fork() and the Vma handling in Open IB for registered memory. system() is a fork(), exec(), wait() sequence. fork generally shares the VMAs and marks the pages as copy on write. In your case it sounds like one of the pages written by the child process includes memory previously registered by the main process, and the child ended up with the original page. The result is that the virtual address in the main process is now pointing to the wrong physical page. It sounds like you happened on a "magic sequence" which demonstrates the problem. Do you have information on the OS version, CPU type, and server config? Todd Rimmer From halr at voltaire.com Wed Jun 14 06:24:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 09:24:23 -0400 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. In-Reply-To: <20060611003243.22430.56582.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003243.22430.56582.stgit@sashak.voltaire.com> Message-ID: <1150291429.570.196696.camel@hal.voltaire.com> On Sat, 2006-06-10 at 20:32, Sasha Khapyorsky wrote: > This patch implements trivial routing module which able to load LFT > tables from dump file. 
Main features: > - support for unicast LFTs only, support for multicast can be added later > - this will run after min hop matrix calculation > - this will load switch LFTs according to the path entries introduced in > the dump file > - no additional checks will be performed (like is port connected, etc) > - in case when fabric LIDs were changed this will try to reconstruct LFTs > correctly if endport GUIDs are represented in the dump file (in order > to disable this GUIDs may be removed from the dump file or zeroed) > > The dump file format is compatible with output of 'ibroute' util and for > whole fabric may be generated with script like this: > > for sw_lid in `ibswitches | awk '{print $NF}'` ; do > ibroute $sw_lid > done > /path/to/dump_file > > , or using DR paths: > > > for sw_dr in `ibnetdiscover -v \ > | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ > | sed -e 's/\]\[/,/g' \ > | sort -u` ; do > ibroute -D ${sw_dr} > done > /path/to/dump_file > > > In order to activate new module use: > > opensm -R file -U /path/to/dump_file > > Signed-off-by: Sasha Khapyorsky Thanks! Applied with some cosmetic changes. 
-- Hal From halr at voltaire.com Wed Jun 14 06:39:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 09:39:04 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_ucast_file: Eliminate compiler warning Message-ID: <1150292338.570.197234.camel@hal.voltaire.com> OpenSM/osm_ucast_file: Eliminate compiler warning osm_ucast_file.c: In function `do_ucast_file_load': osm_ucast_file.c:156: warning: passing arg 2 of `cl_qmap_apply_func' from incompatible pointer type Signed-off-by: Hal Rosenstock Index: opensm/osm_ucast_file.c =================================================================== --- opensm/osm_ucast_file.c (revision 8000) +++ opensm/osm_ucast_file.c (working copy) @@ -114,9 +114,9 @@ static void add_path(osm_opensm_t * p_os (osm_switch_get_node_ptr(p_sw)))); } -static void clean_sw_fwd_table(void *arg, void *context) +static void clean_sw_fwd_table(cl_map_item_t* const p_map_item, void *context) { - osm_switch_t *p_sw = arg; + osm_switch_t * const p_sw = (osm_switch_t *)p_map_item; uint16_t lid, max_lid; max_lid = osm_switch_get_max_lid_ho(p_sw); From ogerlitz at voltaire.com Wed Jun 14 06:48:09 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 14 Jun 2006 16:48:09 +0300 Subject: [openib-general] ib_mthca not loaded by pci hotplug on SLES10 RC2 system In-Reply-To: <20060614131755.GA25417@mellanox.co.il> References: <20060614131755.GA25417@mellanox.co.il> Message-ID: <44901399.7040408@voltaire.com> Michael S. Tsirkin wrote: > Quoting r. Or Gerlitz : >> Subject: ib_mthca not loaded by pci hotplug on SLES10 RC2 system >> >> I have a SLES10 RC2 system whose infiniband drivers are the >> ones provided by the distro (ie not replaced by OFED). >> >> I have noticed that ib_mthca is not loaded when the system comes up, >> however it is loaded fine if i do it manually, and ping -f over ipoib >> works fine so the system is very much operative. > > Generally you need to look at scripts under /etc/hotplug to figure out. OK, thanks... 
it turns out no hotplug package was installed, nor is anything related to hotplug found by the yast2 lookup, so i have installed hotplug-0.44-32.46 which is working for me on a sles9 system running kernel.org 2.6.16, yet the driver is not loaded on boot but does load manually (and works fine). Do i need to set up something other than installing the package & reboot? Or. From trimmer at silverstorm.com Wed Jun 14 06:54:38 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Wed, 14 Jun 2006 09:54:38 -0400 Subject: [openib-general] Maintainers List Message-ID: Is there a convenient list of the maintainers for all the various OFED components? Thanks, Todd Rimmer From mst at mellanox.co.il Wed Jun 14 07:38:59 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 17:38:59 +0300 Subject: [openib-general] ib_mthca not loaded by pci hotplug on SLES10 RC2 system In-Reply-To: <44901E8E.20503@voltaire.com> References: <44901E8E.20503@voltaire.com> Message-ID: <20060614143859.GE25417@mellanox.co.il> Quoting r. Or Gerlitz : > OK, thanks, well the content related to mthca of modules.pcimap on the > sles10 system is the same as in the sles9 system, see below, and still > the sles10 does not load the module. Fine, now all you need is to have a script run on hotplug, read this, and load the modules. -- MST From robert.j.woodruff at intel.com Wed Jun 14 08:37:57 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 14 Jun 2006 08:37:57 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <200606141055.42449.jackm@mellanox.co.il> Message-ID: <000001c68fc8$849646b0$50a9070a@amr.corp.intel.com> Jack Morgenstein wrote, >We tested out IntelMPI over uDapl (from OFED 1.0-pre1) using the PALLAS test >suite, and it worked fine -- no problems. Evidently, you are trying to use >these new (and absent/unsupported) features.
>We do appreciate that these features are very important for scalability, and >we plan to include them in the 1.1 release which will follow shortly. These new options are needed to allow Intel MPI to scale up to larger clusters, 128+. If you did not run on a large cluster you would not have seen the problems. >As I posted to ipoib-general on June 7 >( http://openib.org/pipermail/openib-general/2006-June/022725.html ) Unfortunately, due to a problem with our email server, I was not receiving openib-general emails for the last week and missed this thread or would have spoken up then. Anyway, What is the criteria/decision process for deciding that something will or will not be included. I think that we have an equal say as to what should go in and what should not. Seems like there are double standards here. You are including last minute fixes for things like the Pathscale driver after RC6, but will not allow a fix that is needed by our product. woody From jackm at mellanox.co.il Wed Jun 14 09:04:22 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 14 Jun 2006 19:04:22 +0300 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <000001c68fc8$849646b0$50a9070a@amr.corp.intel.com> References: <000001c68fc8$849646b0$50a9070a@amr.corp.intel.com> Message-ID: <200606141904.22321.jackm@mellanox.co.il> On Wednesday 14 June 2006 18:37, Bob Woodruff wrote: > Unfortunately, due to a problem with our email server, I was not receiving > openib-general emails for the last week and missed this thread or would > have spoken up then. > > Anyway, What is the criteria/decision process for deciding that something > will > or will not be included. I think that we have an equal say as to what > should go in and what should not. Seems like there are double standards > here. You are including last minute fixes for things like the Pathscale > driver after RC6, but will not allow a fix that is needed by our product. > The Pathscale fixes affect ONLY Pathscale users. 
Unfortunately, the changes you are requesting affect ALL the ulp's -- IPoIB, SDP, iSer,... , and NOT just your product. These changes would mean activating the ib_local_sa module (which has NOT been QA'd under OFED). There was a long thread quite a while ago on this topic, starting May 4: see ( http://openib.org/pipermail/openib-general/2006-May/020977.html ) We decided then not to include the local_sa module, and heard no objections. Since the change you request (which is a kernel-level change) affects many products, not just IntelMPI, it is not possible to just include it at the last minute and hope for the best (due to lack of QA). - Jack From swise at opengridcomputing.com Wed Jun 14 09:11:08 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 Jun 2006 11:11:08 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <1150235196.17394.91.camel@stevo-desktop> References: <000001c68f31$78910fe0$24268686@amr.corp.intel.com> <1150235196.17394.91.camel@stevo-desktop> Message-ID: <1150301468.28999.22.camel@stevo-desktop> On Tue, 2006-06-13 at 16:46 -0500, Steve Wise wrote: > On Tue, 2006-06-13 at 14:36 -0700, Sean Hefty wrote: > > >> Er...no. It will lose this event. Depending on the event...the carnage > > >> varies. We'll take a look at this. > > >> > > > > > >This behavior is consistent with the Infiniband CM (see > > >drivers/infiniband/core/cm.c function cm_recv_handler()). But I think > > >we should at least log an error because a lost event will usually stall > > >the rdma connection. > > > > I believe that there's a difference here. For the Infiniband CM, an allocation > > error behaves the same as if the received MAD were lost or dropped. Since MADs > > are unreliable anyway, it's not so much that an IB CM event gets lost, as it > > doesn't ever occur. A remote CM should retry the send, which hopefully allows > > the connection to make forward progress. > > > > hmm. Ok. I see. 
I misunderstood the code in cm_recv_handler(). > > Tom and I have been talking about what we can do to not drop the event. > Stay tuned. Here's a simple solution that solves the problem: For any given cm_id, there are a finite (and small) number of outstanding CM events that can be posted. So we just pre-allocate them when the cm_id is created and keep them on a free list hanging off of the cm_id struct. Then the event handler function will pull from this free list. The only case where there is any non-finite issue is on the passive listening cm_id. Each incoming connection request will consume a work struct. So based on client connects, we could run out of work structs. However, the CMA has the concept of a backlog, which is defined as the max number of pending unaccepted connection requests. So we allocate these work structs based on that number (or a computation based on that number), and if we run out, we simply drop the incoming connection request due to backlog overflow (I suggest we log the drop event too). When a MPA connection request is dropped, the (IETF conforming) MPA client will eventually time out the connection and the consumer can retry. Comments? From robert.j.woodruff at intel.com Wed Jun 14 09:12:40 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 14 Jun 2006 09:12:40 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <200606141904.22321.jackm@mellanox.co.il> Message-ID: <000101c68fcd$5c2c69c0$50a9070a@amr.corp.intel.com> Jack wrote, >Since the change you request (which is a kernel-level change) affects many >products, not just IntelMPI, it is not possible to just include it at the >last minute and hope for the best (due to lack of QA). >- Jack At this point if the 1.0-pre1 tar ball goes gold this Friday as scheduled, then I guess we will have to live with what we have and people running larger clusters will need to use the trunk until OFED 1.1. 
If, however, it is decided that there needs to be another RC to fix the other problems with things like pathscale not working with SDP or uDAPL, then I think we should allow the setops fixes in also, but I understand that would mean going through another QA cycle. my 2 cents. woody From jlentini at netapp.com Wed Jun 14 09:17:15 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 14 Jun 2006 12:17:15 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL cma provider - add missing ia_attributes for the ia_query In-Reply-To: References: Message-ID: On Tue, 13 Jun 2006, Arlin Davis wrote: > > James, > > Here are some changes to include some missing IA attributes during a > query. Looks good. Committed in revision 8008. From betsy at pathscale.com Wed Jun 14 09:20:54 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Wed, 14 Jun 2006 09:20:54 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <000101c68fcd$5c2c69c0$50a9070a@amr.corp.intel.com> References: <000101c68fcd$5c2c69c0$50a9070a@amr.corp.intel.com> Message-ID: <1150302054.3425.66.camel@sarium.pathscale.com> On Wed, 2006-06-14 at 09:12 -0700, Bob Woodruff wrote: > with things like pathscale not working with SDP or uDAPL, then I think Woody - For us, SDP ran just fine on InfiniPath on the RHEL4 tests we ran yesterday with the OFED pre-release candidate. Can you send me the output you got when you tried it? Thanks, Betsy -- Betsy Zeller Director of Software Engineering QLogic Corporation System Interconnect Group (formerly PathScale, Inc) 2071 Stierlin Court, Suite 200 Mountain View, CA, 94043 1-650-934-8088 From jlentini at netapp.com Wed Jun 14 09:34:24 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 14 Jun 2006 12:34:24 -0400 (EDT) Subject: [openib-general] communication established affiliated asynchronous event Message-ID: The IBTA spec (volume 1, version 1.2) describes a communication established affiliated asynchronous event.
Is this event supposed to be delivered to the verbs consumer or the IB CM? We've seen this event delivered to our NFS-RDMA server and aren't sure what to do with it. james From jlentini at netapp.com Wed Jun 14 09:39:06 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 14 Jun 2006 12:39:06 -0400 (EDT) Subject: [openib-general] communication established affiliated asynchronous event In-Reply-To: References: Message-ID: On Wed, 14 Jun 2006, James Lentini wrote: > > The IBTA spec (volume 1, version 1.2) describes a communication > established affiliated asynchronous event. The description is on page 637. > Is this event supposed to be delivered to the verbs consumer or the IB > CM? > > We've seen this event delivered to our NFS-RDMA server and aren't sure > what to do with it. > > james From mshefty at ichips.intel.com Wed Jun 14 09:39:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 09:39:34 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614084041.GA19518@mellanox.co.il> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614084041.GA19518@mellanox.co.il> Message-ID: <44903BC6.8020302@ichips.intel.com> Michael S. Tsirkin wrote: > Would keeping around MADs in the done list consume significant extra memory > resources? For kernel clients, it shouldn't consume any additional memory. For userspace clients, it would continue to consume memory until a response were generated. Currently, that memory is freed once the userspace application retrieves the MAD from the kernel. > What limits this memory? That's part of the discussion. Today, there is NO limit on how much memory a userspace application can consume. It will continue to consume memory as long as it doesn't call to receive a MAD. > Would a small client that would normally > just send RMPP, get a response and exit will be slowed down significantly while > the agent learns? Clients that send requests are unaffected.
Clients that use one of the pre-defined classes or known methods would also be unaffected. The learning only affects new methods, and would typically be limited to receiving one MAD for each method. > Would a buggy application confuse the umad module, corrupting what > the agent learns? Would a buggy application confuse the umad module, corrupting > MAD processing for all other applications? A buggy application would only affect itself, plus whoever it was trying to communicate with. We can't really fix the latter though. > The original approach by Jack of detecting, and dropping, duplicate responses > instead of duplicate requests seemed much easier to me. The only disadvantage > it has that I'm aware of is a slight performance hit for duplicate processing of > each request. But all the done_list scans proposed seem even more CPU intensive. Jack's approach results in scanning a list, plus has the overhead of duplicating the processing. The other problem is that DS RMPP requires maintaining state between receiving a request and the generation of a response. This approach provides a mechanism that can be used to maintain that state (i.e. the received request). By applying Jack's patch, I'll end up having to invent another way to store and retrieve the state. - Sean From mshefty at ichips.intel.com Wed Jun 14 09:46:21 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 09:46:21 -0700 Subject: [openib-general] communication established affiliated asynchronous event In-Reply-To: References: Message-ID: <44903D5D.10102@ichips.intel.com> James Lentini wrote: > The IBTA spec (volume 1, version 1.2) describes a communication > established affiliated asynchronous event. > > Is this event supposed to be delivered to the verbs consumer or the IB > CM? > > We've seen this event delivered to our NFS-RDMA server and aren't sure > what to do with it. This event is delivered to the verbs consumer, since it occurs on the QP.
It's expected that the consumer will call ib_cm_establish. Although, I would guess that you can probably ignore the event, under the assumption that the RTU will eventually be received by the local CM. - Sean From mst at mellanox.co.il Wed Jun 14 09:53:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 19:53:27 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <44903BC6.8020302@ichips.intel.com> References: <44903BC6.8020302@ichips.intel.com> Message-ID: <20060614165327.GI25417@mellanox.co.il> Quoting r. Sean Hefty : > > Would a small client that would normally just send RMPP, get a response and > > exit will be slowed down significantly while the agent learns? > > Clients that send requests are unaffected. Clients that use one of the > pre-defined classes or known methods would also be unaffected. The learning > only affects new methods, and would typically be limited to the receiving one > MAD for each method. Is that per-agent, or global? If per-agent, can this hurt user that writes scripts using management utilities? These will typically send or receive something and exit. No? > > Would a buggy application confuse the umad module, corrupting the agent > > learns? Would a buggy application confuse the umad module, corrupting MAD > > processing for all other applications? > > A buggy application would only affect itself, plus whoever it was trying to > communicate with. We can't really fix the latter though. Is the table of methods maintained per agent then? > The other problem is that DS RMPP requires maintaining state between receiving > a request and the generation of a response. It does? Why does it? 
-- MST From robert.j.woodruff at intel.com Wed Jun 14 09:56:04 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 14 Jun 2006 09:56:04 -0700 Subject: [openib-general] OFED 1.0 release schedule Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F73FBD@orsmsx408> >Woody - For us, SDP ran just fine on InfiniPath on the RHEL4 tests we >ran yesterday the the OFED pre-release candidate. Can you send me the >output you got when you tried it? I ran a modified netpipe over SDP and it hung somewhere around size > 4k. It was on a production Lindenhurst Xeon system. This works fine with the Mellanox cards. I also had problems with uDAPL over pathscale (and thus Intel MPI) and suspect problems with RDMA operations. I did not have time to debug it any further. Were you able to get perftest running, as Arlin suggested to your developers a couple of weeks back ? Right now, I had to pull the pathscale cards to complete regression testing of 1.0-pre1 with Intel MPI and since pathscale does not work with Intel MPI, I put the Mellanox cards back in. From weiny2 at llnl.gov Wed Jun 14 09:59:58 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 14 Jun 2006 09:59:58 -0700 Subject: [openib-general] MPI error when using a "system" call in mpi job. In-Reply-To: <44900970.9050006@cse.ohio-state.edu> References: <20060613171147.35787125.weiny2@llnl.gov> <44900970.9050006@cse.ohio-state.edu> Message-ID: <20060614095958.59c7dcc7.weiny2@llnl.gov> We are on a modified RedHat RHEL4 kernel. Roughly 2.6.9. :-( I am going to try a 2.6.16 kernel I have built to see if it changes. Ira On Wed, 14 Jun 2006 09:04:48 -0400 Sayantan Sur wrote: > Hello Ira, > > I am running the program on 2.6.15 (EM64T machine) and 2.6.16 (IA32 > machine). The program seems to be running fine. Can you tell us which > kernel you are using? We are using drivers pulled out of the trunk > about 3-4 weeks back. > > Thanks, > Sayantan. 
> > Ira Weiny wrote: > > >A co-worker here was seeing the following MPI error from his job: > > > >[1] Abort: [ldev2:1] Got completion with error, code=1 > > at line 2148 in file viacheck.c > > > >After some tracking down he found that apparently if he used a > >"system" call [int system(const char *string)] the next MPI command > >will fail. > > > >I have been able to reproduce this with the attached simple "hello" > >program. > > > >Perhaps someone has seen this type of error? Here is the output > >from 2 runs: > > > >weiny2 at ldev0:~/ior-test > >17:04:04 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello x > >ldev1 > >[0] Abort: [ldev1:0] Got completion with error, code=1 > > at line 2148 in file viacheck.c > >ldev2 > >mpirun_rsh: Abort signaled from [0] > >done. > >weiny2 at ldev0:~/ior-test > >17:05:23 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello > >now = 0.000000 > >now = 0.000052 > >now = 0.000094 > >now = 0.000121 > >now = 0.000151 > >now = 0.001072 > >now = 0.001102 > >now = 0.001118 > >now = 0.001141 > >now = 0.001160 > >done. > > > >We are running mvapich 0.9.7 and the openib trunk rev 6829. > > > >Thanks, > >Ira > > > > > > > >------------------------------------------------------------------------ > > > >_______________________________________________ > >openib-general mailing list > >openib-general at openib.org > >http://openib.org/mailman/listinfo/openib-general > > > >To unsubscribe, please visit > >http://openib.org/mailman/listinfo/openib-general > > > > -- > http://www.cse.ohio-state.edu/~surs > From mshefty at ichips.intel.com Wed Jun 14 10:11:40 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 10:11:40 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614165327.GI25417@mellanox.co.il> References: <44903BC6.8020302@ichips.intel.com> <20060614165327.GI25417@mellanox.co.il> Message-ID: <4490434C.6020003@ichips.intel.com> Michael S. 
Tsirkin wrote: > Is that per-agent, or global? If per-agent, can this hurt user that writes > scripts using management utilities? These will typically send or receive > something and exit. No? This is per agent. The proposal would only affect applications that generate the responses. (Think of it as enforcing that all response MADs match with a received request, so a user can't generate a response for a request that they never received.) An agent that sends a request, and receives the response is unaffected. > Is the table of methods maintained per agent then? That would be my plan; although, we could probably make it global. >>The other problem is that DS RMPP requires maintaining state between receiving >>a request and the generation of a response. > > It does? Why does it? It needs to track receiving an ACK of the final ACK to the request, which carries the initial window size for the response. Conceptually, what happens is: -- request --> <-- ACK request -- -- ACK (response window) --> <-- response -- -- ACK response -> - Sean From robert.j.woodruff at intel.com Wed Jun 14 10:13:24 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 14 Jun 2006 10:13:24 -0700 Subject: [openib-general] MPI error when using a "system" call in mpi job. In-Reply-To: <20060614095958.59c7dcc7.weiny2@llnl.gov> Message-ID: <000201c68fd5$d958f690$50a9070a@amr.corp.intel.com> >Subject: Re: [openib-general] MPI error when using a "system" call in mpi job. >We are on a modified RedHat RHEL4 kernel. Roughly 2.6.9. :-( >I am going to try a 2.6.16 kernel I have built to see if it changes. >Ira We have also seen problems with the 2.6.9 kernel and system call with Intel MPI. The problem seems to be fixed in the VM system somewhere around 2.6.15. I tried to look at what was changed to see if there was an easy patch that one could make to the 2.6.9 kernel to fix the problem, but it was not intuitively obvious what exactly they changed that fixed the problem. 
woody From mshefty at ichips.intel.com Wed Jun 14 10:25:59 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 10:25:59 -0700 Subject: [openib-general] oops on trunk In-Reply-To: <20060614084920.GC19518@mellanox.co.il> References: <20060614084920.GC19518@mellanox.co.il> Message-ID: <449046A7.8090809@ichips.intel.com> How many nodes were running on the fabric when this happened? This was just caused by executing modprobe -r ib_ipoib, right? I'm still completely stumped on how this is occurring, and haven't been able to reproduce it. - Sean From halr at voltaire.com Wed Jun 14 10:20:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 13:20:53 -0400 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884D@mtlexch01.mtl.com> Message-ID: <1150305614.570.205033.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-06-14 at 02:48, Eitan Zahavi wrote: > Hi Hal, Sasha, > > Regarding OpenSM coding style: > > Sasha wrote: > > > > Really? Don't want to bother with examples, but I may see almost any > > "combination" in OpenSM and it is not clear for me which one is common > > (the coding style and identation are different even from file to > file). > [EZ] This bothers me as I think we should use a consistent coding style. > You might also remember we had put in place a both a script to do > automatic indentation and coding style rule fixes (osm_indent and > osm_check_n_fix) > > I did check for all "else" statements: > osm/opensm>grep else *.c | wc -l > 397 > osm/opensm>grep else *.c | grep -v "{" | grep -v "}" | wc -l > 361 > > So you can see only <10% (36 out of 397) "else" statement are not > coding style consistent. 
> Checking what is the code that is "non standard": > osm/opensm>grep else *.c | grep "{" | awk '{print $1}' | sort | uniq -c > | sort -rn > 7 osm_console.c: > 6 osm_prtn_config.c: > 3 st.c: > 3 osm_sa_multipath_record.c: > 2 osm_ucast_mgr.c: > 2 osm_sa_path_record.c: > 1 osm_sa_mcmember_record.c: > 1 osm_sa_informinfo.c: > 1 osm_sa_class_port_info.c: > 1 osm_multicast.c: > > You can see the majority of these mismatches are in code introduced by > Hal and yourself. While some of those are ours (clearly osm_console.c, osm_prtn_config.c, and osm_sa_multipath_record.c), not all of them are. I'm sure some came from you, Yael, and Ofer so let's not be pointing fingers. I don't bother to kick back each patch on these details. If I did, we would get nowhere. I fixed a number of the ones you pointed to above just now. But let's back up a bit... > I think OpenSM should use a single coding style. This is the key but now is (still) not the time. How about we take this up in about a month, maybe sooner if things settle down a little quicker? I'll bring this up on the list when I think the time is right. I do think it will take time to agree on this and a lot of the rules will be arbitrary. > My proposal is that we > update our osm_indent script with a set of rules we agree on and apply > to the entire tree. I'm unconvinced that osm_indent is sufficient. I think a lot of human attention is needed afterwards. I've seen that happen before. How much time do you have to invest in doing this?
-- Hal From mshefty at ichips.intel.com Wed Jun 14 10:40:47 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 10:40:47 -0700 Subject: [openib-general] [PATCH 0/5] multicast abstraction In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> References: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <44904A1F.9010701@ichips.intel.com> Sean Hefty wrote: > This patch series enhances support for joining and leaving multicast groups, > providing the following functionality: I'd like to commit both the multicast and UD QP support change sets. Are there any disagreements with committing these to the trunk? This would provide a single interface for setting up RC, UD, and multicast communication. The only mentioned drawback is that iWarp does not define support for UD or multicast communication. - Sean From mst at mellanox.co.il Wed Jun 14 10:59:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 20:59:34 +0300 Subject: [openib-general] oops on trunk In-Reply-To: <449046A7.8090809@ichips.intel.com> References: <449046A7.8090809@ichips.intel.com> Message-ID: <20060614175934.GA27134@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] oops on trunk > > How many nodes were running on the fabric when this happened? back to back > This was just > caused by executing modprobe -r ib_ipoib, right? yes -- MST From mst at mellanox.co.il Wed Jun 14 11:05:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 21:05:02 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> Message-ID: <20060614180502.GB27134@mellanox.co.il> Quoting r. Sean Hefty : > One of the ideas then, is for the kernel umad module to learn which MADs > generate responses. 
It would do this by updating an entry to a table whenever > a response MAD is generated. A received MAD would check against the table to > see if a response is supposed to be generated. If not, then the MAD would be > freed after userspace claims it. If a response is expected, then the MAD > would not be freed until the response was generated. Another concern with this approach: consider an application that accepts incoming MAD requests and drops some of them. With current code it can do this safely and remote side will retry. With the duplicate tracking in umad module that you propose, MAD will stay in the list forever, and application will never again get called. This kind of subtle behaviour change seems to me worse than outright ABI breakage. -- MST From caitlinb at broadcom.com Wed Jun 14 11:06:44 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 14 Jun 2006 11:06:44 -0700 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. Message-ID: <54AD0F12E08D1541B826BE97C98F99F1576635@NT-SJCA-0751.brcm.ad.broadcom.com> netdev-owner at vger.kernel.org wrote: > On Tue, 2006-06-13 at 16:46 -0500, Steve Wise wrote: >> On Tue, 2006-06-13 at 14:36 -0700, Sean Hefty wrote: >>>>> Er...no. It will lose this event. Depending on the event...the >>>>> carnage varies. We'll take a look at this. >>>>> >>>> >>>> This behavior is consistent with the Infiniband CM (see >>>> drivers/infiniband/core/cm.c function cm_recv_handler()). But I >>>> think we should at least log an error because a lost event will >>>> usually stall the rdma connection. >>> >>> I believe that there's a difference here. For the Infiniband CM, an >>> allocation error behaves the same as if the received MAD were lost >>> or dropped. Since MADs are unreliable anyway, it's not so much that >>> an IB CM event gets lost, as it doesn't ever occur. A remote CM >>> should retry the send, which hopefully allows the > connection to make forward progress. >>> >> >> hmm. Ok. I see. 
I misunderstood the code in cm_recv_handler(). >> >> Tom and I have been talking about what we can do to not drop the >> event. Stay tuned. > > Here's a simple solution that solves the problem: > > For any given cm_id, there are a finite (and small) number of > outstanding CM events that can be posted. So we just > pre-allocate them when the cm_id is created and keep them on > a free list hanging off of the cm_id struct. Then the event > handler function will pull from this free list. > > The only case where there is any non-finite issue is on the > passive listening cm_id. Each incoming connection request > will consume a work struct. So based on client connects, we > could run out of work structs. > However, the CMA has the concept of a backlog, which is > defined as the max number of pending unaccepted connection > requests. So we allocate these work structs based on that > number (or a computation based on that number), and if we run > out, we simply drop the incoming connection request due to > backlog overflow (I suggest we log the drop event too). > When a MPA connection request is dropped, the (IETF > conforming) MPA client will eventually time out the > connection and the consumer can retry. > > Comments? > If the IWCM cannot accept a Connection Request event from the driver then *someone* should generate a non-peer reject MPA Response frame. Since the IWCM does not have the resources to relay the event, it probably does not have the resources to generate the MPA Response frame either. So simply returning an "I'm Busy" error and expecting the driver to handle it makes sense to me. 
From mshefty at ichips.intel.com Wed Jun 14 11:27:06 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 11:27:06 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614180502.GB27134@mellanox.co.il> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614180502.GB27134@mellanox.co.il> Message-ID: <449054FA.7060304@ichips.intel.com> Michael S. Tsirkin wrote: > Another concern with this approach: consider an application that accepts > incoming MAD requests and drops some of them. With current code it can do this > safely and remote side will retry. With the duplicate tracking in umad module > that you propose, MAD will stay in the list forever, and application will never > again get called. This is why I proposed a timeout for responses. > This kind of subtle behaviour change seems to me worse than outright ABI > breakage. If everyone is okay with breaking the ABI, then I would add send completion notification to umad, and put the responsibility on callers not to generate duplicate responses. - Sean From sashak at voltaire.com Wed Jun 14 11:39:32 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 21:39:32 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060614000610.GJ23320@durango.c3.lanl.gov> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> <20060613200035.GG10482@sashak.voltaire.com> <20060614000610.GJ23320@durango.c3.lanl.gov> Message-ID: <20060614183932.GB10544@sashak.voltaire.com> Hi Greg, On 18:06 Tue 13 Jun , Greg Johnson wrote: > On Tue, Jun 13, 2006 at 11:00:35PM +0300, Sasha Khapyorsky wrote: > > Hi Greg, > > > > On 11:02 Tue 13 Jun , Greg Johnson wrote: > > > It seems to load the routes generated by the dump > > > script, but afterward it is not possible to dump the routes again. > > > > This means you have broken LFTs now. 
Probably I know what is going on > > here - new LFTs don't have " 0" entries, and switches are > > not accessible by LIDs anymore. > > > > Please update 'ibroute' utility (diags/) from the trunk and recreate the > > dump file - this should fix the problem. > > > > (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). > > Ok, that fixed it. It works fine now. Good. Thanks for trying this. > Any chance of making our own lid -> guid assignments while we are at it? Does guid2lid file not help? As I understand you want to load predefined LIDs, right? Sasha. From mlang at lanl.gov Wed Jun 14 11:48:06 2006 From: mlang at lanl.gov (michael k lang) Date: Wed, 14 Jun 2006 12:48:06 -0600 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060614183932.GB10544@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> <20060613200035.GG10482@sashak.voltaire.com> <20060614000610.GJ23320@durango.c3.lanl.gov> <20060614183932.GB10544@sashak.voltaire.com> Message-ID: <1150310886.16684.12.camel@jumper.c3.lanl.gov.c3.lanl.gov> On Wed, 2006-06-14 at 21:39 +0300, Sasha Khapyorsky wrote: > Hi Greg, > > On 18:06 Tue 13 Jun , Greg Johnson wrote: > > On Tue, Jun 13, 2006 at 11:00:35PM +0300, Sasha Khapyorsky wrote: > > > Hi Greg, > > > > > > On 11:02 Tue 13 Jun , Greg Johnson wrote: > > > > It seems to load the routes generated by the dump > > > > script, but afterward it is not possible to dump the routes again. > > > > > > This means you have broken LFTs now. Probably I know what is going on > > > here - new LFTs don't have " 0" entries, and switches are > > > not accessible by LIDs anymore. > > > > > > Please update 'ibroute' utility (diags/) from the trunk and recreate the > > > dump file - this should fix the problem. > > > > > > (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). > > > > Ok, that fixed it. 
It works fine now. > > Good. Thanks for trying this. > > > Any chance of making our own lid -> guid assignments while we are at it? > > Does guid2lid file not help? Ya, guid2lid has all the info we need, we were just trying to take off one more level of indirection; it's not necessary. > > As I understand you want to load predefined LIDs, right? > > Sasha. --Mike From halr at voltaire.com Wed Jun 14 11:58:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 14:58:34 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <449054FA.7060304@ichips.intel.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614180502.GB27134@mellanox.co.il> <449054FA.7060304@ichips.intel.com> Message-ID: <1150311514.4506.1171.camel@hal.voltaire.com> On Wed, 2006-06-14 at 14:27, Sean Hefty wrote: > Michael S. Tsirkin wrote: > > Another concern with this approach: consider an application that accepts > > incoming MAD requests and drops some of them. With current code it can do this > > safely and remote side will retry. With the duplicate tracking in umad module > > that you propose, MAD will stay in the list forever, and application will never > > again get called. > > This is why I proposed a timeout for responses. > > > This kind of subtle behaviour change seems to me worse than outright ABI > > breakage. > > If everyone is okay with breaking the ABI, then I would add send completion > notification to umad, and put the responsibility on callers not to generate > duplicate responses. Is this a better architectural solution ? I'm not sure I totally understand what the new ABI would be and its impact on existing applications. Is there an example of what this might look like ?
-- Hal > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Jun 14 12:13:02 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 22:13:02 +0300 Subject: [openib-general] [PATCH] OpenSM/osm_ucast_file: Eliminate compiler warning In-Reply-To: <1150292338.570.197234.camel@hal.voltaire.com> References: <1150292338.570.197234.camel@hal.voltaire.com> Message-ID: <20060614191302.GH10544@sashak.voltaire.com> On 09:39 Wed 14 Jun , Hal Rosenstock wrote: > OpenSM/osm_ucast_file: Eliminate compiler warning > > osm_ucast_file.c: In function `do_ucast_file_load': > osm_ucast_file.c:156: warning: passing arg 2 of `cl_qmap_apply_func' > from incompatible pointer type > > Signed-off-by: Hal Rosenstock Missed that. Thanks for fixing. Sasha From eitan at mellanox.co.il Wed Jun 14 12:17:06 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 14 Jun 2006 22:17:06 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236885E@mtlexch01.mtl.com> Hi Hal, My point is clear: the fact there are some files that are inconsistent should not open the door for a total mess. I think the simple statistics do answer the question raised: " ... and it is not clear for me which one is common". If we do delay the major cleanup we should at least try to stick to the existing standard. From the mail thread it seems the intention is different and this makes me really uncomfortable. Eitan > > Hi Eitan, > > On Wed, 2006-06-14 at 02:48, Eitan Zahavi wrote: > > Hi Hal, Sasha, > > > > Regarding OpenSM coding style: > > > > Sasha wrote: > > > > > > Really?
Don't want to bother with examples, but I may see almost any > > "combination" in OpenSM and it is not clear for me which one is common > > (the coding style and indentation are different even from file to > > file). > > [EZ] This bothers me as I think we should use a consistent coding style. > > You might also remember we had put in place both a script to do > > automatic indentation and coding style rule fixes (osm_indent and > > osm_check_n_fix) > > > > I did check for all "else" statements: > > osm/opensm>grep else *.c | wc -l > > 397 > > osm/opensm>grep else *.c | grep -v "{" | grep -v "}" | wc -l > > 361 > > > > So you can see only <10% (36 out of 397) "else" statements are not > > coding style consistent. > > Checking what is the code that is "non standard": > > osm/opensm>grep else *.c | grep "{" | awk '{print $1}' | sort | uniq -c > > | sort -rn > > 7 osm_console.c: > > 6 osm_prtn_config.c: > > 3 st.c: > > 3 osm_sa_multipath_record.c: > > 2 osm_ucast_mgr.c: > > 2 osm_sa_path_record.c: > > 1 osm_sa_mcmember_record.c: > > 1 osm_sa_informinfo.c: > > 1 osm_sa_class_port_info.c: > > 1 osm_multicast.c: > > > > You can see the majority of these mismatches are in code introduced by > > Hal and yourself. > > While some of those are ours (clearly osm_console.c, osm_prtn_config.c, > and osm_sa_multipath_record.c), not all of them are. I'm sure some came > from you, Yael, and Ofer so let's not be pointing fingers. I don't > bother to kick back each patch on these details. If I did, we would get > nowhere. I fixed a number of the ones you pointed to above just now. > > But let's back up a bit... > > > I think OpenSM should use a single code style. > > This is the key but now is (still) not the time. How about we take this up in about a month, maybe sooner if things settle down a little quicker ? I'll bring this up on the list when I think the time is right. I do think it will take time to agree on this and a lot of the rules will be arbitrary.
> > > My proposal is that we > update our osm_indent script with a set of rules we agree on and apply > to the entire tree. > > I'm unconvinced that osm_indent is sufficient. I think a lot of human > attention is needed afterwards. I've seen that happen before. How much > time do you have to invest in doing this ? > > -- Hal From mshefty at ichips.intel.com Wed Jun 14 12:23:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 12:23:27 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150311514.4506.1171.camel@hal.voltaire.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614180502.GB27134@mellanox.co.il> <449054FA.7060304@ichips.intel.com> <1150311514.4506.1171.camel@hal.voltaire.com> Message-ID: <4490622F.30802@ichips.intel.com> Hal Rosenstock wrote: >>If everyone is okay with breaking the ABI, then I would add send completion >>notification to umad, and put the responsibility on callers not to generate >>duplicate responses. > > Is this a better architectural solution ? Not sure. It doesn't solve supporting DS RMPP, which requires maintaining state between receiving a request and the generation of a response. > I'm not sure I totally understand what the new ABI would be and its > impact on existing applications. Is there an example of what this might > look like ? Currently, the only send MADs that are reported to the user are requests that time out waiting for a response. We could probably change that to report all send completions. Failed sends are reported using a status of timeout, with the MAD header copied to userspace. So the length of the MAD indicates if it was a send or receive. From an implementation standpoint, this approach likely requires only minor changes to the kernel code. But any userspace applications that send MADs would need to change to handle this. The list of applications that do send MADs is likely fairly small however.
If we wanted to be more restrictive on which applications would be affected, we could only generate send completions for response MADs. - Sean From halr at voltaire.com Wed Jun 14 12:25:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 15:25:03 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <4490622F.30802@ichips.intel.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614180502.GB27134@mellanox.co.il> <449054FA.7060304@ichips.intel.com> <1150311514.4506.1171.camel@hal.voltaire.com> <4490622F.30802@ichips.intel.com> Message-ID: <1150313103.4506.2186.camel@hal.voltaire.com> On Wed, 2006-06-14 at 15:23, Sean Hefty wrote: > Hal Rosenstock wrote: > >>If everyone is okay with breaking the ABI, then I would add send completion > >>notification to umad, and put the responsibility on callers not to generate > >>duplicate responses. > > > > Is this a better architectural solution ? > > Not sure. Then it's likely not worth breaking the ABI which will cause more pain than it's worth. > It doesn't solve supporting DS RMPP, which requires maintaining state > between receiving a request and the generation of a response. > > > I'm not sure I totally understand what the new ABI would be and its > > impact on existing applications. Is there an example of what this might > > look like ? > > Currently, the only send MADs that are reported to the user are requests that > time out waiting for a response. We could probably change that to report all > send completions. Failed sends are reported using a status of timeout, with the > MAD header copied to userspace. So the length of the MAD indicates if it was a > send or receive. > > From an implementation stand point, this approach likely requires only minor > changes to the kernel code. But any userspace applications that send MADs would > need to change to handle this. The list of application that do send MADs is > likely fairly small however. 
It's not so small. > If we wanted to be more restrictive on which applications would be affected, we > could only generate send completions for response MADs. I think that would only pare it down a little. -- Hal > - Sean From sashak at voltaire.com Wed Jun 14 12:42:06 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 22:42:06 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <1150310886.16684.12.camel@jumper.c3.lanl.gov.c3.lanl.gov> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> <20060613200035.GG10482@sashak.voltaire.com> <20060614000610.GJ23320@durango.c3.lanl.gov> <20060614183932.GB10544@sashak.voltaire.com> <1150310886.16684.12.camel@jumper.c3.lanl.gov.c3.lanl.gov> Message-ID: <20060614194206.GJ10544@sashak.voltaire.com> On 12:48 Wed 14 Jun , michael k lang wrote: > On Wed, 2006-06-14 at 21:39 +0300, Sasha Khapyorsky wrote: > > Hi Greg, > > > > On 18:06 Tue 13 Jun , Greg Johnson wrote: > > > On Tue, Jun 13, 2006 at 11:00:35PM +0300, Sasha Khapyorsky wrote: > > > > Hi Greg, > > > > > > > > On 11:02 Tue 13 Jun , Greg Johnson wrote: > > > > > It seems to load the routes generated by the dump > > > > > script, but afterward it is not possible to dump the routes again. > > > > > > > > This means you have broken LFTs now. Probably I know what is going on > > > > here - new LFTs don't have " 0" entries, and switches are > > > > not accessible by LIDs anymore. > > > > > > > > Please update 'ibroute' utility (diags/) from the trunk and recreate the > > > > dump file - this should fix the problem. > > > > > > > > (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). > > > > > > Ok, that fixed it. It works fine now. > > > > Good. Thanks for trying this. > > > > > Any chance of making our own lid -> guid assignments while we are at it? > > > > Does guid2lid file not help? 
> Ya, guid2lid has all the info we need, we were just trying to take off > one more level of indirection; it's not necessary. You want to have all the info in one file. Right? It could be an interesting idea to extend the routing engine with a 'lid loader'. Will need to think about it. Sasha From mst at mellanox.co.il Wed Jun 14 13:11:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 23:11:05 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <4490434C.6020003@ichips.intel.com> References: <4490434C.6020003@ichips.intel.com> Message-ID: <20060614201105.GA27868@mellanox.co.il> Quoting r. Sean Hefty : > >>The other problem is that DS RMPP requires maintaining state between > >>receiving a request and the generation of a response. > > > > It does? Why does it? > > It needs to track receiving an ACK of the final ACK to the request, which > carries the initial window size for the response. Conceptually, what happens is: > > -- request --> > <-- ACK request -- > -- ACK (response window) --> > <-- response -- > -- ACK response -> OK, so apparently, what we have with dual-sided, after the ack with the response window arrives, is a sender that can't send data since userspace did not give us the response. I see how this approach would require significant change in core, and I'm not really happy with this. Here's an alternative idea: instead of making huge changes all over, how about we delay passing the RMPP transaction up to the user until we have the ACK with the response window, and ask the user to give us back this ACK packet (or just the window?) when he sends the response? Since we didn't support dual-sided transfers this extends rather than breaks both the ABI and the API. The issue of duplicates can then be dealt with by Jack's patch, detecting duplicate requests which does not require additional state. Sounds good?
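Michael's delay-until-ACK idea amounts to a small per-receive state machine. A minimal sketch, with entirely illustrative names and layout (this is not the real ib_mad/ib_umad code, which tracks far more state):

```c
/* Sketch of holding a completed dual-sided RMPP receive until the
 * direction-switch ACK arrives, then delivering the request together
 * with the peer's initial response window (names are illustrative). */
#include <assert.h>

enum ds_state {
	DS_RECV_IN_PROGRESS,	/* still reassembling request segments */
	DS_AWAIT_ACK,		/* request complete, holding for the ACK */
	DS_DELIVERED		/* request + window handed up to the user */
};

struct ds_rmpp_recv {
	enum ds_state state;
	unsigned int resp_window;	/* initial window carried by the ACK */
};

/* Last data segment reassembled: do NOT deliver to userspace yet. */
static void ds_recv_done(struct ds_rmpp_recv *rx)
{
	rx->state = DS_AWAIT_ACK;
}

/* The ACK carrying the response window arrives: now deliver the request
 * and the window together. A duplicate or out-of-order ACK is rejected
 * without allocating any additional state. */
static int ds_ack(struct ds_rmpp_recv *rx, unsigned int window)
{
	if (rx->state != DS_AWAIT_ACK)
		return -1;
	rx->resp_window = window;
	rx->state = DS_DELIVERED;
	return 0;
}
```

The appeal of this shape is exactly what the thread argues: the user hands the window back when sending the response, so nothing new has to persist in the kernel between the request and the response.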
-- MST From paul.lundin at gmail.com Wed Jun 14 13:37:44 2006 From: paul.lundin at gmail.com (Paul) Date: Wed, 14 Jun 2006 16:37:44 -0400 Subject: [openib-general] OFED 1.0-pre 1 build issues. Message-ID: Hello All, Using the default build.sh script on x86_64 rhel4u3 works flawlessly. However when doing the same thing on ppc64 the build fails (both are "everything" installs). The frustrating thing about the failure is that it's failing while looking in the wrong locations for some libraries. Instead of looking in the lib64 directories it's looking in lib. I have tried setting LDFLAGS, CXXFLAGS, CCFLAGS and CFLAGS to -m64 with no change, lib64 stuff is listed before lib in ld.so.conf (which I think only affects runtime ...). Here is the exact error: g++ -shared -nostdlib /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/crti.o /usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtbeginS.o .libs/client.o .libs/simmsg.o .libs/msgmgr.o .libs/tcpcomm.o -L/usr/lib/gcc/ppc64-redhat-linux/3.4.5 -L/usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib -L/usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../..
-L/lib/../lib -L/usr/lib/../lib -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtsavres.o /usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtendS.o /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/crtn.o -m64 -mminimal-toc -Wl,-soname -Wl,libibmscli.so.1 -o .libs/libibmscli.so.1.0.0 /usr/bin/ld: skipping incompatible /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/libc.so when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/libc.a when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../libc.so when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../libc.a when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/../lib/libc.so when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/../lib/libc.a when searching for -lc /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/crti.o' is incompatible with powerpc:common64 output /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtbeginS.o' is incompatible with powerpc:common64 output /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtsavres.o' is incompatible with powerpc:common64 output /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtendS.o' is incompatible with powerpc:common64 output /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/crtn.o' is incompatible with powerpc:common64 output /usr/bin/ld: can not size stub section: Bad value /usr/bin/ld: .libs/libibmscli.so.1.0.0: Not enough room for program headers, try linking with -N /usr/bin/ld: final link failed: Bad value collect2: ld returned 1 exit status make[3]: *** [libibmscli.la] Error 1 make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibmgtsim/src' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibmgtsim' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibmgtsim' make: *** [all-recursive] Error 1 error: Bad exit status from /var/tmp/rpm-tmp.18200 (%install) Regards. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Wed Jun 14 13:50:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 13:50:27 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614201105.GA27868@mellanox.co.il> References: <4490434C.6020003@ichips.intel.com> <20060614201105.GA27868@mellanox.co.il> Message-ID: <44907693.7010001@ichips.intel.com> Michael S. Tsirkin wrote: > Here's an alternative idea: instead of making huge changes all over, how about > we delay passing the RMPP transaction up to the user until we have the ACK with > the response window, and ask the user to give us back this ACK packet (or just > the window?) when he sends the response? Since we didn't support dual-sided > transfers this extends rather than breaks both the ABI and the API. I thought about this as well, and I think there was a discussion about doing this. The window size could be exchanged in the RMPP header if needed. We're kind of left with the same issue of trying to determine if a received MAD will generate a response.
> The issue of duplicates can then be dealt with by Jack's patch, > detecting duplicate requests which does not require additional state. > > Sounds good? > If an alternative for handling DS RMPP can be found, I'm fine with this. - Sean From mst at mellanox.co.il Wed Jun 14 13:56:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 23:56:18 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <44907693.7010001@ichips.intel.com> References: <44907693.7010001@ichips.intel.com> Message-ID: <20060614205618.GB28111@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: detecting duplicate MAD requests > > Michael S. Tsirkin wrote: > > Here's an alternative idea: instead of making huge changes all over, how > > about we delay passing the RMPP transaction up to the user until we have the > > ACK with the response window, and ask the user to give us back this ACK > > packet (or just the window?) when he sends the response? Since we didn't > > support dual-sided tansfers this extends rather than breaks both the ABI and > > the API. > > I thought about this as well, and I think there was a discussion about doing > this. The window size could be exchanged in the RMPP header if needed. Sounds good. > We're kind of left with the same issue of trying to determine if a received > MAD will generate a response. How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided transfer always has a response, doesn't it? -- MST From mshefty at ichips.intel.com Wed Jun 14 14:03:22 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 14:03:22 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614205618.GB28111@mellanox.co.il> References: <44907693.7010001@ichips.intel.com> <20060614205618.GB28111@mellanox.co.il> Message-ID: <4490799A.9040802@ichips.intel.com> Michael S. 
Tsirkin wrote: >>We're kind of left with the same issue of trying to determine if a received >>MAD will generate a response. > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > transfer always has a response, doesn't it? Unless I completely missed something, there is no IsDS flag. - Sean From mst at mellanox.co.il Wed Jun 14 14:22:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 00:22:50 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <4490799A.9040802@ichips.intel.com> References: <4490799A.9040802@ichips.intel.com> Message-ID: <20060614212250.GC28111@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: detecting duplicate MAD requests > > Michael S. Tsirkin wrote: > >>We're kind of left with the same issue of trying to determine if a received > >>MAD will generate a response. > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > transfer always has a response, doesn't it? I mean, the flag in the application that says that the transfer is dual-sided. The spec seems to imply that user can figure *from the method* that IsDS=1, so I assume users will have this logic: "2) Begin the initial transfer by starting the send operation at the point labelled Send. The method or other indication should be interpreted on the other side as initiating a double-sided transfer, causing the receive context to set IsDS=1." So why does the MAD layer care whether a received MAD will generate a response? A request arrives - we pass it up. Now the ACK for the direction switch arrives - we pass it up too, application should be waiting for it, it should take the window and pass the response back to us.
-- MST From halr at voltaire.com Wed Jun 14 14:20:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 17:20:08 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <4490799A.9040802@ichips.intel.com> References: <44907693.7010001@ichips.intel.com> <20060614205618.GB28111@mellanox.co.il> <4490799A.9040802@ichips.intel.com> Message-ID: <1150320008.4506.6189.camel@hal.voltaire.com> On Wed, 2006-06-14 at 17:03, Sean Hefty wrote: > Michael S. Tsirkin wrote: > >>We're kind of left with the same issue of trying to determine if a received > >>MAD will generate a response. > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > transfer always has a response, doesn't it? > > Unless I completely missed something, there is no IsDS flag. IsDS is an internal state variable and not an on wire part of the protocol. -- Hal > - Sean From mst at mellanox.co.il Wed Jun 14 14:28:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 00:28:48 +0300 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: References: Message-ID: <20060614212848.GD28111@mellanox.co.il> Quoting r. Paul : > Subject: OFED 1.0-pre 1 build issues. > > Hello All, > Using the default build.sh script on x86_64 rhel4u3 works flawlessly. However when doing the same thing on ppc64 the build fails (both are "everything" installs). The frustrating thing about the failure is that its failing while looking in the wrong > locations for some libraries. Instead of looking in the lib64 directories its looking in lib. I have tried setting LDFLAGS, CXXFLAGS, CCFLAGS and CFLAGS to -m64 with no change, lib64 stuff is listed before lib in ld.so.conf (which I think only affects > runtime ...). Here is the exact error: > Maybe, write a small script along the lines of #!/bin/perl my $name = $0; $name =~ s#.*/##; exec("/usr/bin/$name", "-m64", @ARGV); and have it linked as ld, gcc and g++ on path before /usr/bin? 
-- MST From halr at voltaire.com Wed Jun 14 14:23:25 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 17:23:25 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614212250.GC28111@mellanox.co.il> References: <4490799A.9040802@ichips.intel.com> <20060614212250.GC28111@mellanox.co.il> Message-ID: <1150320204.4506.6311.camel@hal.voltaire.com> On Wed, 2006-06-14 at 17:22, Michael S. Tsirkin wrote: > Quoting r. Sean Hefty : > > Subject: Re: RFC: detecting duplicate MAD requests > > > > Michael S. Tsirkin wrote: > > >>We're kind of left with the same issue of trying to determine if a received > > >>MAD will generate a response. > > > > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > > transfer always has a response, doesn't it? > > I mean, the flag in the application that says that the transfer is dual-sided. > The spec seems to imply that user can figure *from the method* that IsDS=1, so I > assume users will have this logic: > > "2) > Begin the initial transfer by starting the send operation at the point labelled > Send. The method or other indication should be interpreted on > the other side as initiating a double-sided transfer, causing the receive > context to set IsDS=1." > > > So why does the MAD layer care whether a received MAD will generate a resonse? > A request arrives - we pass it up. Now the ACK for the direction switch arrives > - we pass it up too, application should be waiting for it, it should take the > window and pass the response back to us. The ACKs are transparent to the application/user. -- Hal From mst at mellanox.co.il Wed Jun 14 14:30:55 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 15 Jun 2006 00:30:55 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150320008.4506.6189.camel@hal.voltaire.com> References: <1150320008.4506.6189.camel@hal.voltaire.com> Message-ID: <20060614213055.GE28111@mellanox.co.il> Quoting r. Hal Rosenstock : > > >>We're kind of left with the same issue of trying to determine if a received > > >>MAD will generate a response. > > > > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > > transfer always has a response, doesn't it? > > > > Unless I completely missed something, there is no IsDS flag. > > IsDS is an internal state variable and not an on wire part of the > protocol. Yes, I know, but user knows IsDS is 1 so why does MAD layer care whether there will be a response? It's up to the application to switch to the sender flow. -- MST From mst at mellanox.co.il Wed Jun 14 14:33:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 00:33:42 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150320204.4506.6311.camel@hal.voltaire.com> References: <1150320204.4506.6311.camel@hal.voltaire.com> Message-ID: <20060614213342.GF28111@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: RFC: detecting duplicate MAD requests > > On Wed, 2006-06-14 at 17:22, Michael S. Tsirkin wrote: > > Quoting r. Sean Hefty : > > > Subject: Re: RFC: detecting duplicate MAD requests > > > > > > Michael S. Tsirkin wrote: > > > >>We're kind of left with the same issue of trying to determine if a received > > > >>MAD will generate a response. > > > > > > > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > > > transfer always has a response, doesn't it? > > > > I mean, the flag in the application that says that the transfer is dual-sided. 
> > The spec seems to imply that user can figure *from the method* that IsDS=1, so I > > assume users will have this logic: > > > > "2) > > Begin the initial transfer by starting the send operation at the point labelled > > Send. The method or other indication should be interpreted on > > the other side as initiating a double-sided transfer, causing the receive > > context to set IsDS=1." > > > > > > So why does the MAD layer care whether a received MAD will generate a > > resonse? A request arrives - we pass it up. Now the ACK for the direction > > switch arrives - we pass it up too, application should be waiting for it, it > > should take the window and pass the response back to us. > > The ACKs are transparent to the application/user. Well the ACK for the direction switch is special, isn't it? All I'm saying, let's pass it up to the application. -- MST From mst at mellanox.co.il Wed Jun 14 14:37:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 00:37:50 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614213342.GF28111@mellanox.co.il> References: <1150320204.4506.6311.camel@hal.voltaire.com> <20060614213342.GF28111@mellanox.co.il> Message-ID: <20060614213750.GG28111@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: RFC: detecting duplicate MAD requests > > Quoting r. Hal Rosenstock : > > Subject: Re: RFC: detecting duplicate MAD requests > > > > On Wed, 2006-06-14 at 17:22, Michael S. Tsirkin wrote: > > > Quoting r. Sean Hefty : > > > > Subject: Re: RFC: detecting duplicate MAD requests > > > > > > > > Michael S. Tsirkin wrote: > > > > >>We're kind of left with the same issue of trying to determine if a received > > > > >>MAD will generate a response. > > > > > > > > > > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > > > > transfer always has a response, doesn't it? 
> > > > > > I mean, the flag in the application that says that the transfer is dual-sided. > > > The spec seems to imply that user can figure *from the method* that IsDS=1, so I > > > assume users will have this logic: > > > > > > "2) > > > Begin the initial transfer by starting the send operation at the point labelled > > > Send. The method or other indication should be interpreted on > > > the other side as initiating a double-sided transfer, causing the receive > > > context to set IsDS=1." > > > > > > > > > So why does the MAD layer care whether a received MAD will generate a > > > resonse? A request arrives - we pass it up. Now the ACK for the direction > > > switch arrives - we pass it up too, application should be waiting for it, it > > > should take the window and pass the response back to us. > > > > The ACKs are transparent to the application/user. > > Well the ACK for the direction switch is special, isn't it? > All I'm saying, let's pass it up to the application. I suggest a rule along the lines of "if an ACK arrives with segment number of 0 this means sender is requesting dual sided RMPP, pass it up to the application". What's the problem with this approach? I think this does not break existing apps since these don't do DS RMPP and so never get such an ACK. Right? -- MST From bos at pathscale.com Wed Jun 14 15:30:03 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 14 Jun 2006 15:30:03 -0700 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: References: Message-ID: <1150324203.10676.17.camel@chalcedony.pathscale.com> On Wed, 2006-06-14 at 16:37 -0400, Paul wrote: > Using the default build.sh script on x86_64 rhel4u3 works > flawlessly. However when doing the same thing on ppc64 the build fails > (both are "everything" installs). Looks like you don't have the gcc-devel.ppc64 RPM installed. Isn't building in a multiarch environment fun? 
It appears that processes are not exiting cleanly on SVN7946 trunk backported to 2.6.9-34 EL. They seem to be stuck in a state of "DL" and I cannot even attach to them wil gdb or kill them with a kill -9. [root at iclust-1 core]# ps -uax | grep IMB woody 4087 0.0 0.0 58500 3172 pts/3 T 14:45 0:00 gdb ./IMB-MPI1 -p 4067 woody 4067 2.3 0.0 33108 2708 ? DL 14:44 0:12 ./IMB-MPI1 woody 4109 3.1 0.0 40148 2572 ? DL 14:47 0:12 ./IMB-MPI1 root 4156 0.0 0.0 51080 732 pts/3 S+ 14:53 0:00 grep IMB The last code I pulled SVN7843 did not have this problem. Any ideas on what might be causing this ? woody -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.lundin at gmail.com Wed Jun 14 16:53:24 2006 From: paul.lundin at gmail.com (Paul) Date: Wed, 14 Jun 2006 19:53:24 -0400 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: <1150324203.10676.17.camel@chalcedony.pathscale.com> References: <1150324203.10676.17.camel@chalcedony.pathscale.com> Message-ID: Bryan, There is no such rpm for rhel 4 u3, perhaps you meant glibc-devel (installed) ? Also worth noting is that this is on a different machine than the x86_64 build (which was an opteron). This is a standalone power5 system, not cross-compiling. Michael, I performed the same work-around in bash (not so good with perl these days) it gets past the prior point. Thanks. Should something that takes care of this be included in the build.sh or build_env.sh scripts ? We would certainly need it covered in the docs at least. Now the build is dying on some undefined references. (log attached) Regards. On 6/14/06, Bryan O'Sullivan wrote: > > On Wed, 2006-06-14 at 16:37 -0400, Paul wrote: > > > Using the default build.sh script on x86_64 rhel4u3 works > > flawlessly. However when doing the same thing on ppc64 the build fails > > (both are "everything" installs). > > Looks like you don't have the gcc-devel.ppc64 RPM installed. Isn't > building in a multiarch environment fun? 
> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: make_user.log.gz Type: application/x-gzip Size: 9915 bytes Desc: not available URL: From sean.hefty at intel.com Wed Jun 14 17:01:16 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 17:01:16 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614213342.GF28111@mellanox.co.il> Message-ID: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> >Well the ACK for the direction switch is special, isn't it? >All I'm saying, let's pass it up to the application. I really don't think that this is the direction that we want to take the interface. A multithreaded application could see the ACK before the request. Multiple ACKs could be received for the same request, or no ACK could be received at all. This pushes timeout handling and duplicate detection up to the any application using DS RMPP. We should work for a simpler interface, especially one exposed to userspace. Let's start with an interface that's efficient and works well in the kernel, and then determine how to expose that interface up to userspace. Let's try to keep the complexity in one location. Btw, it looks like Jack's patch has the MAD layer read MAD data while it is in transfer. I don't think that we can do this while the data is mapped. - Sean From bugzilla-daemon at openib.org Wed Jun 14 18:09:42 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 14 Jun 2006 18:09:42 -0700 (PDT) Subject: [openib-general] [Bug 3] openIB can run on SGI ia64 paltform! 
Message-ID: <20060615010942.7D040228738@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=3 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-14 18:09 ------- Close out old bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Wed Jun 14 18:10:07 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 14 Jun 2006 18:10:07 -0700 (PDT) Subject: [openib-general] [Bug 4] Re: ipoib_ib_post_receive failed for buf 111 Message-ID: <20060615011007.D3370228735@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=4 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-14 18:10 ------- close out old bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bsharp at NetEffect.com Wed Jun 14 18:35:50 2006 From: bsharp at NetEffect.com (Bob Sharp) Date: Wed, 14 Jun 2006 20:35:50 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. 
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC05D4E2D8@venom2> > +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) > +{ > + struct c2_mq *mq = c2dev->qptr_array[mq_index]; > + union c2wr *wr; > + void *resource_user_context; > + struct iw_cm_event cm_event; > + struct ib_event ib_event; > + enum c2_resource_indicator resource_indicator; > + enum c2_event_id event_id; > + unsigned long flags; > + u8 *pdata = NULL; > + int status; > + > + /* > + * retreive the message > + */ > + wr = c2_mq_consume(mq); > + if (!wr) > + return; > + > + memset(&ib_event, 0, sizeof(ib_event)); > + memset(&cm_event, 0, sizeof(cm_event)); > + > + event_id = c2_wr_get_id(wr); > + resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); > + resource_user_context > + (void *) (unsigned long) wr->ae.ae_generic.user_context; > + > + status = cm_event.status = > c2_convert_cm_status(c2_wr_get_result(wr)); > + > + pr_debug("event received c2_dev=%p, event_id=%d, " > + "resource_indicator=%d, user_context=%p, status = %d\n", > + c2dev, event_id, resource_indicator, resource_user_context, > + status); > + > + switch (resource_indicator) { > + case C2_RES_IND_QP:{ > + > + struct c2_qp *qp = (struct c2_qp *)resource_user_context; > + struct iw_cm_id *cm_id = qp->cm_id; > + struct c2wr_ae_active_connect_results *res; > + > + if (!cm_id) { > + pr_debug("event received, but cm_id is , qp=%p!\n", > + qp); > + goto ignore_it; > + } > + pr_debug("%s: event = %s, user_context=%llx, " > + "resource_type=%x, " > + "resource=%x, qp_state=%s\n", > + __FUNCTION__, > + to_event_str(event_id), > + be64_to_cpu(wr->ae.ae_generic.user_context), > + be32_to_cpu(wr->ae.ae_generic.resource_type), > + be32_to_cpu(wr->ae.ae_generic.resource), > + to_qp_state_str(be32_to_cpu(wr- > >ae.ae_generic.qp_state))); > + > + c2_set_qp_state(qp, be32_to_cpu(wr->ae.ae_generic.qp_state)); > + > + switch (event_id) { > + case CCAE_ACTIVE_CONNECT_RESULTS: > + res = &wr->ae.ae_active_connect_results; > + cm_event.event = 
IW_CM_EVENT_CONNECT_REPLY; > + cm_event.local_addr.sin_addr.s_addr = res->laddr; > + cm_event.remote_addr.sin_addr.s_addr = res->raddr; > + cm_event.local_addr.sin_port = res->lport; > + cm_event.remote_addr.sin_port = res->rport; > + if (status == 0) { > + cm_event.private_data_len = > + be32_to_cpu(res->private_data_length); > + } else { > + spin_lock_irqsave(&qp->lock, flags); > + if (qp->cm_id) { > + qp->cm_id->rem_ref(qp->cm_id); > + qp->cm_id = NULL; > + } > + spin_unlock_irqrestore(&qp->lock, flags); > + cm_event.private_data_len = 0; > + cm_event.private_data = NULL; > + } > + if (cm_event.private_data_len) { > + /* copy private data */ > + pdata > + kmalloc(cm_event.private_data_len, > + GFP_ATOMIC); > + if (!pdata) { > + /* Ignore the request, maybe the > + * remote peer will retry */ > + pr_debug ("Ignored connect request -- " > + "no memory for pdata" > + "private_data_len=%d\n", > + cm_event.private_data_len); > + goto ignore_it; > + } > + > + memcpy(pdata, res->private_data, > + cm_event.private_data_len); > + > + cm_event.private_data = pdata; > + } > + if (cm_id->event_handler) > + cm_id->event_handler(cm_id, &cm_event); > + break; > + case CCAE_TERMINATE_MESSAGE_RECEIVED: > + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: > + ib_event.device = &c2dev->ibdev; > + ib_event.element.qp = &qp->ibqp; > + ib_event.event = IB_EVENT_QP_REQ_ERR; > + > + if (qp->ibqp.event_handler) > + qp->ibqp.event_handler(&ib_event, > + qp->ibqp. 
> + qp_context); > + break; > + case CCAE_BAD_CLOSE: > + case CCAE_LLP_CLOSE_COMPLETE: > + case CCAE_LLP_CONNECTION_RESET: > + case CCAE_LLP_CONNECTION_LOST: > + BUG_ON(cm_id->event_handler==(void*)0x6b6b6b6b); > + > + spin_lock_irqsave(&qp->lock, flags); > + if (qp->cm_id) { > + qp->cm_id->rem_ref(qp->cm_id); > + qp->cm_id = NULL; > + } > + spin_unlock_irqrestore(&qp->lock, flags); > + cm_event.event = IW_CM_EVENT_CLOSE; > + cm_event.status = 0; > + if (cm_id->event_handler) > + cm_id->event_handler(cm_id, &cm_event); > + break; > + default: > + BUG_ON(1); > + pr_debug("%s:%d Unexpected event_id=%d on QP=%p, " > + "CM_ID=%p\n", > + __FUNCTION__, __LINE__, > + event_id, qp, cm_id); > + break; > + } > + break; > + } > + > + case C2_RES_IND_EP:{ > + > + struct c2wr_ae_connection_request *req = > + &wr->ae.ae_connection_request; > + struct iw_cm_id *cm_id = > + (struct iw_cm_id *)resource_user_context; > + > + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); > + if (event_id != CCAE_CONNECTION_REQUEST) { > + pr_debug("%s: Invalid event_id: %d\n", > + __FUNCTION__, event_id); > + break; > + } > + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; > + cm_event.provider_data = (void*)(unsigned long)req->cr_handle; > + cm_event.local_addr.sin_addr.s_addr = req->laddr; > + cm_event.remote_addr.sin_addr.s_addr = req->raddr; > + cm_event.local_addr.sin_port = req->lport; > + cm_event.remote_addr.sin_port = req->rport; > + cm_event.private_data_len = > + be32_to_cpu(req->private_data_length); > + > + if (cm_event.private_data_len) { It looks to me as if pdata is leaking here since it is not tracked and the upper layers do not free it. Also, if pdata is freed after the call to cm_id->event_handler returns, it exposes an issue in user space where the private data is garbage. I suspect the iwarp cm should be copying this data before it returns. 
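[Editor's note] A minimal user-space sketch of the ownership rule Bob's comment implies (all names are illustrative stand-ins, not the actual AMSO1100 or iW CM API): the driver allocates `pdata`, the event handler must deep-copy whatever private data it needs while the callback is running, and the driver frees the buffer as soon as the handler returns — so the allocation neither leaks nor escapes as a stale pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the driver's CM event structure. */
struct cm_event {
    const void *private_data;
    size_t private_data_len;
};

/* Consumer side: copy the private data while the callback runs;
 * the pointer must not be used after the handler returns. */
static char saved[64];

static void event_handler(struct cm_event *ev)
{
    memcpy(saved, ev->private_data, ev->private_data_len);
}

/* Driver side: allocate, deliver, then free unconditionally, so
 * ownership of the buffer never leaves the driver. */
static int deliver_connect_request(const void *wire_data, size_t len)
{
    struct cm_event ev;
    void *pdata = malloc(len);

    if (!pdata)
        return -1;  /* drop the request; the remote peer will retry */
    memcpy(pdata, wire_data, len);
    ev.private_data = pdata;
    ev.private_data_len = len;
    event_handler(&ev);
    free(pdata);    /* no leak, no dangling pointer upstream */
    return 0;
}
```

Under this rule the iWARP CM (and any userspace relay above it) has to copy the private data before its callback returns, which matches the issue observed above.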
> + pdata > + kmalloc(cm_event.private_data_len, > + GFP_ATOMIC); > + if (!pdata) { > + /* Ignore the request, maybe the remote peer > + * will retry */ > + pr_debug ("Ignored connect request -- " > + "no memory for pdata" > + "private_data_len=%d\n", > + cm_event.private_data_len); > + goto ignore_it; > + } > + memcpy(pdata, > + req->private_data, > + cm_event.private_data_len); > + > + cm_event.private_data = pdata; > + } > + if (cm_id->event_handler) > + cm_id->event_handler(cm_id, &cm_event); > + break; > + } > + Bob From sweitzen at cisco.com Wed Jun 14 20:38:34 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 14 Jun 2006 20:38:34 -0700 Subject: [openib-general] please add RHEL4 to OS list on OpenIB bugzilla Message-ID: Bryan, Would you please add RHEL4 as an OS for OpenIB bugs? Scott -------------- next part -------------- An HTML attachment was scrubbed... URL: From sweitzen at cisco.com Wed Jun 14 20:45:40 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 14 Jun 2006 20:45:40 -0700 Subject: [openib-general] MVAPICH failure on IBM PPC-64 Linux machine Message-ID: I agree it's not working, and I have opened bug 135 (OFED 1.0: MVAPICH doesn't work on RHEL4 U3 ppc64). 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Boris Shpolyansky Sent: Monday, June 12, 2006 5:53 PM To: openib-general at openib.org Subject: [openib-general] MVAPICH failure on IBM PPC-64 Linux machine Hi, I've run into following failure running OSU MPI out of OFED-rc5 on IBM PPC-64 platform: [1] Abort: Error creating QP at line 820 in file viainit.c mpirun: executable version 1 does not match our version 3, This seems to be memory allocation issue which could be easily explained (and overcome) if the job is launched with regular user permissions, but in my case it's root who launches it. Have anybody tested OFED's OSU MPI on PPC-64 platform recently and can comment on this ? Thanks, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Wed Jun 14 21:10:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Jun 2006 00:10:42 -0400 Subject: [openib-general] [PATCH] [MINOR} OpenSM/osm_port_info_rcv.c: Move assert to before where PortInfo is assumed Message-ID: <1150344641.4506.20803.camel@hal.voltaire.com> OpenSM/osm_port_info_rcv.c: Move assert to before where PortInfo is assumed as shouldn't be processing as PortInfo unless it really is Signed-off-by: Hal Rosenstock Index: opensm/osm_port_info_rcv.c =================================================================== --- opensm/osm_port_info_rcv.c (revision 7961) +++ opensm/osm_port_info_rcv.c (working copy) @@ -683,6 +683,8 @@ osm_pi_rcv_process( p_context = osm_madw_get_pi_context_ptr( p_madw ); p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp ); + CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_PORT_INFO ); + /* On receipt of client reregister, clear the reregister bit so reregistering won't be sent again and again */ if ( ib_port_info_get_client_rereg( p_pi ) ) @@ -698,8 +700,6 @@ osm_pi_rcv_process( port_guid = p_context->port_guid; node_guid = p_context->node_guid; - CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_PORT_INFO ); - osm_dump_port_info( p_rcv->p_log, node_guid, port_guid, port_num, p_pi, OSM_LOG_DEBUG); From mst at mellanox.co.il Wed Jun 14 22:12:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 08:12:55 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> References: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> Message-ID: <20060615051255.GA11911@mellanox.co.il> Quoting r. Sean Hefty : > A multithreaded application could see the ACK before the request. Yes, this is a problem. 
-- MST From betsy at pathscale.com Wed Jun 14 22:13:42 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Wed, 14 Jun 2006 22:13:42 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F73FBD@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F73FBD@orsmsx408> Message-ID: <1150348422.3425.145.camel@sarium.pathscale.com> On Wed, 2006-06-14 at 09:56 -0700, Woodruff, Robert J wrote: > > I ran a modified netpipe over SDP and it hung somewhere around size > > 4k. > It was on a production Lindenhurst Xeon system. This works fine with the > > Mellanox cards. Hmm - netpipe on SDP ran fine for us, as well. We tried it on RHEL4, FC4, and SLES10 RC2. > > I also had problems with uDAPL over pathscale (and thus Intel MPI) and > suspect problems with > RDMA operations. I did not have time to debug it any further. > Were you able to get perftest running, We also ran perftest. > as Arlin suggested to your developers a couple of weeks back ? > > Right now, I had to pull the pathscale cards to complete regression > testing of 1.0-pre1 with Intel MPI and since pathscale does not > work with Intel MPI, I put the Mellanox cards back in. We hope to get PathScale working Intel MPI in the near future. If you could send any error messages, or a description of behavior, that would be very useful. - Betsy -- Betsy Zeller Director of Software Engineering QLogic Corporation System Interconnect Group (formerly PathScale, Inc) 2071 Stierlin Court, Suite 200 Mountain View, CA, 94043 1-650-934-8088 From glebn at voltaire.com Wed Jun 14 22:19:12 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Thu, 15 Jun 2006 08:19:12 +0300 Subject: [openib-general] MPI error when using a "system" call in mpi job. 
In-Reply-To: <20060614095958.59c7dcc7.weiny2@llnl.gov> References: <20060613171147.35787125.weiny2@llnl.gov> <44900970.9050006@cse.ohio-state.edu> <20060614095958.59c7dcc7.weiny2@llnl.gov> Message-ID: <20060615051912.GI17758@minantech.com> On Wed, Jun 14, 2006 at 09:59:58AM -0700, Ira Weiny wrote: > We are on a modified RedHat RHEL4 kernel. Roughly 2.6.9. :-( > This is known problen with kernel 2.6.9. > I am going to try a 2.6.16 kernel I have built to see if it changes. > This will work with system(), but not with fork(). > Ira > > > On Wed, 14 Jun 2006 09:04:48 -0400 > Sayantan Sur wrote: > > > Hello Ira, > > > > I am running the program on 2.6.15 (EM64T machine) and 2.6.16 (IA32 > > machine). The program seems to be running fine. Can you tell us which > > kernel you are using? We are using drivers pulled out of the trunk > > about 3-4 weeks back. > > > > Thanks, > > Sayantan. > > > > Ira Weiny wrote: > > > > >A co-worker here was seeing the following MPI error from his job: > > > > > >[1] Abort: [ldev2:1] Got completion with error, code=1 > > > at line 2148 in file viacheck.c > > > > > >After some tracking down he found that apparently if he used a > > >"system" call [int system(const char *string)] the next MPI command > > >will fail. > > > > > >I have been able to reproduce this with the attached simple "hello" > > >program. > > > > > >Perhaps someone has seen this type of error? Here is the output > > >from 2 runs: > > > > > >weiny2 at ldev0:~/ior-test > > >17:04:04 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello x > > >ldev1 > > >[0] Abort: [ldev1:0] Got completion with error, code=1 > > > at line 2148 in file viacheck.c > > >ldev2 > > >mpirun_rsh: Abort signaled from [0] > > >done. 
> > >weiny2 at ldev0:~/ior-test > > >17:05:23 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello > > >now = 0.000000 > > >now = 0.000052 > > >now = 0.000094 > > >now = 0.000121 > > >now = 0.000151 > > >now = 0.001072 > > >now = 0.001102 > > >now = 0.001118 > > >now = 0.001141 > > >now = 0.001160 > > >done. > > > > > >We are running mvapich 0.9.7 and the openib trunk rev 6829. > > > > > >Thanks, > > >Ira > > > > > > > > > > > >------------------------------------------------------------------------ > > > > > >_______________________________________________ > > >openib-general mailing list > > >openib-general at openib.org > > >http://openib.org/mailman/listinfo/openib-general > > > > > >To unsubscribe, please visit > > >http://openib.org/mailman/listinfo/openib-general > > > > > > > -- > > http://www.cse.ohio-state.edu/~surs > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Gleb. From eitan at mellanox.co.il Wed Jun 14 23:02:07 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 15 Jun 2006 09:02:07 +0300 Subject: [openib-general] [PATCH] [MINOR} OpenSM/osm_port_info_rcv.c: Move assert to beforewhere PortInfo is assumed Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236885F@mtlexch01.mtl.com> Makes perfect sense. 
> -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, June 15, 2006 7:11 AM > To: openib-general at openib.org > Cc: Eitan Zahavi > Subject: [PATCH] [MINOR} OpenSM/osm_port_info_rcv.c: Move assert to > beforewhere PortInfo is assumed > > OpenSM/osm_port_info_rcv.c: Move assert to before where PortInfo is > assumed as shouldn't be processing as PortInfo unless it really is > > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_port_info_rcv.c > =================================================================== > --- opensm/osm_port_info_rcv.c (revision 7961) > +++ opensm/osm_port_info_rcv.c (working copy) > @@ -683,6 +683,8 @@ osm_pi_rcv_process( > p_context = osm_madw_get_pi_context_ptr( p_madw ); > p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp ); > > + CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_PORT_INFO ); > + > /* On receipt of client reregister, clear the reregister bit so > reregistering won't be sent again and again */ > if ( ib_port_info_get_client_rereg( p_pi ) ) > @@ -698,8 +700,6 @@ osm_pi_rcv_process( > port_guid = p_context->port_guid; > node_guid = p_context->node_guid; > > - CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_PORT_INFO ); > - > osm_dump_port_info( > p_rcv->p_log, node_guid, port_guid, port_num, p_pi, OSM_LOG_DEBUG); > > From ogerlitz at voltaire.com Thu Jun 15 01:26:22 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 Jun 2006 11:26:22 +0300 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <44903D5D.10102@ichips.intel.com> References: <44903D5D.10102@ichips.intel.com> Message-ID: <449119AE.2010703@voltaire.com> Sean Hefty wrote: > James Lentini wrote: >> The IBTA spec (volume 1, version 1.2) describes a communication >> established affiliated asynchronous event. >> We've seen this event delivered to our NFS-RDMA server and aren't sure >> what to do with it. 
> This event is delivered to the verbs consumer, since it occurs on the QP. It's > expected that the consumer will call ib_cm_establish. Although, I would guess > that you can probably ignore the event, under the assumption that the RTU will > eventually be received by the local CM. Sean, The cma/verbs consumer can't just ignore the event since its qp state is still RTR which means an attempt to tx replying the rx would fail. On the other hand it can't call ib_cm_establish since the CMA does not expose an API for that, nor the CM can register a cb to get this event and emulate an RTU reception since the CMA is the one to create the QP and the CMA consumer providing the qp_init_attr along with event handler... I suggest the following design: the CMA would replace the event handler provided with the qp_init_attr struct with a callback of its own and keep the original handler/context on a private structure. On the delivery of IB_EVENT_COMM_EST event, the CMA would call down the CM to emulate RTU reception (ib_cm_establish) and then call up the consumer original handler, typical CMA consumers would just ignore this event, i think. The CM should be able to allow ib_cm_established to be called in the context over which the event handler is called (or jump the treatment to higher context). The CM must also ignore the actual RTU if it arrives later/in parallel to when ib_cm_establish was called. By this design the verbs consumer is guaranteed to always get RDMA_CM_EVENT_ESTABLISHED no matter if the RTU is just late or never arrives but it still can get a CQ RX completion(s) before getting the CMA established event; in that case it can queue these completion elements for the short time window before the established event arrives and then process them. A design similar to that was implemented at the Voltaire gen1 stack and it works in production with iSER target and VIBNAL (CFS Lustre NAL for voltaire gen1 ib) server side. 
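[Editor's note] The proxy design described above can be modeled in a few lines of plain C (all names here are hypothetical; the real code would live in the CMA and call `ib_cm_establish()` toward the IB CM): the CMA saves the consumer's QP event handler at QP creation, installs its own handler in its place, reacts to the COMM_EST event itself, and then forwards every event to the original handler.

```c
#include <assert.h>

enum qp_event { EVENT_COMM_EST, EVENT_PATH_MIG };

/* Consumer handler/context saved by the "CMA" at QP creation time. */
struct qp_ctx {
    void (*orig_handler)(enum qp_event ev, void *context);
    void *orig_context;
};

static int establish_calls;   /* counts emulated RTU receptions */
static int consumer_events;   /* counts events seen by the consumer */

static void fake_cm_establish(void)
{
    establish_calls++;        /* stands in for ib_cm_establish() */
}

static void consumer_handler(enum qp_event ev, void *context)
{
    (void)ev;
    (void)context;
    consumer_events++;        /* typical consumers just note the event */
}

/* The handler the CMA installs in place of the consumer's. */
static void cma_qp_event_handler(enum qp_event ev, struct qp_ctx *qp)
{
    if (ev == EVENT_COMM_EST)
        fake_cm_establish();  /* emulate RTU reception first... */
    /* ...then always forward to the consumer's original handler. */
    qp->orig_handler(ev, qp->orig_context);
}
```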
Does anyone know on what context (hard_irq, soft_irq, thread) are the event handlers being called? Or. From ogerlitz at voltaire.com Thu Jun 15 01:31:46 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 Jun 2006 11:31:46 +0300 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <449119AE.2010703@voltaire.com> References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com> Message-ID: <44911AF2.4060800@voltaire.com> Or Gerlitz wrote: > I suggest the following design: the CMA would replace the event handler > provided with the qp_init_attr struct with a callback of its own and > keep the original handler/context on a private structure. > > On the delivery of IB_EVENT_COMM_EST event, the CMA would call down the > CM to emulate RTU reception (ib_cm_establish) and then call up the > consumer original handler, typical CMA consumers would just ignore this > event, i think. and on other qp affiliated events the CMA would just call up the consumer callback. This proxy-ing of qp events can help us down the road to add support for path migration in the CMA. Or. From tziporet at mellanox.co.il Thu Jun 15 03:51:17 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 15 Jun 2006 13:51:17 +0300 Subject: [openib-general] Maintainers List Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA723B@mtlexch01.mtl.com> Usually you can see the owners in bugzilla Tziporet -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Rimmer, Todd Sent: Wednesday, June 14, 2006 4:55 PM To: openib-general at openib.org Subject: [openib-general] Maintainers List Is there a convenient list of the maintainers for all the various OFED components? 
Thanks, Todd Rimmer _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Jun 15 04:06:17 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 Jun 2006 14:06:17 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <86odwxgqrs.fsf@mtl066.yok.mtl.com> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> Message-ID: <20060615110617.GA21560@sashak.voltaire.com> Hi Eitan, Some comments about the patch. Personally I'm glad to see that you are using tab instead of spaces as identaion character. But it would be nice if next time you will not mix the functional changes and identaion fixes in the same patch, but instead will provide two different patches. Also it would be nice if your identation fixes will cover whole file(s) and not just selected lines. The same is about massive code moving, the patch separation may simplify review. The rest is below. On 15:54 Tue 13 Jun , Eitan Zahavi wrote: > --text follows this line-- > Hi Hal > > This is a second take after debug and cleanup of the partition manager > patch I have previously provided. The functionality is the same but > this one is after 2 days of testing on the simulator. > I also did some code restructuring for clarity. > > Tests passed were both dedicated pkey enforcements (pkey.*) and > stress test (osmStress.*) > > As I started to test the partition manager code (using ibmgtsim pkey test), > I realized the implementation does not really enforces the partition policy > on the given fabric. This patch fixes that. It was verified using the > simulation test. Several other corner cases were fixed too. 
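[Editor's note] For readers following the const question, the two options can be shown with simplified stand-ins (not the real OpenSM types): C cannot overload on const, so either the existing accessor drops its const qualifier, or a second non-const accessor is added next to it, as the patch does.

```c
#include <assert.h>

typedef struct {
    int pkey_tbl;   /* stand-in for the real P_Key table object */
} physp_t;

/* Option A (the patch): keep the const accessor for readers and add
 * a separate accessor for callers that need to modify the table. */
static const int *get_pkey_tbl(const physp_t *p)
{
    return &p->pkey_tbl;
}

static int *get_mod_pkey_tbl(physp_t *p)
{
    return &p->pkey_tbl;
}

/* Option B (the review suggestion): one accessor without const; it
 * serves both kinds of caller but no longer documents read-only
 * intent in the type system. */
static int *get_pkey_tbl_mut(physp_t *p)
{
    return &p->pkey_tbl;
}
```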
> > Eitan > > Signed-off-by: Eitan Zahavi > > Index: include/opensm/osm_port.h > =================================================================== > --- include/opensm/osm_port.h (revision 7867) > +++ include/opensm/osm_port.h (working copy) > @@ -586,6 +586,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy > * Port, Physical Port > *********/ > > +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl > +* NAME > +* osm_physp_get_mod_pkey_tbl > +* > +* DESCRIPTION > +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. > +* > +* SYNOPSIS > +*/ > +static inline osm_pkey_tbl_t * > +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) > +{ > + CL_ASSERT( osm_physp_is_valid( p_physp ) ); > + /* > + (14.2.5.7) - the block number valid values are 0-2047, and are further > + limited by the size of the P_Key table specified by the PartitionCap on the node. > + */ > + return( &p_physp->pkeys ); > +}; > +/* > +* PARAMETERS > +* p_physp > +* [in] Pointer to an osm_physp_t object. > +* > +* RETURN VALUES > +* The pointer to the P_Key table object. > +* > +* NOTES > +* > +* SEE ALSO > +* Port, Physical Port > +*********/ > + Is not this simpler to remove 'const' from existing osm_physp_get_pkey_tbl() function instead of using new one? 
> /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl > * NAME > * osm_physp_set_slvl_tbl > Index: include/opensm/osm_pkey.h > =================================================================== > --- include/opensm/osm_pkey.h (revision 7867) > +++ include/opensm/osm_pkey.h (working copy) > @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl > cl_ptr_vector_t blocks; > cl_ptr_vector_t new_blocks; > cl_map_t keys; > + cl_qlist_t pending; > + uint16_t used_blocks; > + uint16_t max_blocks; > } osm_pkey_tbl_t; > /* > * FIELDS > @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl > * keys > * A set holding all keys > * > +* pending > +* A list osm_pending_pkey structs that is temporarily set by the > +* pkey mgr and used during pkey mgr algorithm only > +* > +* used_blocks > +* Tracks the number of blocks having non-zero pkeys > +* > +* max_blocks > +* The maximal number of blocks this partition table might hold > +* this value is based on node_info (for port 0 or CA) or switch_info > +* updated on receiving the node_info or switch_info GetResp > +* > * NOTES > * 'blocks' vector should be used to store pkey values obtained from > * the port and SM pkey manager should not change it directly, for this > @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl > * > *********/ > > +/****s* OpenSM: osm_pending_pkey_t > +* NAME > +* osm_pending_pkey_t > +* > +* DESCRIPTION > +* This objects stores temporary information on pkeys their target block and index > +* during the pkey manager operation > +* > +* SYNOPSIS > +*/ > +typedef struct _osm_pending_pkey { > + cl_list_item_t list_item; > + uint16_t pkey; > + uint32_t block; > + uint8_t index; > + boolean_t is_new; > +} osm_pending_pkey_t; > +/* > +* FIELDS > +* pkey > +* The actual P_Key > +* > +* block > +* The block index based on the previous table extracted from the device > +* > +* index > +* The index of the pky within the block > +* > +* is_new > +* TRUE for new P_Keys such that the block and index are invalid in that case > +* > 
+*********/ > + > /****f* OpenSM: osm_pkey_tbl_construct > * NAME > * osm_pkey_tbl_construct > @@ -209,8 +257,8 @@ osm_pkey_tbl_get_num_blocks( > static inline ib_pkey_table_t *osm_pkey_tbl_block_get( > const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) > { > - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); > - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); > + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? > + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); > }; > /* > * p_pkey_tbl > @@ -244,6 +292,106 @@ static inline ib_pkey_table_t *osm_pkey_ > /* > *********/ > > + > +/****f* OpenSM: osm_pkey_tbl_make_block_pair > +* NAME > +* osm_pkey_tbl_make_block_pair > +* > +* DESCRIPTION > +* Find or create a pair of "old" and "new" blocks for the > +* given block index > +* > +* SYNOPSIS > +*/ > +int osm_pkey_tbl_make_block_pair( > + osm_pkey_tbl_t *p_pkey_tbl, > + uint16_t block_idx, > + ib_pkey_table_t **pp_old_block, > + ib_pkey_table_t **pp_new_block); > +/* > +* p_pkey_tbl > +* [in] Pointer to the PKey table > +* > +* block_idx > +* [in] The block index to use > +* > +* pp_old_block > +* [out] Pointer to the old block pointer arg > +* > +* pp_new_block > +* [out] Pointer to the new block pointer arg > +* > +* RETURN VALUES > +* 0 if OK 1 if failed It is better (conventional) to use -1 as failure return status. 
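The hunk above replaces a `CL_ASSERT` on the block index with a bounds check that returns NULL, which matters because asserts are typically compiled out of release builds while an out-of-range `cl_ptr_vector_get` is not. A small sketch of the pattern (generic pointer vector, not the real complib call):

```c
#include <assert.h>
#include <stddef.h>

/* Bounds-checked accessor in the spirit of the revised
 * osm_pkey_tbl_block_get(): out-of-range lookups yield NULL rather
 * than relying on an assert that vanishes in release builds. */
static void *ptr_vector_get_checked(void *const *vec, size_t size, size_t idx)
{
    return (idx < size) ? vec[idx] : NULL;
}
```

Callers then treat NULL as "block not present", which the later `osm_pkey_find_next_free_entry` code does explicitly.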
> +* > +*********/ > + > +/****f* OpenSM: osm_pkey_tbl_set_new_entry > +* NAME > +* osm_pkey_tbl_set_new_entry > +* > +* DESCRIPTION > +* stores the given pkey in the "new" blocks array and update > +* the "map" to show that on the "old" blocks > +* > +* SYNOPSIS > +*/ > +int > +osm_pkey_tbl_set_new_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t block_idx, > + IN uint8_t pkey_idx, > + IN uint16_t pkey); > +/* > +* p_pkey_tbl > +* [in] Pointer to the PKey table > +* > +* block_idx > +* [in] The block index to use > +* > +* pkey_idx > +* [in] The index within the block > +* > +* pkey > +* [in] PKey to store > +* > +* RETURN VALUES > +* 0 if OK 1 if failed Ditto > +* > +*********/ > + > +/****f* OpenSM: osm_pkey_find_next_free_entry > +* NAME > +* osm_pkey_find_next_free_entry > +* > +* DESCRIPTION > +* Find the next free entry in the PKey table. Starting at the given > +* index and block number. The user should increment pkey_idx before > +* next call > +* Inspect the "new" blocks array for empty space. > +* > +* SYNOPSIS > +*/ > +boolean_t > +osm_pkey_find_next_free_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + OUT uint16_t *p_block_idx, > + OUT uint8_t *p_pkey_idx); > +/* > +* p_pkey_tbl > +* [in] Pointer to the PKey table > +* > +* p_block_idx > +* [out] The block index to use > +* > +* p_pkey_idx > +* [out] The index within the block to use > +* > +* RETURN VALUES > +* TRUE if found FALSE if did not find > +* > +*********/ > + > /****f* OpenSM: osm_pkey_tbl_sync_new_blocks > * NAME > * osm_pkey_tbl_sync_new_blocks > @@ -263,9 +411,44 @@ void osm_pkey_tbl_sync_new_blocks( > * > *********/ > > +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx > +* NAME > +* osm_pkey_tbl_get_block_and_idx > +* > +* DESCRIPTION > +* set the block index and pkey index the given > +* pkey is found in. 
return 1 if cound not find > +* it, 0 if OK > +* > +* SYNOPSIS > +*/ > +int > +osm_pkey_tbl_get_block_and_idx( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t *p_pkey, > + OUT uint32_t *block_idx, > + OUT uint8_t *pkey_index); > +/* > +* p_pkey_tbl > +* [in] Pointer to osm_pkey_tbl_t object. > +* > +* p_pkey > +* [in] Pointer to the P_Key entry searched > +* > +* p_block_idx > +* [out] Pointer to the block index to be updated > +* > +* p_pkey_idx > +* [out] Pointer to the pkey index (in the block) to be updated > +* > +* > +* NOTES > +* > +*********/ > + > /****f* OpenSM: osm_pkey_tbl_set > * NAME > * osm_pkey_tbl_set > Index: opensm/osm_pkey.c > =================================================================== > --- opensm/osm_pkey.c (revision 7904) > +++ opensm/osm_pkey.c (working copy) > @@ -100,6 +100,9 @@ int osm_pkey_tbl_init( > cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); > cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); > cl_map_init( &p_pkey_tbl->keys, 1 ); > + cl_qlist_init( &p_pkey_tbl->pending ); > + p_pkey_tbl->used_blocks = 0; > + p_pkey_tbl->max_blocks = 0; > return(IB_SUCCESS); > } > > @@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( > p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); > if ( b < new_blocks ) > p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); > - else { > + else > + { > p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); > if (!p_new_block) > break; > + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, > + b, p_new_block); > + } > + > memset(p_new_block, 0, sizeof(*p_new_block)); > - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); > } > - memcpy(p_new_block, p_block, sizeof(*p_new_block)); > +} You changed this function so it does not do any sync anymore. Should function name be changed too? 
> + > +/********************************************************************** > + **********************************************************************/ > +void osm_pkey_tbl_cleanup_pending( > + IN osm_pkey_tbl_t *p_pkey_tbl) > +{ > + cl_list_item_t *p_item; > + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); > + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) > + { > + free( (osm_pending_pkey_t *)p_item ); > } > } > > @@ -202,6 +220,138 @@ int osm_pkey_tbl_set( > > /********************************************************************** > **********************************************************************/ > +int osm_pkey_tbl_make_block_pair( > + osm_pkey_tbl_t *p_pkey_tbl, > + uint16_t block_idx, > + ib_pkey_table_t **pp_old_block, > + ib_pkey_table_t **pp_new_block) > +{ > + if (block_idx >= p_pkey_tbl->max_blocks) return 1; > + > + if (pp_old_block) > + { > + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); > + if (! *pp_old_block) > + { > + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); > + if (!*pp_old_block) return 1; > + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); > + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); > + } > + } > + > + if (pp_new_block) > + { > + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); > + if (! 
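As quoted in the mail, the `osm_pkey_tbl_cleanup_pending` loop frees `p_item` but never fetches the next head inside the loop body, so it would spin on a stale pointer; this may just be a line lost to mail wrapping, but a drain-and-free loop needs the re-fetch. A self-contained sketch (plain malloc'd singly linked list, not the cl_qlist API):

```c
#include <assert.h>
#include <stdlib.h>

struct node { struct node *next; int payload; };

/* Drain-and-free: the successor must be captured each iteration,
 * before free(), otherwise the loop body touches a dangling pointer. */
static int drain(struct node **head)
{
    int freed = 0;
    struct node *n = *head;            /* "remove head" */
    while (n != NULL) {
        struct node *next = n->next;   /* fetch successor BEFORE free */
        free(n);
        freed++;
        n = next;                      /* advance, unlike the quoted loop */
    }
    *head = NULL;
    return freed;
}
```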
*pp_new_block) > + { > + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); > + if (!*pp_new_block) return 1; > + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); > + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); > + } > + } > + return 0; > +} > + > +/********************************************************************** > + **********************************************************************/ > +/* > + store the given pkey in the "new" blocks array and update the "map" > + to show that on the "old" blocks > +*/ > +int > +osm_pkey_tbl_set_new_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t block_idx, > + IN uint8_t pkey_idx, > + IN uint16_t pkey) > +{ > + ib_pkey_table_t *p_old_block; > + ib_pkey_table_t *p_new_block; > + > + if (osm_pkey_tbl_make_block_pair( > + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > + return 1; > + > + cl_map_insert( &p_pkey_tbl->keys, > + ib_pkey_get_base(pkey), > + &(p_old_block->pkey_entry[pkey_idx])); Here you map potentially empty pkey entry. Why? "old block" will be remapped anyway on pkey receiving. Actually I don't see why you want this pretty tricky and pkey_mgr specific procedure as generic function. 
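A side note on the `used_blocks` update just below: elsewhere in the patch (`pkey_mgr_update_peer_port`) `used_blocks` is used as a loop bound, i.e. a count, yet `osm_pkey_tbl_set_new_entry` stores the bare block index, which undercounts by one. Assuming count semantics, the high-water update would look like:

```c
#include <assert.h>
#include <stdint.h>

/* High-water mark of blocks in use. If used_blocks is a COUNT, a write
 * into block_idx means at least block_idx + 1 blocks are occupied; the
 * quoted patch stores the bare index, which is one short when the value
 * is later used as a loop bound. (Assumption: count semantics.) */
static void note_block_used(uint16_t *used_blocks, uint16_t block_idx)
{
    if (*used_blocks < (uint16_t)(block_idx + 1))
        *used_blocks = (uint16_t)(block_idx + 1);
}
```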
> + p_new_block->pkey_entry[pkey_idx] = pkey; > + if (p_pkey_tbl->used_blocks < block_idx) > + p_pkey_tbl->used_blocks = block_idx; > + > + return 0; > +} > + > +/********************************************************************** > + **********************************************************************/ > +boolean_t > +osm_pkey_find_next_free_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + OUT uint16_t *p_block_idx, > + OUT uint8_t *p_pkey_idx) > +{ > + ib_pkey_table_t *p_new_block; > + > + CL_ASSERT(p_block_idx); > + CL_ASSERT(p_pkey_idx); > + > + while ( *p_block_idx < p_pkey_tbl->max_blocks) > + { > + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) > + { > + *p_pkey_idx = 0; > + (*p_block_idx)++; > + if (*p_block_idx >= p_pkey_tbl->max_blocks) > + return FALSE; > + } > + > + p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); > + > + if ( !p_new_block || > + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) > + return TRUE; > + else > + (*p_pkey_idx)++; > + } > + return FALSE; > +} > + > +/********************************************************************** > + **********************************************************************/ > +int > +osm_pkey_tbl_get_block_and_idx( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t *p_pkey, > + OUT uint32_t *p_block_idx, > + OUT uint8_t *p_pkey_index) > +{ > + uint32_t num_of_blocks; > + uint32_t block_index; > + ib_pkey_table_t *block; > + > + CL_ASSERT( p_pkey_tbl ); > + CL_ASSERT( p_block_idx != NULL ); > + CL_ASSERT( p_pkey_idx != NULL ); Why last two CL_ASSERTs? What should be problem with uninitialized pointers here? 
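The `osm_pkey_find_next_free_entry` scan above is resumable: the block/index cursors are in/out parameters, and per the header comment the caller bumps the pkey index past a returned hit before the next call. A flat sketch of the same cursor discipline over a small 2D table (sizes shrunk from the real 32-entry IB blocks for the demo):

```c
#include <assert.h>
#include <stdint.h>

#define N_BLOCKS 3
#define N_SLOTS  4   /* real IB pkey blocks hold 32 entries */

/* Resumable scan for the next zero (free) slot, mirroring
 * osm_pkey_find_next_free_entry(): cursors persist across calls, and
 * the caller advances *slot past a returned hit before calling again. */
static int find_next_free(const uint16_t tbl[N_BLOCKS][N_SLOTS],
                          uint16_t *block, uint8_t *slot)
{
    while (*block < N_BLOCKS) {
        if (*slot >= N_SLOTS) {        /* wrap to next block */
            *slot = 0;
            (*block)++;
            continue;
        }
        if (tbl[*block][*slot] == 0)
            return 1;                  /* free slot at (*block, *slot) */
        (*slot)++;
    }
    return 0;                          /* table exhausted */
}
```

Note the reviewer's warning later in the thread: the real function takes `uint8_t *` for the pkey index, so passing a `uint16_t *` cursor (as `pkey_mgr_update_port` does) should draw a compiler warning.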
> + > + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); > + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + { > + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > + if ( ( block->pkey_entry <= p_pkey ) && > + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) > + { > + *p_block_idx = block_index; > + *p_pkey_index = p_pkey - block->pkey_entry; > + return 0; > + } > + } > + return 1; > +} > + > +/********************************************************************** > + **********************************************************************/ > static boolean_t __osm_match_pkey ( > IN const ib_net16_t *pkey1, > IN const ib_net16_t *pkey2 ) { > @@ -305,7 +455,8 @@ osm_physp_share_pkey( > if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) > return TRUE; > > - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); > + return > + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); > } > > /********************************************************************** > @@ -321,7 +472,8 @@ osm_port_share_pkey( > > OSM_LOG_ENTER( p_log, osm_port_share_pkey ); > > - if (!p_port_1 || !p_port_2) { > + if (!p_port_1 || !p_port_2) > + { > ret = FALSE; > goto Exit; > } > @@ -329,7 +481,8 @@ osm_port_share_pkey( > p_physp1 = osm_port_get_default_phys_ptr(p_port_1); > p_physp2 = osm_port_get_default_phys_ptr(p_port_2); > > - if (!p_physp1 || !p_physp2) { > + if (!p_physp1 || !p_physp2) > + { > ret = FALSE; > goto Exit; > } > Index: opensm/osm_pkey_mgr.c > =================================================================== > --- opensm/osm_pkey_mgr.c (revision 7904) > +++ opensm/osm_pkey_mgr.c (working copy) > @@ -62,6 +62,139 @@ > > /********************************************************************** > **********************************************************************/ > +/* > + the max number of pkey blocks for a physical port is located in > + different 
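`osm_pkey_tbl_get_block_and_idx` locates the owning block by range-testing the entry pointer against each block's array and recovering the slot by pointer subtraction, exactly as the hunk above shows. (Also note that, as quoted, the asserts name `p_pkey_idx` while the parameter is `p_pkey_index`, which would not compile.) A self-contained sketch of the pointer-range lookup:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 4

/* Which block owns this entry pointer? Mirrors the range test in
 * osm_pkey_tbl_get_block_and_idx(): base <= entry < base + SLOTS.
 * Relies on each block being one contiguous array, as in the original. */
static int locate_entry(uint16_t *const blocks[], size_t n_blocks,
                        const uint16_t *entry,
                        size_t *block_idx, size_t *slot_idx)
{
    for (size_t b = 0; b < n_blocks; b++) {
        if (entry >= blocks[b] && entry < blocks[b] + SLOTS) {
            *block_idx = b;
            *slot_idx = (size_t)(entry - blocks[b]);
            return 0;                  /* found */
        }
    }
    return 1;                          /* not in any block */
}
```

Strictly, relational comparison of pointers into different arrays is unspecified territory in ISO C; the quoted OpenSM code makes the same assumption and it holds on flat-address-space targets.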
place for switch external ports (SwitchInfo) and the > + rest of the ports (NodeInfo) > +*/ > +static int pkey_mgr_get_physp_max_blocks( I would suggest to add _cap_ to function name. Not too much critical since it is static function. > + IN const osm_subn_t *p_subn, > + IN const osm_physp_t *p_physp) > +{ > + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); > + osm_switch_t *p_sw; > + uint16_t num_pkeys = 0; > + > + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || > + (osm_physp_get_port_num( p_physp ) == 0)) > + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); > + else > + { > + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); > + if (p_sw) > + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); > + } > + return( (num_pkeys + 31) / 32 ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +/* > + * Insert the new pending pkey entry to the specific port pkey table > + * pending pkeys. new entries are inserted at the back. > + */ > +static void pkey_mgr_process_physical_port( > + IN osm_log_t *p_log, > + IN const osm_req_t *p_req, > + IN const ib_net16_t pkey, > + IN osm_physp_t *p_physp ) > +{ > + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); > + osm_pkey_tbl_t *p_pkey_tbl; > + ib_net16_t *p_orig_pkey; > + char *stat = NULL; > + osm_pending_pkey_t *p_pending; > + > + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > + if (! p_pkey_tbl) ^^^^^^^^^^^^^ Is it possible? > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_process_physical_port: ERR 0501: " > + "No pkey table found for node " > + "0x%016" PRIx64 " port %u\n", > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + return; > + } > + > + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); > + if (! 
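The `(num_pkeys + 31) / 32` at the end of `pkey_mgr_get_physp_max_blocks` is the usual add-then-truncate ceiling division: PartitionCap (or enforce_cap) counts pkeys, and each PKeyTable block holds 32 of them. Isolated for clarity:

```c
#include <assert.h>
#include <stdint.h>

/* Blocks needed to hold num_pkeys entries at 32 pkeys per block:
 * integer ceiling division via the add-(divisor-1)-then-divide idiom,
 * as in pkey_mgr_get_physp_max_blocks(). */
static uint16_t pkey_blocks_needed(uint16_t num_pkeys)
{
    return (uint16_t)((num_pkeys + 31) / 32);
}
```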
p_pending) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_process_physical_port: ERR 0502: " > + "Fail to allocate new pending pkey entry for node " > + "0x%016" PRIx64 " port %u\n", > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + return; > + } > + p_pending->pkey = pkey; > + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); > + if ( !p_orig_pkey || > + (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) )) There the cases of new pkey and updated pkey membership is mixed. Why? > + { > + p_pending->is_new = TRUE; > + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); > + stat = "inserted"; > + } > + else > + { > + p_pending->is_new = FALSE; > + if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, > + &p_pending->block, &p_pending->index)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AFAIK in this function there were CL_ASSERTs which check for uinitialized pointers. > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_process_physical_port: ERR 0503: " > + "Fail to obtain P_Key 0x%04x block and index for node " > + "0x%016" PRIx64 " port %u\n", > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + return; > + } > + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); > + stat = "updated"; Is it will be updated? It is likely "already there" case. No? Also in this case you can already put the pkey in new_block instead of holding it in pending list. Then later you will only need to add new pkeys. This may simplify the flow and even save some mem. 
> + } > + > + osm_log( p_log, OSM_LOG_DEBUG, > + "pkey_mgr_process_physical_port: " > + "pkey 0x%04x was %s for node 0x%016" PRIx64 > + " port %u\n", > + cl_ntoh16( pkey ), stat, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +static void > +pkey_mgr_process_partition_table( > + osm_log_t *p_log, > + const osm_req_t *p_req, > + const osm_prtn_t *p_prtn, > + const boolean_t full ) > +{ > + const cl_map_t *p_tbl = full ? > + &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; > + cl_map_iterator_t i, i_next; > + ib_net16_t pkey = p_prtn->pkey; > + osm_physp_t *p_physp; > + > + if ( full ) > + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); > + > + i_next = cl_map_head( p_tbl ); > + while ( i_next != cl_map_end( p_tbl ) ) > + { > + i = i_next; > + i_next = cl_map_next( i ); > + p_physp = cl_map_obj( i ); > + if ( p_physp && osm_physp_is_valid( p_physp ) ) > + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); > + } > +} > + > +/********************************************************************** > + **********************************************************************/ > static ib_api_status_t > pkey_mgr_update_pkey_entry( > IN const osm_req_t *p_req, > @@ -114,7 +247,8 @@ pkey_mgr_enforce_partition( > p_pi->state_info2 = 0; > ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); > > - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); > + context.pi_context.node_guid = > + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); > context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); > context.pi_context.set_method = TRUE; > context.pi_context.update_master_sm_base_lid = FALSE; > @@ -131,80 +265,132 @@ pkey_mgr_enforce_partition( > > 
/********************************************************************** > **********************************************************************/ > -/* > - * Prepare a new entry for the pkey table for this port when this pkey > - * does not exist. Update existed entry when membership was changed. > - */ > -static void pkey_mgr_process_physical_port( > - IN osm_log_t *p_log, > - IN const osm_req_t *p_req, > - IN const ib_net16_t pkey, > - IN osm_physp_t *p_physp ) > +static boolean_t pkey_mgr_update_port( > + osm_log_t *p_log, > + osm_req_t *p_req, > + const osm_port_t * const p_port ) > { > - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); > - ib_pkey_table_t *block; > + osm_physp_t *p_physp; > + osm_node_t *p_node; > + ib_pkey_table_t *block, *new_block; > + osm_pkey_tbl_t *p_pkey_tbl; > uint16_t block_index; > + uint8_t pkey_index; > + uint16_t last_free_block_index = 0; > + uint16_t last_free_pkey_index = 0; > uint16_t num_of_blocks; > - const osm_pkey_tbl_t *p_pkey_tbl; > - ib_net16_t *p_orig_pkey; > - char *stat = NULL; > - uint32_t i; > + uint16_t max_num_of_blocks; > > - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + ib_api_status_t status; > + boolean_t ret_val = FALSE; > + osm_pending_pkey_t *p_pending; > + boolean_t found; > > - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); > + p_physp = osm_port_get_default_phys_ptr( p_port ); > + if ( !osm_physp_is_valid( p_physp ) ) > + return FALSE; > > - if ( !p_orig_pkey ) > - { > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); > + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) > { > - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > - for ( i = 0; i < 
IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) > + osm_log( p_log, OSM_LOG_INFO, > + "pkey_mgr_update_port: " > + "Max number of blocks reduced from %u to %u " > + "for node 0x%016" PRIx64 " port %u\n", > + p_pkey_tbl->max_blocks, max_num_of_blocks, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + } > + p_pkey_tbl->max_blocks = max_num_of_blocks; > + > + osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); > + cl_map_remove_all( &p_pkey_tbl->keys ); What is the reason to drop map here? AFAIK it will be reinitialized later anyway when pkey blocks will be received. > + p_pkey_tbl->used_blocks = 0; > + > + /* > + process every pending pkey in order - > + first must be "updated" last are "new" > + */ > + p_pending = > + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); > + while (p_pending != > + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) > + { > + if (p_pending->is_new == FALSE) > + { > + block_index = p_pending->block; > + pkey_index = p_pending->index; > + found = TRUE; > + } > + else > { > - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) > + found = osm_pkey_find_next_free_entry(p_pkey_tbl, > + &last_free_block_index, > + &last_free_pkey_index); There should be warning: expected third arg is uint8_t* > + if ( !found ) > { > - block->pkey_entry[i] = pkey; > - stat = "inserted"; > - goto _done; > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_update_port: ERR 0504: " > + "failed to find empty space for new pkey 0x%04x " > + "of node 0x%016" PRIx64 " port %u\n", > + cl_ntoh16(p_pending->pkey), > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > } > + else > + { > + block_index = last_free_block_index; > + pkey_index = last_free_pkey_index++; > } > } > + > + if (found) > + { > + if (osm_pkey_tbl_set_new_entry( > + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) > + { > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_process_physical_port: ERR 0501: " > - 
"No empty pkey entry was found to insert 0x%04x for node " > - "0x%016" PRIx64 " port %u\n", > - cl_ntoh16( pkey ), > + "pkey_mgr_update_port: ERR 0505: " > + "failed to set PKey 0x%04x in block %u idx %u " > + "of node 0x%016" PRIx64 " port %u\n", > + p_pending->pkey, block_index, pkey_index, > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > osm_physp_get_port_num( p_physp ) ); > } > - else if ( *p_orig_pkey != pkey ) > - { > + } > + > + free( p_pending ); > + p_pending = > + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); > + } > + > + /* now look for changes and store */ > for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > { > - /* we need real block (not just new_block) in order > - * to resolve block/pkey indices */ > block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > - i = p_orig_pkey - block->pkey_entry; > - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { > - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > - block->pkey_entry[i] = pkey; > - stat = "updated"; > - goto _done; > - } > - } > - } > + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > > - _done: > - if (stat) { > - osm_log( p_log, OSM_LOG_VERBOSE, > - "pkey_mgr_process_physical_port: " > - "pkey 0x%04x was %s for node 0x%016" PRIx64 > - " port %u\n", > - cl_ntoh16( pkey ), stat, > + if (block && > + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) > + continue; > + > + status = pkey_mgr_update_pkey_entry( > + p_req, p_physp , new_block, block_index ); > + if (status == IB_SUCCESS) > + ret_val = TRUE; > + else > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_update_port: ERR 0506: " > + "pkey_mgr_update_pkey_entry() failed to update " > + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > + block_index, > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > osm_physp_get_port_num( p_physp ) ); > } > + > + return ret_val; > } > > /********************************************************************** > 
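The "now look for changes and store" loop above only issues a PKeyTable Set() when the old and new copies of a block actually differ, gated by `memcmp`; this keeps the SM from re-sending unchanged blocks every sweep. The diff-and-count skeleton, reduced to standalone form:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOTS 4

/* Count blocks whose "new" contents differ from the "old" ones. The
 * pkey manager sends a Set() only for these, via the same memcmp gate
 * used in the quoted update loop. */
static unsigned count_dirty_blocks(const uint16_t (*old_tbl)[SLOTS],
                                   const uint16_t (*new_tbl)[SLOTS],
                                   unsigned n_blocks)
{
    unsigned dirty = 0;
    for (unsigned b = 0; b < n_blocks; b++)
        if (memcmp(old_tbl[b], new_tbl[b], sizeof old_tbl[b]) != 0)
            dirty++;                   /* would trigger an update MAD */
    return dirty;
}
```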
@@ -217,21 +403,23 @@ pkey_mgr_update_peer_port( > const osm_port_t * const p_port, > boolean_t enforce ) > { > - osm_physp_t *p, *peer; > + osm_physp_t *p_physp, *peer; > osm_node_t *p_node; > ib_pkey_table_t *block, *peer_block; > - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; > + const osm_pkey_tbl_t *p_pkey_tbl; > + osm_pkey_tbl_t *p_peer_pkey_tbl; > osm_switch_t *p_sw; > ib_switch_info_t *p_si; > uint16_t block_index; > uint16_t num_of_blocks; > + uint16_t peer_max_blocks; > ib_api_status_t status = IB_SUCCESS; > boolean_t ret_val = FALSE; > > - p = osm_port_get_default_phys_ptr( p_port ); > - if ( !osm_physp_is_valid( p ) ) > + p_physp = osm_port_get_default_phys_ptr( p_port ); > + if ( !osm_physp_is_valid( p_physp ) ) > return FALSE; > - peer = osm_physp_get_remote( p ); > + peer = osm_physp_get_remote( p_physp ); > if ( !peer || !osm_physp_is_valid( peer ) ) > return FALSE; > p_node = osm_physp_get_node_ptr( peer ); > @@ -245,7 +433,7 @@ pkey_mgr_update_peer_port( > if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) > { > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: ERR 0502: " > + "pkey_mgr_update_peer_port: ERR 0507: " > "pkey_mgr_enforce_partition() failed to update " > "node 0x%016" PRIx64 " port %u\n", > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > @@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( > if (enforce == FALSE) > return FALSE; > > - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); > - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); > + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); > + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); > + if (peer_max_blocks < p_pkey_tbl->used_blocks) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + 
"pkey_mgr_update_peer_port: ERR 0508: " > + "not enough entries (%u < %u) on switch 0x%016" PRIx64 > + " port %u\n", > + peer_max_blocks, num_of_blocks, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( peer ) ); > + return FALSE; Do you think it is the best way, just to skip update - partitions are enforced already on the switch. May be better to truncate pkey tables in order to meet peer's capabilities? > + } > > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; > + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) > { > block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); > if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) > { > + osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block); Why this (osm_pkey_tbl_set())? This will be called by receiver. > status = pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); > if ( status == IB_SUCCESS ) > ret_val = TRUE; > else > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: ERR 0503: " > + "pkey_mgr_update_peer_port: ERR 0509: " > "pkey_mgr_update_pkey_entry() failed to update " > "pkey table block %d for node 0x%016" PRIx64 > " port %u\n", > @@ -282,10 +482,10 @@ pkey_mgr_update_peer_port( > } > } > > - if ( ret_val == TRUE && > - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) > + if ( (ret_val == TRUE) && > + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_log, OSM_LOG_VERBOSE, > + osm_log( p_log, OSM_LOG_DEBUG, > "pkey_mgr_update_peer_port: " > "pkey table was updated for node 0x%016" PRIx64 > " port %u\n", > @@ -298,82 +498,6 @@ pkey_mgr_update_peer_port( > > /********************************************************************** > **********************************************************************/ > -static boolean_t pkey_mgr_update_port( > - osm_log_t 
*p_log, > - osm_req_t *p_req, > - const osm_port_t * const p_port ) > -{ > - osm_physp_t *p; > - osm_node_t *p_node; > - ib_pkey_table_t *block, *new_block; > - const osm_pkey_tbl_t *p_pkey_tbl; > - uint16_t block_index; > - uint16_t num_of_blocks; > - ib_api_status_t status; > - boolean_t ret_val = FALSE; > - > - p = osm_port_get_default_phys_ptr( p_port ); > - if ( !osm_physp_is_valid( p ) ) > - return FALSE; > - > - p_pkey_tbl = osm_physp_get_pkey_tbl(p); > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > - > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > - { > - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > - > - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) > - continue; > - > - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); > - if (status == IB_SUCCESS) > - ret_val = TRUE; > - else > - osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_port: ERR 0504: " > - "pkey_mgr_update_pkey_entry() failed to update " > - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > - block_index, > - cl_ntoh64( osm_node_get_node_guid( p_node ) ), > - osm_physp_get_port_num( p ) ); > - } > - > - return ret_val; > -} > - > -/********************************************************************** > - **********************************************************************/ > -static void > -pkey_mgr_process_partition_table( > - osm_log_t *p_log, > - const osm_req_t *p_req, > - const osm_prtn_t *p_prtn, > - const boolean_t full ) > -{ > - const cl_map_t *p_tbl = full ? 
> - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; > - cl_map_iterator_t i, i_next; > - ib_net16_t pkey = p_prtn->pkey; > - osm_physp_t *p_physp; > - > - if ( full ) > - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); > - > - i_next = cl_map_head( p_tbl ); > - while ( i_next != cl_map_end( p_tbl ) ) > - { > - i = i_next; > - i_next = cl_map_next( i ); > - p_physp = cl_map_obj( i ); > - if ( p_physp && osm_physp_is_valid( p_physp ) ) > - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); > - } > -} > - > -/********************************************************************** > - **********************************************************************/ > osm_signal_t > osm_pkey_mgr_process( > IN osm_opensm_t *p_osm ) > @@ -383,8 +507,7 @@ osm_pkey_mgr_process( > osm_prtn_t *p_prtn; > osm_port_t *p_port; > osm_signal_t signal = OSM_SIGNAL_DONE; > - osm_physp_t *p_physp; > - > + osm_node_t *p_node; > CL_ASSERT( p_osm ); > > OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); > @@ -394,32 +517,25 @@ osm_pkey_mgr_process( > if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) > { > osm_log( &p_osm->log, OSM_LOG_ERROR, > - "osm_pkey_mgr_process: ERR 0505: " > + "osm_pkey_mgr_process: ERR 0510: " > "osm_prtn_make_partitions() failed\n" ); > goto _err; > } > > - p_tbl = &p_osm->subn.port_guid_tbl; > - p_next = cl_qmap_head( p_tbl ); > - while ( p_next != cl_qmap_end( p_tbl ) ) > - { > - p_port = ( osm_port_t * ) p_next; > - p_next = cl_qmap_next( p_next ); > - p_physp = osm_port_get_default_phys_ptr( p_port ); > - if ( osm_physp_is_valid( p_physp ) ) > - osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); > - } > - > + /* populate the pending pkey entries by scanning all partitions */ > p_tbl = &p_osm->subn.prtn_pkey_tbl; > p_next = cl_qmap_head( p_tbl ); > while ( p_next != cl_qmap_end( p_tbl ) ) > { > p_prtn = ( osm_prtn_t * ) p_next; > p_next = cl_qmap_next( p_next ); > - pkey_mgr_process_partition_table( &p_osm->log, 
&p_osm->sm.req, p_prtn, FALSE ); > - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); > + pkey_mgr_process_partition_table( > + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); > + pkey_mgr_process_partition_table( > + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); > } > > + /* calculate new pkey tables and set */ > p_tbl = &p_osm->subn.port_guid_tbl; > p_next = cl_qmap_head( p_tbl ); > while ( p_next != cl_qmap_end( p_tbl ) ) > @@ -428,8 +544,10 @@ osm_pkey_mgr_process( > p_next = cl_qmap_next( p_next ); > if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) > signal = OSM_SIGNAL_DONE_PENDING; > - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && > - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, > + p_node = osm_port_get_parent_node( p_port ); > + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && > + pkey_mgr_update_peer_port( > + &p_osm->log, &p_osm->sm.req, > &p_osm->subn, p_port, > !p_osm->subn.opt.no_partition_enforcement ) ) > signal = OSM_SIGNAL_DONE_PENDING; > > Thanks, Sasha From hnbhuvaneshwar at novell.com Thu Jun 15 04:59:59 2006 From: hnbhuvaneshwar at novell.com (Bhuvaneshwar HN) Date: Thu, 15 Jun 2006 05:59:59 -0600 Subject: [openib-general] Bond0 Driver support for IB In-Reply-To: References: Message-ID: <449199170200005F0000ACC8@lucius.provo.novell.com> Hi We were thinking of using Linux Bond0 driver for Load balancing and Fault tolerance for IB, any thoughts on this would be welcome Regards Bhuvi From eitan at mellanox.co.il Thu Jun 15 05:19:44 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 15 Jun 2006 15:19:44 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <20060615110617.GA21560@sashak.voltaire.com> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <20060615110617.GA21560@sashak.voltaire.com> Message-ID: <44915060.6090103@mellanox.co.il> Sasha Khapyorsky wrote: > Hi Eitan, > > Some comments 
about the patch. Thanks for the review. The major point you bring up is that I intentionally impose the result of the pkey settings on the SMDB instead of waiting for the GetResp to do that for me. The idea I had was that once the Pkey Manager calculates the new tables, any SA query that involves PKey matching would use the results immediately. But this actually opens up another, bigger bug: what if the setting failed? The SMDB will not know that on the next sweep and will avoid sending the update. So I think the best approach is to not set anything and rely on the receiver to perform the setting for me. I will make the changes, test them, and send a new patch. > > Personally I'm glad to see that you are using tabs instead of spaces as > the indentation character. But it would be nice if next time you did not mix > the functional changes and indentation fixes in the same patch, and instead > provided two different patches. It would also be nice if your > indentation fixes covered whole file(s) and not just selected lines. > > The same goes for massive code moving; separating the patches may simplify > review. Yes, you are correct about this. I will use this method in the future: a first patch with code changes and a second one with ordering and style changes. > > The rest is below. > > On 15:54 Tue 13 Jun , Eitan Zahavi wrote: > >>Hi Hal >> >>This is a second take after debug and cleanup of the partition manager >>patch I previously provided. The functionality is the same, but >>this one comes after 2 days of testing on the simulator. >>I also did some code restructuring for clarity. >> >>Tests passed were both the dedicated pkey enforcement tests (pkey.*) and >>the stress test (osmStress.*) >> >>As I started to test the partition manager code (using the ibmgtsim pkey test), >>I realized the implementation does not really enforce the partition policy >>on the given fabric. This patch fixes that. It was verified using the >>simulation test. 
Several other corner cases were fixed too. >> >>Eitan >> >>Signed-off-by: Eitan Zahavi >> >>Index: include/opensm/osm_port.h >>=================================================================== >>--- include/opensm/osm_port.h (revision 7867) >>+++ include/opensm/osm_port.h (working copy) >>@@ -586,6 +586,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy >> * Port, Physical Port >> *********/ >> >>+/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl >>+* NAME >>+* osm_physp_get_mod_pkey_tbl >>+* >>+* DESCRIPTION >>+* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. >>+* >>+* SYNOPSIS >>+*/ >>+static inline osm_pkey_tbl_t * >>+osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) >>+{ >>+ CL_ASSERT( osm_physp_is_valid( p_physp ) ); >>+ /* >>+ (14.2.5.7) - the block number valid values are 0-2047, and are further >>+ limited by the size of the P_Key table specified by the PartitionCap on the node. >>+ */ >>+ return( &p_physp->pkeys ); >>+}; >>+/* >>+* PARAMETERS >>+* p_physp >>+* [in] Pointer to an osm_physp_t object. >>+* >>+* RETURN VALUES >>+* The pointer to the P_Key table object. >>+* >>+* NOTES >>+* >>+* SEE ALSO >>+* Port, Physical Port >>+*********/ >>+ > > > Is not this simpler to remove 'const' from existing > osm_physp_get_pkey_tbl() function instead of using new one? There are plenty of const functions using this function internally so I would have need to fix them too. 
> > >> /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl >> * NAME >> * osm_physp_set_slvl_tbl >>Index: include/opensm/osm_pkey.h >>=================================================================== >>--- include/opensm/osm_pkey.h (revision 7867) >>+++ include/opensm/osm_pkey.h (working copy) >>@@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl >> cl_ptr_vector_t blocks; >> cl_ptr_vector_t new_blocks; >> cl_map_t keys; >>+ cl_qlist_t pending; >>+ uint16_t used_blocks; >>+ uint16_t max_blocks; >> } osm_pkey_tbl_t; >> /* >> * FIELDS >>@@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl >> * keys >> * A set holding all keys >> * >>+* pending >>+* A list osm_pending_pkey structs that is temporarily set by the >>+* pkey mgr and used during pkey mgr algorithm only >>+* >>+* used_blocks >>+* Tracks the number of blocks having non-zero pkeys >>+* >>+* max_blocks >>+* The maximal number of blocks this partition table might hold >>+* this value is based on node_info (for port 0 or CA) or switch_info >>+* updated on receiving the node_info or switch_info GetResp >>+* >> * NOTES >> * 'blocks' vector should be used to store pkey values obtained from >> * the port and SM pkey manager should not change it directly, for this >>@@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl >> * >> *********/ >> >>+/****s* OpenSM: osm_pending_pkey_t >>+* NAME >>+* osm_pending_pkey_t >>+* >>+* DESCRIPTION >>+* This objects stores temporary information on pkeys their target block and index >>+* during the pkey manager operation >>+* >>+* SYNOPSIS >>+*/ >>+typedef struct _osm_pending_pkey { >>+ cl_list_item_t list_item; >>+ uint16_t pkey; >>+ uint32_t block; >>+ uint8_t index; >>+ boolean_t is_new; >>+} osm_pending_pkey_t; >>+/* >>+* FIELDS >>+* pkey >>+* The actual P_Key >>+* >>+* block >>+* The block index based on the previous table extracted from the device >>+* >>+* index >>+* The index of the pky within the block >>+* >>+* is_new >>+* TRUE for new P_Keys such that the block and index are 
invalid in that case >>+* >>+*********/ >>+ >> /****f* OpenSM: osm_pkey_tbl_construct >> * NAME >> * osm_pkey_tbl_construct >>@@ -209,8 +257,8 @@ osm_pkey_tbl_get_num_blocks( >> static inline ib_pkey_table_t *osm_pkey_tbl_block_get( >> const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) >> { >>- CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); >>- return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); >>+ return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? >>+ cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); >> }; >> /* >> * p_pkey_tbl >>@@ -244,6 +292,106 @@ static inline ib_pkey_table_t *osm_pkey_ >> /* >> *********/ >> >>+ >>+/****f* OpenSM: osm_pkey_tbl_make_block_pair >>+* NAME >>+* osm_pkey_tbl_make_block_pair >>+* >>+* DESCRIPTION >>+* Find or create a pair of "old" and "new" blocks for the >>+* given block index >>+* >>+* SYNOPSIS >>+*/ >>+int osm_pkey_tbl_make_block_pair( >>+ osm_pkey_tbl_t *p_pkey_tbl, >>+ uint16_t block_idx, >>+ ib_pkey_table_t **pp_old_block, >>+ ib_pkey_table_t **pp_new_block); >>+/* >>+* p_pkey_tbl >>+* [in] Pointer to the PKey table >>+* >>+* block_idx >>+* [in] The block index to use >>+* >>+* pp_old_block >>+* [out] Pointer to the old block pointer arg >>+* >>+* pp_new_block >>+* [out] Pointer to the new block pointer arg >>+* >>+* RETURN VALUES >>+* 0 if OK 1 if failed > > > It is better (conventional) to use -1 as failure return status. I have seen and used both - depend on the application. I think I should have used IB_SUCCESS or IB_ERROR but I do not mind changing that to -1 too. 
> > >>+* >>+*********/ >>+ >>+/****f* OpenSM: osm_pkey_tbl_set_new_entry >>+* NAME >>+* osm_pkey_tbl_set_new_entry >>+* >>+* DESCRIPTION >>+* stores the given pkey in the "new" blocks array and update >>+* the "map" to show that on the "old" blocks >>+* >>+* SYNOPSIS >>+*/ >>+int >>+osm_pkey_tbl_set_new_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t block_idx, >>+ IN uint8_t pkey_idx, >>+ IN uint16_t pkey); >>+/* >>+* p_pkey_tbl >>+* [in] Pointer to the PKey table >>+* >>+* block_idx >>+* [in] The block index to use >>+* >>+* pkey_idx >>+* [in] The index within the block >>+* >>+* pkey >>+* [in] PKey to store >>+* >>+* RETURN VALUES >>+* 0 if OK 1 if failed > > > Ditto > > >>+* >>+*********/ >>+ >>+/****f* OpenSM: osm_pkey_find_next_free_entry >>+* NAME >>+* osm_pkey_find_next_free_entry >>+* >>+* DESCRIPTION >>+* Find the next free entry in the PKey table. Starting at the given >>+* index and block number. The user should increment pkey_idx before >>+* next call >>+* Inspect the "new" blocks array for empty space. >>+* >>+* SYNOPSIS >>+*/ >>+boolean_t >>+osm_pkey_find_next_free_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ OUT uint16_t *p_block_idx, >>+ OUT uint8_t *p_pkey_idx); >>+/* >>+* p_pkey_tbl >>+* [in] Pointer to the PKey table >>+* >>+* p_block_idx >>+* [out] The block index to use >>+* >>+* p_pkey_idx >>+* [out] The index within the block to use >>+* >>+* RETURN VALUES >>+* TRUE if found FALSE if did not find >>+* >>+*********/ >>+ >> /****f* OpenSM: osm_pkey_tbl_sync_new_blocks >> * NAME >> * osm_pkey_tbl_sync_new_blocks >>@@ -263,9 +411,44 @@ void osm_pkey_tbl_sync_new_blocks( >> * >> *********/ >> >>+/****f* OpenSM: osm_pkey_tbl_get_block_and_idx >>+* NAME >>+* osm_pkey_tbl_get_block_and_idx >>+* >>+* DESCRIPTION >>+* set the block index and pkey index the given >>+* pkey is found in. 
return 1 if cound not find >>+* it, 0 if OK >>+* >>+* SYNOPSIS >>+*/ >>+int >>+osm_pkey_tbl_get_block_and_idx( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t *p_pkey, >>+ OUT uint32_t *block_idx, >>+ OUT uint8_t *pkey_index); >>+/* >>+* p_pkey_tbl >>+* [in] Pointer to osm_pkey_tbl_t object. >>+* >>+* p_pkey >>+* [in] Pointer to the P_Key entry searched >>+* >>+* p_block_idx >>+* [out] Pointer to the block index to be updated >>+* >>+* p_pkey_idx >>+* [out] Pointer to the pkey index (in the block) to be updated >>+* >>+* >>+* NOTES >>+* >>+*********/ >>+ >> /****f* OpenSM: osm_pkey_tbl_set >> * NAME >> * osm_pkey_tbl_set >>Index: opensm/osm_pkey.c >>=================================================================== >>--- opensm/osm_pkey.c (revision 7904) >>+++ opensm/osm_pkey.c (working copy) >>@@ -100,6 +100,9 @@ int osm_pkey_tbl_init( >> cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); >> cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); >> cl_map_init( &p_pkey_tbl->keys, 1 ); >>+ cl_qlist_init( &p_pkey_tbl->pending ); >>+ p_pkey_tbl->used_blocks = 0; >>+ p_pkey_tbl->max_blocks = 0; >> return(IB_SUCCESS); >> } >> >>@@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( >> p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); >> if ( b < new_blocks ) >> p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); >>- else { >>+ else >>+ { >> p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); >> if (!p_new_block) >> break; >>+ cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, >>+ b, p_new_block); >>+ } >>+ >> memset(p_new_block, 0, sizeof(*p_new_block)); >>- cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); >> } >>- memcpy(p_new_block, p_block, sizeof(*p_new_block)); >>+} > > > You changed this function so it does not do any sync anymore. Should > function name be changed too? Yes correct I will change it. Is a better name: osm_pkey_tbl_init_new_blocks ? 
> > >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+void osm_pkey_tbl_cleanup_pending( >>+ IN osm_pkey_tbl_t *p_pkey_tbl) >>+{ >>+ cl_list_item_t *p_item; >>+ p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); >>+ while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) >>+ { >>+ free( (osm_pending_pkey_t *)p_item ); >> } >> } >> >>@@ -202,6 +220,138 @@ int osm_pkey_tbl_set( >> >> /********************************************************************** >> **********************************************************************/ >>+int osm_pkey_tbl_make_block_pair( >>+ osm_pkey_tbl_t *p_pkey_tbl, >>+ uint16_t block_idx, >>+ ib_pkey_table_t **pp_old_block, >>+ ib_pkey_table_t **pp_new_block) >>+{ >>+ if (block_idx >= p_pkey_tbl->max_blocks) return 1; >>+ >>+ if (pp_old_block) >>+ { >>+ *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); >>+ if (! *pp_old_block) >>+ { >>+ *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); >>+ if (!*pp_old_block) return 1; >>+ memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); >>+ cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); >>+ } >>+ } >>+ >>+ if (pp_new_block) >>+ { >>+ *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); >>+ if (! 
*pp_new_block) >>+ { >>+ *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); >>+ if (!*pp_new_block) return 1; >>+ memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); >>+ cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); >>+ } >>+ } >>+ return 0; >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+/* >>+ store the given pkey in the "new" blocks array and update the "map" >>+ to show that on the "old" blocks >>+*/ >>+int >>+osm_pkey_tbl_set_new_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t block_idx, >>+ IN uint8_t pkey_idx, >>+ IN uint16_t pkey) >>+{ >>+ ib_pkey_table_t *p_old_block; >>+ ib_pkey_table_t *p_new_block; >>+ >>+ if (osm_pkey_tbl_make_block_pair( >>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) >>+ return 1; >>+ >>+ cl_map_insert( &p_pkey_tbl->keys, >>+ ib_pkey_get_base(pkey), >>+ &(p_old_block->pkey_entry[pkey_idx])); > > > Here you map potentially empty pkey entry. Why? "old block" will be > remapped anyway on pkey receiving. The reason I did this was that if the GetResp will fail I still want to represent the settings in the map.But actually it might be better not to do that so next time we run we will not find it without a GetResp. > > Actually I don't see why you want this pretty tricky and pkey_mgr > specific procedure as generic function. I think once the new_blocks was made available through the osm_pkey.h we actually burden the pkey table object with the full complexity of the pkey manager. 
So I think the right place for the functions changing the pkey table is in the osm_pkey.* > > >>+ p_new_block->pkey_entry[pkey_idx] = pkey; >>+ if (p_pkey_tbl->used_blocks < block_idx) >>+ p_pkey_tbl->used_blocks = block_idx; >>+ >>+ return 0; >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+boolean_t >>+osm_pkey_find_next_free_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ OUT uint16_t *p_block_idx, >>+ OUT uint8_t *p_pkey_idx) >>+{ >>+ ib_pkey_table_t *p_new_block; >>+ >>+ CL_ASSERT(p_block_idx); >>+ CL_ASSERT(p_pkey_idx); >>+ >>+ while ( *p_block_idx < p_pkey_tbl->max_blocks) >>+ { >>+ if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) >>+ { >>+ *p_pkey_idx = 0; >>+ (*p_block_idx)++; >>+ if (*p_block_idx >= p_pkey_tbl->max_blocks) >>+ return FALSE; >>+ } >>+ >>+ p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); >>+ >>+ if ( !p_new_block || >>+ ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) >>+ return TRUE; >>+ else >>+ (*p_pkey_idx)++; >>+ } >>+ return FALSE; >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+int >>+osm_pkey_tbl_get_block_and_idx( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t *p_pkey, >>+ OUT uint32_t *p_block_idx, >>+ OUT uint8_t *p_pkey_index) >>+{ >>+ uint32_t num_of_blocks; >>+ uint32_t block_index; >>+ ib_pkey_table_t *block; >>+ >>+ CL_ASSERT( p_pkey_tbl ); >>+ CL_ASSERT( p_block_idx != NULL ); >>+ CL_ASSERT( p_pkey_idx != NULL ); > > > Why last two CL_ASSERTs? What should be problem with uninitialized > pointers here? > These are the outputs of the function. It does not make sense to call the functions with null output pointers (calling by ref) . 
Anyway instead of putting the check in the free build I used an assert > >>+ >>+ num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); >>+ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ { >>+ block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); >>+ if ( ( block->pkey_entry <= p_pkey ) && >>+ ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) >>+ { >>+ *p_block_idx = block_index; >>+ *p_pkey_index = p_pkey - block->pkey_entry; >>+ return 0; >>+ } >>+ } >>+ return 1; >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >> static boolean_t __osm_match_pkey ( >> IN const ib_net16_t *pkey1, >> IN const ib_net16_t *pkey2 ) { >>@@ -305,7 +455,8 @@ osm_physp_share_pkey( >> if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) >> return TRUE; >> >>- return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); >>+ return >>+ !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); >> } >> >> /********************************************************************** >>@@ -321,7 +472,8 @@ osm_port_share_pkey( >> >> OSM_LOG_ENTER( p_log, osm_port_share_pkey ); >> >>- if (!p_port_1 || !p_port_2) { >>+ if (!p_port_1 || !p_port_2) >>+ { >> ret = FALSE; >> goto Exit; >> } >>@@ -329,7 +481,8 @@ osm_port_share_pkey( >> p_physp1 = osm_port_get_default_phys_ptr(p_port_1); >> p_physp2 = osm_port_get_default_phys_ptr(p_port_2); >> >>- if (!p_physp1 || !p_physp2) { >>+ if (!p_physp1 || !p_physp2) >>+ { >> ret = FALSE; >> goto Exit; >> } >>Index: opensm/osm_pkey_mgr.c >>=================================================================== >>--- opensm/osm_pkey_mgr.c (revision 7904) >>+++ opensm/osm_pkey_mgr.c (working copy) >>@@ -62,6 +62,139 @@ >> >> /********************************************************************** >> 
**********************************************************************/ >>+/* >>+ the max number of pkey blocks for a physical port is located in >>+ different place for switch external ports (SwitchInfo) and the >>+ rest of the ports (NodeInfo) >>+*/ >>+static int pkey_mgr_get_physp_max_blocks( > > > I would suggest to add _cap_ to function name. Not too much critical > since it is static function. > > >>+ IN const osm_subn_t *p_subn, >>+ IN const osm_physp_t *p_physp) >>+{ >>+ osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); >>+ osm_switch_t *p_sw; >>+ uint16_t num_pkeys = 0; >>+ >>+ if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || >>+ (osm_physp_get_port_num( p_physp ) == 0)) >>+ num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); >>+ else >>+ { >>+ p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); >>+ if (p_sw) >>+ num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); >>+ } >>+ return( (num_pkeys + 31) / 32 ); >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+/* >>+ * Insert the new pending pkey entry to the specific port pkey table >>+ * pending pkeys. new entries are inserted at the back. >>+ */ >>+static void pkey_mgr_process_physical_port( >>+ IN osm_log_t *p_log, >>+ IN const osm_req_t *p_req, >>+ IN const ib_net16_t pkey, >>+ IN osm_physp_t *p_physp ) >>+{ >>+ osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); >>+ osm_pkey_tbl_t *p_pkey_tbl; >>+ ib_net16_t *p_orig_pkey; >>+ char *stat = NULL; >>+ osm_pending_pkey_t *p_pending; >>+ >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); >>+ if (! p_pkey_tbl) > > ^^^^^^^^^^^^^ > Is it possible? Yes it is ! I run into it during testing. The port did not have any pkey table. 
> > >>+ { >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_process_physical_port: ERR 0501: " >>+ "No pkey table found for node " >>+ "0x%016" PRIx64 " port %u\n", >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+ return; >>+ } >>+ >>+ p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); >>+ if (! p_pending) >>+ { >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_process_physical_port: ERR 0502: " >>+ "Fail to allocate new pending pkey entry for node " >>+ "0x%016" PRIx64 " port %u\n", >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+ return; >>+ } >>+ p_pending->pkey = pkey; >>+ p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); >>+ if ( !p_orig_pkey || >>+ (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) )) > > > There the cases of new pkey and updated pkey membership is mixed. Why? I am not following your question. The specific case I am trying to catch is the one that for some reason the map points to a pkey entry that was modified somehow and is different then the one you would expect by the map. > > >>+ { >>+ p_pending->is_new = TRUE; >>+ cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); >>+ stat = "inserted"; >>+ } >>+ else >>+ { >>+ p_pending->is_new = FALSE; >>+ if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, >>+ &p_pending->block, &p_pending->index)) > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > AFAIK in this function there were CL_ASSERTs which check for uinitialized > pointers. True. So the asserts are not required in this case. 
> > >>+ { >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_process_physical_port: ERR 0503: " >>+ "Fail to obtain P_Key 0x%04x block and index for node " >>+ "0x%016" PRIx64 " port %u\n", >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+ return; >>+ } >>+ cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); >>+ stat = "updated"; > > > Is it will be updated? It is likely "already there" case. No? > > Also in this case you can already put the pkey in new_block instead of > holding it in pending list. Then later you will only need to add new > pkeys. This may simplify the flow and even save some mem. True but in my mind it does not simplify - on the contrary it makes the partition between populating each port pending list and actually setting the pkey tables mixed. I do not think the memory impact deserves this mix of staging > > >>+ } >>+ >>+ osm_log( p_log, OSM_LOG_DEBUG, >>+ "pkey_mgr_process_physical_port: " >>+ "pkey 0x%04x was %s for node 0x%016" PRIx64 >>+ " port %u\n", >>+ cl_ntoh16( pkey ), stat, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+static void >>+pkey_mgr_process_partition_table( >>+ osm_log_t *p_log, >>+ const osm_req_t *p_req, >>+ const osm_prtn_t *p_prtn, >>+ const boolean_t full ) >>+{ >>+ const cl_map_t *p_tbl = full ? 
>>+ &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; >>+ cl_map_iterator_t i, i_next; >>+ ib_net16_t pkey = p_prtn->pkey; >>+ osm_physp_t *p_physp; >>+ >>+ if ( full ) >>+ pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); >>+ >>+ i_next = cl_map_head( p_tbl ); >>+ while ( i_next != cl_map_end( p_tbl ) ) >>+ { >>+ i = i_next; >>+ i_next = cl_map_next( i ); >>+ p_physp = cl_map_obj( i ); >>+ if ( p_physp && osm_physp_is_valid( p_physp ) ) >>+ pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); >>+ } >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >> static ib_api_status_t >> pkey_mgr_update_pkey_entry( >> IN const osm_req_t *p_req, >>@@ -114,7 +247,8 @@ pkey_mgr_enforce_partition( >> p_pi->state_info2 = 0; >> ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); >> >>- context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); >>+ context.pi_context.node_guid = >>+ osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); >> context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); >> context.pi_context.set_method = TRUE; >> context.pi_context.update_master_sm_base_lid = FALSE; >>@@ -131,80 +265,132 @@ pkey_mgr_enforce_partition( >> >> /********************************************************************** >> **********************************************************************/ >>-/* >>- * Prepare a new entry for the pkey table for this port when this pkey >>- * does not exist. Update existed entry when membership was changed. 
>>- */ >>-static void pkey_mgr_process_physical_port( >>- IN osm_log_t *p_log, >>- IN const osm_req_t *p_req, >>- IN const ib_net16_t pkey, >>- IN osm_physp_t *p_physp ) >>+static boolean_t pkey_mgr_update_port( >>+ osm_log_t *p_log, >>+ osm_req_t *p_req, >>+ const osm_port_t * const p_port ) >> { >>- osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); >>- ib_pkey_table_t *block; >>+ osm_physp_t *p_physp; >>+ osm_node_t *p_node; >>+ ib_pkey_table_t *block, *new_block; >>+ osm_pkey_tbl_t *p_pkey_tbl; >> uint16_t block_index; >>+ uint8_t pkey_index; >>+ uint16_t last_free_block_index = 0; >>+ uint16_t last_free_pkey_index = 0; >> uint16_t num_of_blocks; >>- const osm_pkey_tbl_t *p_pkey_tbl; >>- ib_net16_t *p_orig_pkey; >>- char *stat = NULL; >>- uint32_t i; >>+ uint16_t max_num_of_blocks; >> >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ ib_api_status_t status; >>+ boolean_t ret_val = FALSE; >>+ osm_pending_pkey_t *p_pending; >>+ boolean_t found; >> >>- p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); >>+ p_physp = osm_port_get_default_phys_ptr( p_port ); >>+ if ( !osm_physp_is_valid( p_physp ) ) >>+ return FALSE; >> >>- if ( !p_orig_pkey ) >>- { >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); >>+ num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); >>+ if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) >> { >>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>- for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) >>+ osm_log( p_log, OSM_LOG_INFO, >>+ "pkey_mgr_update_port: " >>+ "Max number of blocks reduced from %u to %u " >>+ "for node 0x%016" PRIx64 " port %u\n", >>+ p_pkey_tbl->max_blocks, max_num_of_blocks, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ 
osm_physp_get_port_num( p_physp ) ); >>+ } >>+ p_pkey_tbl->max_blocks = max_num_of_blocks; >>+ >>+ osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); >>+ cl_map_remove_all( &p_pkey_tbl->keys ); > > > What is the reason to drop map here? AFAIK it will be reinitialized later > anyway when pkey blocks will be received. What if it is not received? > > >>+ p_pkey_tbl->used_blocks = 0; >>+ >>+ /* >>+ process every pending pkey in order - >>+ first must be "updated" last are "new" >>+ */ >>+ p_pending = >>+ (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); >>+ while (p_pending != >>+ (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) >>+ { >>+ if (p_pending->is_new == FALSE) >>+ { >>+ block_index = p_pending->block; >>+ pkey_index = p_pending->index; >>+ found = TRUE; >>+ } >>+ else >> { >>- if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) >>+ found = osm_pkey_find_next_free_entry(p_pkey_tbl, >>+ &last_free_block_index, >>+ &last_free_pkey_index); > > > There should be warning: expected third arg is uint8_t* True. 
I will fix the variable declaration to uint8_t > > >>+ if ( !found ) >> { >>- block->pkey_entry[i] = pkey; >>- stat = "inserted"; >>- goto _done; >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_update_port: ERR 0504: " >>+ "failed to find empty space for new pkey 0x%04x " >>+ "of node 0x%016" PRIx64 " port %u\n", >>+ cl_ntoh16(p_pending->pkey), >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >> } >>+ else >>+ { >>+ block_index = last_free_block_index; >>+ pkey_index = last_free_pkey_index++; >> } >> } >>+ >>+ if (found) >>+ { >>+ if (osm_pkey_tbl_set_new_entry( >>+ p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) >>+ { >> osm_log( p_log, OSM_LOG_ERROR, >>- "pkey_mgr_process_physical_port: ERR 0501: " >>- "No empty pkey entry was found to insert 0x%04x for node " >>- "0x%016" PRIx64 " port %u\n", >>- cl_ntoh16( pkey ), >>+ "pkey_mgr_update_port: ERR 0505: " >>+ "failed to set PKey 0x%04x in block %u idx %u " >>+ "of node 0x%016" PRIx64 " port %u\n", >>+ p_pending->pkey, block_index, pkey_index, >> cl_ntoh64( osm_node_get_node_guid( p_node ) ), >> osm_physp_get_port_num( p_physp ) ); >> } >>- else if ( *p_orig_pkey != pkey ) >>- { >>+ } >>+ >>+ free( p_pending ); >>+ p_pending = >>+ (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); >>+ } >>+ >>+ /* now look for changes and store */ >> for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >> { >>- /* we need real block (not just new_block) in order >>- * to resolve block/pkey indices */ >> block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); >>- i = p_orig_pkey - block->pkey_entry; >>- if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { >>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>- block->pkey_entry[i] = pkey; >>- stat = "updated"; >>- goto _done; >>- } >>- } >>- } >>+ new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >> >>- _done: >>- if (stat) { >>- osm_log( p_log, OSM_LOG_VERBOSE, >>- 
"pkey_mgr_process_physical_port: " >>- "pkey 0x%04x was %s for node 0x%016" PRIx64 >>- " port %u\n", >>- cl_ntoh16( pkey ), stat, >>+ if (block && >>+ (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) >>+ continue; >>+ >>+ status = pkey_mgr_update_pkey_entry( >>+ p_req, p_physp , new_block, block_index ); >>+ if (status == IB_SUCCESS) >>+ ret_val = TRUE; >>+ else >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_update_port: ERR 0506: " >>+ "pkey_mgr_update_pkey_entry() failed to update " >>+ "pkey table block %d for node 0x%016" PRIx64 " port %u\n", >>+ block_index, >> cl_ntoh64( osm_node_get_node_guid( p_node ) ), >> osm_physp_get_port_num( p_physp ) ); >> } >>+ >>+ return ret_val; >> } >> >> /********************************************************************** >>@@ -217,21 +403,23 @@ pkey_mgr_update_peer_port( >> const osm_port_t * const p_port, >> boolean_t enforce ) >> { >>- osm_physp_t *p, *peer; >>+ osm_physp_t *p_physp, *peer; >> osm_node_t *p_node; >> ib_pkey_table_t *block, *peer_block; >>- const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; >>+ const osm_pkey_tbl_t *p_pkey_tbl; >>+ osm_pkey_tbl_t *p_peer_pkey_tbl; >> osm_switch_t *p_sw; >> ib_switch_info_t *p_si; >> uint16_t block_index; >> uint16_t num_of_blocks; >>+ uint16_t peer_max_blocks; >> ib_api_status_t status = IB_SUCCESS; >> boolean_t ret_val = FALSE; >> >>- p = osm_port_get_default_phys_ptr( p_port ); >>- if ( !osm_physp_is_valid( p ) ) >>+ p_physp = osm_port_get_default_phys_ptr( p_port ); >>+ if ( !osm_physp_is_valid( p_physp ) ) >> return FALSE; >>- peer = osm_physp_get_remote( p ); >>+ peer = osm_physp_get_remote( p_physp ); >> if ( !peer || !osm_physp_is_valid( peer ) ) >> return FALSE; >> p_node = osm_physp_get_node_ptr( peer ); >>@@ -245,7 +433,7 @@ pkey_mgr_update_peer_port( >> if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) >> { >> osm_log( p_log, OSM_LOG_ERROR, >>- "pkey_mgr_update_peer_port: ERR 0502: " >>+ "pkey_mgr_update_peer_port: ERR 0507: 
" >> "pkey_mgr_enforce_partition() failed to update " >> "node 0x%016" PRIx64 " port %u\n", >> cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>@@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( >> if (enforce == FALSE) >> return FALSE; >> >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); >>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); >> num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); >>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) >>+ { >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_update_peer_port: ERR 0508: " >>+ "not enough entries (%u < %u) on switch 0x%016" PRIx64 >>+ " port %u\n", >>+ peer_max_blocks, num_of_blocks, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( peer ) ); >>+ return FALSE; > > > Do you think it is the best way, just to skip update - partitions are > enforced already on the switch. May be better to truncate pkey tables > in order to meet peer's capabilities? You are right about that - Its a bug! I think the best approach here is to turn off the enforcement on the switch. If we truncate the table we actually impact connectivity of the fabric. I prefer a softer approach - an error in the log. 
> > >>+ } >> >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; >>+ for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) >> { >> block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >> peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); >> if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) >> { >>+ osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block); > > > Why this (osm_pkey_tbl_set())? This will be called by receiver. Same as the above note about updating the map I wanted to avoid to wait for the GetResp. I think it is a mistake and we can actually remove it. > > >> status = pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); >> if ( status == IB_SUCCESS ) >> ret_val = TRUE; >> else >> osm_log( p_log, OSM_LOG_ERROR, >>- "pkey_mgr_update_peer_port: ERR 0503: " >>+ "pkey_mgr_update_peer_port: ERR 0509: " >> "pkey_mgr_update_pkey_entry() failed to update " >> "pkey table block %d for node 0x%016" PRIx64 >> " port %u\n", >>@@ -282,10 +482,10 @@ pkey_mgr_update_peer_port( >> } >> } >> >>- if ( ret_val == TRUE && >>- osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) >>+ if ( (ret_val == TRUE) && >>+ osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) >> { >>- osm_log( p_log, OSM_LOG_VERBOSE, >>+ osm_log( p_log, OSM_LOG_DEBUG, >> "pkey_mgr_update_peer_port: " >> "pkey table was updated for node 0x%016" PRIx64 >> " port %u\n", >>@@ -298,82 +498,6 @@ pkey_mgr_update_peer_port( >> >> /********************************************************************** >> **********************************************************************/ >>-static boolean_t pkey_mgr_update_port( >>- osm_log_t *p_log, >>- osm_req_t *p_req, >>- const osm_port_t * const p_port ) >>-{ >>- osm_physp_t *p; >>- osm_node_t *p_node; >>- ib_pkey_table_t *block, *new_block; >>- const osm_pkey_tbl_t *p_pkey_tbl; >>- uint16_t block_index; >>- uint16_t num_of_blocks; >>- 
ib_api_status_t status; >>- boolean_t ret_val = FALSE; >>- >>- p = osm_port_get_default_phys_ptr( p_port ); >>- if ( !osm_physp_is_valid( p ) ) >>- return FALSE; >>- >>- p_pkey_tbl = osm_physp_get_pkey_tbl(p); >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>- >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>- { >>- block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); >>- new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>- >>- if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) >>- continue; >>- >>- status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); >>- if (status == IB_SUCCESS) >>- ret_val = TRUE; >>- else >>- osm_log( p_log, OSM_LOG_ERROR, >>- "pkey_mgr_update_port: ERR 0504: " >>- "pkey_mgr_update_pkey_entry() failed to update " >>- "pkey table block %d for node 0x%016" PRIx64 " port %u\n", >>- block_index, >>- cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>- osm_physp_get_port_num( p ) ); >>- } >>- >>- return ret_val; >>-} >>- >>-/********************************************************************** >>- **********************************************************************/ >>-static void >>-pkey_mgr_process_partition_table( >>- osm_log_t *p_log, >>- const osm_req_t *p_req, >>- const osm_prtn_t *p_prtn, >>- const boolean_t full ) >>-{ >>- const cl_map_t *p_tbl = full ? 
>>- &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; >>- cl_map_iterator_t i, i_next; >>- ib_net16_t pkey = p_prtn->pkey; >>- osm_physp_t *p_physp; >>- >>- if ( full ) >>- pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); >>- >>- i_next = cl_map_head( p_tbl ); >>- while ( i_next != cl_map_end( p_tbl ) ) >>- { >>- i = i_next; >>- i_next = cl_map_next( i ); >>- p_physp = cl_map_obj( i ); >>- if ( p_physp && osm_physp_is_valid( p_physp ) ) >>- pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); >>- } >>-} >>- >>-/********************************************************************** >>- **********************************************************************/ >> osm_signal_t >> osm_pkey_mgr_process( >> IN osm_opensm_t *p_osm ) >>@@ -383,8 +507,7 @@ osm_pkey_mgr_process( >> osm_prtn_t *p_prtn; >> osm_port_t *p_port; >> osm_signal_t signal = OSM_SIGNAL_DONE; >>- osm_physp_t *p_physp; >>- >>+ osm_node_t *p_node; >> CL_ASSERT( p_osm ); >> >> OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); >>@@ -394,32 +517,25 @@ osm_pkey_mgr_process( >> if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) >> { >> osm_log( &p_osm->log, OSM_LOG_ERROR, >>- "osm_pkey_mgr_process: ERR 0505: " >>+ "osm_pkey_mgr_process: ERR 0510: " >> "osm_prtn_make_partitions() failed\n" ); >> goto _err; >> } >> >>- p_tbl = &p_osm->subn.port_guid_tbl; >>- p_next = cl_qmap_head( p_tbl ); >>- while ( p_next != cl_qmap_end( p_tbl ) ) >>- { >>- p_port = ( osm_port_t * ) p_next; >>- p_next = cl_qmap_next( p_next ); >>- p_physp = osm_port_get_default_phys_ptr( p_port ); >>- if ( osm_physp_is_valid( p_physp ) ) >>- osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); >>- } >>- >>+ /* populate the pending pkey entries by scanning all partitions */ >> p_tbl = &p_osm->subn.prtn_pkey_tbl; >> p_next = cl_qmap_head( p_tbl ); >> while ( p_next != cl_qmap_end( p_tbl ) ) >> { >> p_prtn = ( osm_prtn_t * ) p_next; >> p_next = cl_qmap_next( p_next ); >>- 
pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); >>- pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); >>+ pkey_mgr_process_partition_table( >>+ &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); >>+ pkey_mgr_process_partition_table( >>+ &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); >> } >> >>+ /* calculate new pkey tables and set */ >> p_tbl = &p_osm->subn.port_guid_tbl; >> p_next = cl_qmap_head( p_tbl ); >> while ( p_next != cl_qmap_end( p_tbl ) ) >>@@ -428,8 +544,10 @@ osm_pkey_mgr_process( >> p_next = cl_qmap_next( p_next ); >> if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) >> signal = OSM_SIGNAL_DONE_PENDING; >>- if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && >>- pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, >>+ p_node = osm_port_get_parent_node( p_port ); >>+ if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && >>+ pkey_mgr_update_peer_port( >>+ &p_osm->log, &p_osm->sm.req, >> &p_osm->subn, p_port, >> !p_osm->subn.opt.no_partition_enforcement ) ) >> signal = OSM_SIGNAL_DONE_PENDING; >> >> > > > Thanks, > Sasha From mst at mellanox.co.il Thu Jun 15 05:43:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 15:43:23 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> References: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> Message-ID: <20060615124323.GB13121@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: [openib-general] RFC: detecting duplicate MAD requests > > >Well the ACK for the direction switch is special, isn't it? > >All I'm saying, let's pass it up to the application. > > I really don't think that this is the direction that we want to take the > interface. Yes, you are right. 
So, I thought about this some more, and I think I see how your approach can be adapted without breaking applications in subtle ways: When a transaction arrives, pass it to the user and don't keep any state. When the ACK for segment 0 arrives, we know there will be a response in this transaction, so we can queue it up already (but don't send yet, as we don't have the data). Start sending when the user responds. To solve the case where the user responds before the ACK for segment 0 arrives, a responder in DS RMPP will pass an IsDS flag when it sends the response. The MAD core will then not start sending until the ACK for segment 0 arrives. -- MST From halr at voltaire.com Thu Jun 15 05:54:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Jun 2006 08:54:48 -0400 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <44915060.6090103@mellanox.co.il> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <20060615110617.GA21560@sashak.voltaire.com> <44915060.6090103@mellanox.co.il> Message-ID: <1150376088.4506.40087.camel@hal.voltaire.com> On Thu, 2006-06-15 at 08:19, Eitan Zahavi wrote: > >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > >>+ if (! p_pkey_tbl) > > > > ^^^^^^^^^^^^^ > > Is it possible? > Yes it is! I ran into it during testing. The port did not have any pkey table. PKey tables are optional and predicated on NodeInfo:PartitionCap for endports, which has a minimum of 1, and SwitchInfo:PartitionEnforcementCap for switch external (physical) ports, which can be 0. Is this routine used for an endport (CA, router, switch management port), switch external port, or both?
> >>@@ -217,21 +403,23 @@ pkey_mgr_update_peer_port( > >> const osm_port_t * const p_port, > >> boolean_t enforce ) > >> { > >>- osm_physp_t *p, *peer; > >>+ osm_physp_t *p_physp, *peer; > >> osm_node_t *p_node; > >> ib_pkey_table_t *block, *peer_block; > >>- const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; > >>+ const osm_pkey_tbl_t *p_pkey_tbl; > >>+ osm_pkey_tbl_t *p_peer_pkey_tbl; > >> osm_switch_t *p_sw; > >> ib_switch_info_t *p_si; > >> uint16_t block_index; > >> uint16_t num_of_blocks; > >>+ uint16_t peer_max_blocks; > >> ib_api_status_t status = IB_SUCCESS; > >> boolean_t ret_val = FALSE; > >> > >>- p = osm_port_get_default_phys_ptr( p_port ); > >>- if ( !osm_physp_is_valid( p ) ) > >>+ p_physp = osm_port_get_default_phys_ptr( p_port ); > >>+ if ( !osm_physp_is_valid( p_physp ) ) > >> return FALSE; > >>- peer = osm_physp_get_remote( p ); > >>+ peer = osm_physp_get_remote( p_physp ); > >> if ( !peer || !osm_physp_is_valid( peer ) ) > >> return FALSE; > >> p_node = osm_physp_get_node_ptr( peer ); > >>@@ -245,7 +433,7 @@ pkey_mgr_update_peer_port( > >> if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) > >> { > >> osm_log( p_log, OSM_LOG_ERROR, > >>- "pkey_mgr_update_peer_port: ERR 0502: " > >>+ "pkey_mgr_update_peer_port: ERR 0507: " > >> "pkey_mgr_enforce_partition() failed to update " > >> "node 0x%016" PRIx64 " port %u\n", > >> cl_ntoh64( osm_node_get_node_guid( p_node ) ), > >>@@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( > >> if (enforce == FALSE) > >> return FALSE; > >> > >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); > >>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); > >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > >> num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > >>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) > >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); > >>+ peer_max_blocks = 
pkey_mgr_get_physp_max_blocks( p_subn, peer ); > >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) > >>+ { > >>+ osm_log( p_log, OSM_LOG_ERROR, > >>+ "pkey_mgr_update_peer_port: ERR 0508: " > >>+ "not enough entries (%u < %u) on switch 0x%016" PRIx64 > >>+ " port %u\n", > >>+ peer_max_blocks, num_of_blocks, > >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), > >>+ osm_physp_get_port_num( peer ) ); > >>+ return FALSE; > > > > > > Do you think it is the best way, just to skip the update - partitions are > > enforced already on the switch. Maybe it would be better to truncate pkey tables > > in order to meet the peer's capabilities? > You are right about that - It's a bug! > I think the best approach here is to turn off the enforcement on the switch. > If we truncate the table we actually impact connectivity of the fabric. > I prefer a softer approach - an error in the log. Makes sense to me. It is better to give the administrator as close to what he wants as possible, and not punish him for something like this, but warn him that his policy is weakened. In addition to an error in the log, this should also go to OSM_LOG_SYS so it might be noticed without checking the log. -- Hal From swise at opengridcomputing.com Thu Jun 15 06:41:03 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 08:41:03 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver.
In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC05D4E2D8@venom2> References: <5E701717F2B2ED4EA60F87C8AA57B7CC05D4E2D8@venom2> Message-ID: <1150378863.22603.12.camel@stevo-desktop> On Wed, 2006-06-14 at 20:35 -0500, Bob Sharp wrote: > > +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) > > +{ > > + > > + case C2_RES_IND_EP:{ > > + > > + struct c2wr_ae_connection_request *req = > > + &wr->ae.ae_connection_request; > > + struct iw_cm_id *cm_id = > > + (struct iw_cm_id *)resource_user_context; > > + > > + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); > > + if (event_id != CCAE_CONNECTION_REQUEST) { > > + pr_debug("%s: Invalid event_id: %d\n", > > + __FUNCTION__, event_id); > > + break; > > + } > > + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; > > + cm_event.provider_data = (void*)(unsigned > long)req->cr_handle; > > + cm_event.local_addr.sin_addr.s_addr = req->laddr; > > + cm_event.remote_addr.sin_addr.s_addr = req->raddr; > > + cm_event.local_addr.sin_port = req->lport; > > + cm_event.remote_addr.sin_port = req->rport; > > + cm_event.private_data_len = > > + be32_to_cpu(req->private_data_length); > > + > > + if (cm_event.private_data_len) { > > > It looks to me as if pdata is leaking here since it is not tracked and > the upper layers do not free it. Also, if pdata is freed after the call > to cm_id->event_handler returns, it exposes an issue in user space where > the private data is garbage. I suspect the iwarp cm should be copying > this data before it returns. > Good catch. Yes, I think the IWCM should copy the private data in the upcall. If it does, then the amso driver doesn't need to kmalloc()/copy at all. It can pass a ptr to its MQ entry directly... Thanks, Steve. From mst at mellanox.co.il Thu Jun 15 06:56:09 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 16:56:09 +0300 Subject: [openib-general] on vacation through June 24 Message-ID: <20060615135609.GA2281@mellanox.co.il> I'll be on vacation through June 24. 
I won't be online most of the time. -- MST From swise at opengridcomputing.com Thu Jun 15 07:03:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 09:03:31 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1150378863.22603.12.camel@stevo-desktop> References: <5E701717F2B2ED4EA60F87C8AA57B7CC05D4E2D8@venom2> <1150378863.22603.12.camel@stevo-desktop> Message-ID: <1150380211.22603.17.camel@stevo-desktop> On Thu, 2006-06-15 at 08:41 -0500, Steve Wise wrote: > On Wed, 2006-06-14 at 20:35 -0500, Bob Sharp wrote: > > > > +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) > > > +{ > > > + > > > > > > + case C2_RES_IND_EP:{ > > > + > > > + struct c2wr_ae_connection_request *req = > > > + &wr->ae.ae_connection_request; > > > + struct iw_cm_id *cm_id = > > > + (struct iw_cm_id *)resource_user_context; > > > + > > > + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); > > > + if (event_id != CCAE_CONNECTION_REQUEST) { > > > + pr_debug("%s: Invalid event_id: %d\n", > > > + __FUNCTION__, event_id); > > > + break; > > > + } > > > + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; > > > + cm_event.provider_data = (void*)(unsigned > > long)req->cr_handle; > > > + cm_event.local_addr.sin_addr.s_addr = req->laddr; > > > + cm_event.remote_addr.sin_addr.s_addr = req->raddr; > > > + cm_event.local_addr.sin_port = req->lport; > > > + cm_event.remote_addr.sin_port = req->rport; > > > + cm_event.private_data_len = > > > + be32_to_cpu(req->private_data_length); > > > + > > > + if (cm_event.private_data_len) { > > > > > > It looks to me as if pdata is leaking here since it is not tracked and > > the upper layers do not free it. Also, if pdata is freed after the call > > to cm_id->event_handler returns, it exposes an issue in user space where > > the private data is garbage. I suspect the iwarp cm should be copying > > this data before it returns. > > > > Good catch. 
> > Yes, I think the IWCM should copy the private data in the upcall. If it > does, then the amso driver doesn't need to kmalloc()/copy at all. It > can pass a ptr to its MQ entry directly... > Now that I've looked more into this, I'm not sure there's a simple way for the IWCM to copy the pdata on the upcall. Currently, the IWCM's event upcall, cm_event_handler(), simply queues the work for processing on a workqueue thread. So there's no per-event logic at all there. Lemme think on this more. Stay tuned. Either way, the amso driver has a memory leak... Steve. From jlentini at netapp.com Thu Jun 15 08:05:23 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 15 Jun 2006 11:05:23 -0400 (EDT) Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <449119AE.2010703@voltaire.com> References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com> Message-ID: On Thu, 15 Jun 2006, Or Gerlitz wrote: > Sean Hefty wrote: > > James Lentini wrote: > >> The IBTA spec (volume 1, version 1.2) describes a communication > >> established affiliated asynchronous event. > >> We've seen this event delivered to our NFS-RDMA server and aren't sure > >> what to do with it. > > > This event is delivered to the verbs consumer, since it occurs on > > the QP. It's expected that the consumer will call > > ib_cm_establish. Although, I would guess that you can probably > > ignore the event, under the assumption that the RTU will > > eventually be received by the local CM. > > Sean, > > The cma/verbs consumer can't just ignore the event since its qp > state is still RTR which means an attempt to tx replying the rx > would fail. Good point. > On the other hand it can't call ib_cm_establish since the CMA does > not expose an API for that, This is a problem. 
> nor the CM can register a cb to get this event and emulate an RTU > reception since the CMA is the one to create the QP and the CMA > consumer providing the qp_init_attr along with event handler... > > I suggest the following design: the CMA would replace the event > handler provided with the qp_init_attr struct with a callback of its > own and keep the original handler/context on a private structure. > > On the delivery of IB_EVENT_COMM_EST event, the CMA would call down > the CM to emulate RTU reception (ib_cm_establish) and then call up ib_cm_establish() doesn't emulate an RTU reception. It generates an IB_CM_USER_ESTABLISHED event (not an IB_CM_RTU_RECEIVED event). The CMA's cma_ib_handler() doesn't recognize a IB_CM_USER_ESTABLISHED event. The QP's state will not be moved to RTS. > the consumer original handler, typical CMA consumers would just > ignore this event, i think. > > The CM should be able to allow ib_cm_established to be called in the > context over which the event handler is called (or jump the > treatment to higher context). The CM must also ignore the actual RTU > if it arrives later/in parallel to when ib_cm_establish was called. > > By this design the verbs consumer is guaranteed to always get > RDMA_CM_EVENT_ESTABLISHED no matter if the RTU is just late or never > arrives The CMA's cma_ib_handler() needs to be modified for this to be true. > but it still can get a CQ RX completion(s) before getting the CMA > established event; in that case it can queue these completion > elements for the short time window before the established event > arrives and then process them. Consumers don't actually have to queue the completions, they have to defer posting sends (either in response to the recvs or otherwise) until the QP moves to RTS. Could the implementations queue up the requests for the consumers? Strictly speaking, IB requires an error to be generated (C10-29 in the IBTA spec. vol 1, page 456). 
Still, it would be nice if consumers didn't have to worry about this issue. > A design similar to that was implemented at the Voltaire gen1 stack > and it works in production with iSER target and VIBNAL (CFS Lustre > NAL for voltaire gen1 ib) server side. > > Does anyone know on what context (hard_irq, soft_irq, thread) are > the event handlers being called? > > Or. From mamidala at cse.ohio-state.edu Thu Jun 15 09:10:11 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Thu, 15 Jun 2006 12:10:11 -0400 (EDT) Subject: [openib-general] librdmacm error with rping In-Reply-To: Message-ID: Hi, I have installed the latest infiniband stack with 2.6.16.20 kernel. I tested the installation using ibv_rc_pingpong and it works fine. But, while trying to use rping, I get the following error: librdmacm: couldn't open rdma_cm ABI version. rdma_create_event_channel error 2 Any clues as to why this might be happening will be of great help, Thanks, Amith From swise at opengridcomputing.com Thu Jun 15 09:40:09 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 11:40:09 -0500 Subject: [openib-general] librdmacm error with rping In-Reply-To: References: Message-ID: <1150389609.6371.1.camel@stevo-desktop> Sounds like maybe the librdmacm.so that's installed is down-level... Did you nuke the old one, rerun autogen.sh, configure, make, make install in the librdmacm directory? Stevo. On Thu, 2006-06-15 at 12:10 -0400, amith rajith mamidala wrote: > Hi, > > I have installed the latest infiniband stack with 2.6.16.20 kernel. > I tested the installation using ibv_rc_pingpong and it works fine. > But, while trying to use rping, I get the following error: > > librdmacm: couldn't open rdma_cm ABI version.
> rdma_create_event_channel error 2 > > Any clues as to why this might be happening will be of great help, > > > Thanks, > Amith > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Jun 15 11:15:24 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 Jun 2006 21:15:24 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <44915060.6090103@mellanox.co.il> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <20060615110617.GA21560@sashak.voltaire.com> <44915060.6090103@mellanox.co.il> Message-ID: <20060615181524.GB24808@sashak.voltaire.com> Hi Eitan, On 15:19 Thu 15 Jun , Eitan Zahavi wrote: > >>+/* > >>+* PARAMETERS > >>+* p_physp > >>+* [in] Pointer to an osm_physp_t object. > >>+* > >>+* RETURN VALUES > >>+* The pointer to the P_Key table object. > >>+* > >>+* NOTES > >>+* > >>+* SEE ALSO > >>+* Port, Physical Port > >>+*********/ > >>+ > > > > > >Is not this simpler to remove 'const' from existing > >osm_physp_get_pkey_tbl() function instead of using new one? > There are plenty of const functions using this function internally > so I would have need to fix them too. You are right. Maybe separate patch for this? 
> >>@@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( > >> p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); > >> if ( b < new_blocks ) > >> p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); > >>- else { > >>+ else > >>+ { > >> p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); > >> if (!p_new_block) > >> break; > >>+ cl_ptr_vector_set(&((osm_pkey_tbl_t > >>*)p_pkey_tbl)->new_blocks, + b, > >>p_new_block); > >>+ } > >>+ > >> memset(p_new_block, 0, sizeof(*p_new_block)); > >>- cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, > >>p_new_block); > >> } > >>- memcpy(p_new_block, p_block, sizeof(*p_new_block)); > >>+} > > > > > >You changed this function so it does not do any sync anymore. Should > >function name be changed too? > Yes correct I will change it. Is a better name: > osm_pkey_tbl_init_new_blocks ? Great name. > >>+ to show that on the "old" blocks > >>+*/ > >>+int > >>+osm_pkey_tbl_set_new_entry( > >>+ IN osm_pkey_tbl_t *p_pkey_tbl, > >>+ IN uint16_t block_idx, > >>+ IN uint8_t pkey_idx, > >>+ IN uint16_t pkey) > >>+{ > >>+ ib_pkey_table_t *p_old_block; > >>+ ib_pkey_table_t *p_new_block; > >>+ > >>+ if (osm_pkey_tbl_make_block_pair( > >>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > >>+ return 1; > >>+ > >>+ cl_map_insert( &p_pkey_tbl->keys, > >>+ ib_pkey_get_base(pkey), > >>+ > >>&(p_old_block->pkey_entry[pkey_idx])); > > > > > >Here you map potentially empty pkey entry. Why? "old block" will be > >remapped anyway on pkey receiving. > The reason I did this was that if the GetResp will fail I still want to > represent > the settings in the map.But actually it might be better not to do that so > next > time we run we will not find it without a GetResp. Agree. 
> >>+ IN uint16_t *p_pkey, > >>+ OUT uint32_t *p_block_idx, > >>+ OUT uint8_t *p_pkey_index) > >>+{ > >>+ uint32_t num_of_blocks; > >>+ uint32_t block_index; > >>+ ib_pkey_table_t *block; > >>+ > >>+ CL_ASSERT( p_pkey_tbl ); > >>+ CL_ASSERT( p_block_idx != NULL ); > >>+ CL_ASSERT( p_pkey_idx != NULL ); > > > > > >Why last two CL_ASSERTs? What should be problem with uninitialized > >pointers here? > > > These are the outputs of the function. It does not make sense to call the > functions with > null output pointers (calling by ref) . Anyway instead of putting the check > in the free build > I used an assert I see. Actually I've overlooked that addresses and not values are checked. Please ignore this comment. > >>+ > >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > >>+ if (! p_pkey_tbl) > > > > ^^^^^^^^^^^^^ > >Is it possible? > Yes it is ! I run into it during testing. The port did not have any pkey > table. static inline osm_pkey_tbl_t * osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) { ... return( &p_physp->pkeys ); }; This returns the address of physp's pkeys field. Right? Then if ( &p_physp->pkeys == NULL ) p_physp pointer should be equal to unsigned equivalent of -(offset of pkey field in physp struct). > >>+ "Fail to allocate new pending pkey > >>entry for node " > >>+ "0x%016" PRIx64 " port %u\n", > >>+ cl_ntoh64( osm_node_get_node_guid( > >>p_node ) ), > >>+ osm_physp_get_port_num( p_physp ) ); > >>+ return; > >>+ } > >>+ p_pending->pkey = pkey; > >>+ p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey > >>) ); > >>+ if ( !p_orig_pkey || > >>+ (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) > >>)) > > > > > >There the cases of new pkey and updated pkey membership is mixed. Why? > I am not following your question. > The specific case I am trying to catch is the one that for some reason the > map points to > a pkey entry that was modified somehow and is different then the one you > would expect by > the map. 
Didn't understand it at first pass, now it is clearer. If pkey entry was modified somehow (how? bugs?), the assumption is that mapping still be valid? Then it is not new entry (or we will change pkey's index in the real table). > >>+ { > >>+ p_pending->is_new = TRUE; > >>+ cl_qlist_insert_tail(&p_pkey_tbl->pending, > >>(cl_list_item_t*)p_pending); > >>+ stat = "inserted"; > >>+ } > >>+ else > >>+ { > >>+ p_pending->is_new = FALSE; > >>+ if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, > >>+ > >>&p_pending->block, &p_pending->index)) > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > >AFAIK in this function there were CL_ASSERTs which check for uinitialized > >pointers. > True. So the asserts are not required in this case. Up to you. Actually this my comment may be ignored, as stated above I didn't read this correctly. > > > > > >>+ { > >>+ osm_log( p_log, OSM_LOG_ERROR, > >>+ "pkey_mgr_process_physical_port: > >>ERR 0503: " > >>+ "Fail to obtain P_Key 0x%04x > >>block and index for node " > >>+ "0x%016" PRIx64 " port %u\n", > >>+ cl_ntoh64( > >>osm_node_get_node_guid( p_node ) ), > >>+ osm_physp_get_port_num( > >>p_physp ) ); > >>+ return; > >>+ } > >>+ cl_qlist_insert_head(&p_pkey_tbl->pending, > >>(cl_list_item_t*)p_pending); > >>+ stat = "updated"; > > > > > >Is it will be updated? It is likely "already there" case. No? > > > >Also in this case you can already put the pkey in new_block instead of > >holding it in pending list. Then later you will only need to add new > >pkeys. This may simplify the flow and even save some mem. > True but in my mind it does not simplify - on the contrary it makes the > partition between > populating each port pending list and actually setting the pkey tables > mixed. I meant new_block filling, not actual setting. You will be able to remove whole if { } else { } flow, as well as is_new, block and index fields from 'pending' structure (actually only pkey value itself will matter) - is it not nice simplification? 
> I do not think the memory impact deserves this mix of staging > > > > > > >>+ max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, > >>p_physp ); > >>+ if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) > >> { > >>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > >>- for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) > >>+ osm_log( p_log, OSM_LOG_INFO, > >>+ "pkey_mgr_update_port: " > >>+ "Max number of blocks reduced from > >>%u to %u " + "for node 0x%016" PRIx64 " > >>port %u\n", > >>+ p_pkey_tbl->max_blocks, > >>max_num_of_blocks, > >>+ cl_ntoh64( osm_node_get_node_guid( > >>p_node ) ), > >>+ osm_physp_get_port_num( p_physp ) ); > >>+ } > >>+ p_pkey_tbl->max_blocks = max_num_of_blocks; > >>+ > >>+ osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); > >>+ cl_map_remove_all( &p_pkey_tbl->keys ); > > > > > >What is the reason to drop map here? AFAIK it will be reinitialized later > >anyway when pkey blocks will be received. > What if it is not received? Then we will have unreliable data there. Maybe I know why you wanted this - this is part of "use pkey tables before sending/receiving to/from ports" idea? 
> >>@@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( > >> if (enforce == FALSE) > >> return FALSE; > >> > >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); > >>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); > >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > >> num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > >>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) > >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); > >>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); > >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) > >>+ { > >>+ osm_log( p_log, OSM_LOG_ERROR, > >>+ "pkey_mgr_update_peer_port: ERR > >>0508: " > >>+ "not enough entries (%u < %u) on > >>switch 0x%016" PRIx64 > >>+ " port %u\n", > >>+ peer_max_blocks, num_of_blocks, > >>+ cl_ntoh64( osm_node_get_node_guid( > >>p_node ) ), > >>+ osm_physp_get_port_num( peer ) ); > >>+ return FALSE; > > > > > >Do you think it is the best way, just to skip update - partitions are > >enforced already on the switch. May be better to truncate pkey tables > >in order to meet peer's capabilities? > You are right about that - Its a bug! > I think the best approach here is to turn off the enforcement on the switch. > If we truncate the table we actually impact connectivity of the fabric. > I prefer a softer approach - an error in the log. Yes this should be good way to handle this. 
> > > > > >>+ } > >> > >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > >>+ p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; > >>+ for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; > >>block_index++) > >> { > >> block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > >> peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); > >> if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) > >> { > >>+ osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, > >>block); > > > > > >Why this (osm_pkey_tbl_set())? This will be called by receiver. > Same as the above note about updating the map > I wanted to avoid to wait for the GetResp. > I think it is a mistake and we can actually remove it. Agree. Sasha. From krause at cup.hp.com Thu Jun 15 10:55:06 2006 From: krause at cup.hp.com (Michael Krause) Date: Thu, 15 Jun 2006 10:55:06 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606131933.04267008@netapp.com> References: <7.0.1.0.2.20060606131933.04267008@netapp.com> Message-ID: <6.2.0.14.2.20060615104459.06451e28@esmail.cup.hp.com> As one of the authors of IB and iWARP, I can say that both Roland and Todd's responses are correct and the intent of the specifications. The number of outstanding RDMA Reads are bounded and that is communicated during session establishment. The ULP can choose to be aware of this requirement (certainly when we wrote iSER and DA we were well aware of the requirement and we documented as such in the ULP specs) and track from above so that it does not see a stall or it can stay ignorant and deal with the stall as a result. This is a ULP choice and has been intentionally done that way so that the hardware can be kept as simple as possible and as low cost as well while meeting the breadth of ULP needs that were used to develop these technologies. 
Tom, you raised this issue during iWARP's definition, and the debate was conducted at least several times. The outcome of those debates is reflected in iWARP and remains aligned with IB. So, unless you really want to have the IETF and IBTA go and modify their specs, I believe you'll have to deal with the issue just as other ULPs are doing today: be aware of the constraint and write the software accordingly. The open source community isn't really the right forum to change the iWARP and IB specifications at the end of the day. Build a case in the IETF and IBTA and let those bodies determine whether it is appropriate to modify their specs or not. And yes, it would be a modification of the specs, and therefore of the hardware implementations as well, to address any interoperability requirements that would result (the change proposed could fragment the hardware offerings, as there are many thousands of devices in the market that would not necessarily support this change).

Mike

At 12:07 PM 6/6/2006, Talpey, Thomas wrote:
>Todd, thanks for the set-up. I'm really glad we're having this discussion!
>
>Let me give an NFS/RDMA example to illustrate why this upper layer,
>at least, doesn't want the HCA doing its flow control, or resource
>management.
>
>NFS/RDMA is a credit-based protocol which allows many operations in
>progress at the server. Let's say the client is currently running with
>an RPC slot table of 100 requests (a typical value).
>
>Of these requests, some workload-specific percentage will be reads,
>writes, or metadata. All NFS operations consist of one send from
>client to server, some number of RDMA writes (for NFS reads) or
>RDMA reads (for NFS writes), then terminated with one send from
>server to client.
>
>The number of RDMA read or write operations per NFS op depends
>on the amount of data being read or written, and also the memory
>registration strategy in use on the client.
The highest-performing >such strategy is an all-physical one, which results in one RDMA-able >segment per physical page. NFS r/w requests are, by default, 32KB, >or 8 pages typical. So, typically 8 RDMA requests (read or write) are >the result. > >To illustrate, let's say the client is processing a multi-threaded >workload, with (say) 50% reads, 20% writes, and 30% metadata >such as lookup and getattr. A kernel build, for example. Therefore, >of our 100 active operations, 50 are reads for 32KB each, 20 are >writes of 32KB, and 30 are metadata (non-RDMA). > >To the server, this results in 100 requests, 100 replies, 400 RDMA >writes, and 160 RDMA Reads. Of course, these overlap heavily due >to the widely differing latency of each op and the highly distributed >arrival times. But, for the example this is a snapshot of current load. > >The latency of the metadata operations is quite low, because lookup >and getattr are acting on what is effectively cached data. The reads >and writes however, are much longer, because they reference the >filesystem. When disk queues are deep, they can take many ms. > >Imagine what happens if the client's IRD is 4 and the server ignores >its local ORD. As soon as a write begins execution, the server posts >8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads >are sent, the fifth stalls, and stalls the send queue! Even when three >RDMA Reads complete, the queue remains stalled, it doesn't unblock >until the fourth is done and all the RDMA Reads have been initiated. > >But, what just happened to all the other server send traffic? All those >metadata replies, and other reads which completed? They're stuck, >waiting for that one write request. In my example, these number 99 NFS >ops, i.e. 654 WRs! All for one NFS write! The client operation stream >effectively became single threaded. What good is the "rapid initiation >of RDMA Reads" you describe in the face of this? 
>
>Yes, there are many arcane and resource-intensive ways around it.
>But the simplest by far is to count the RDMA Reads outstanding, and
>for the *upper layer* to honor ORD, not the HCA. Then, the send queue
>never blocks, and the operation stream never loses parallelism. This
>is what our NFS server does.
>
>As to the depth of IRD, this is a different calculation; it's a
>delay-bandwidth product of the RDMA Read stream. 4 is good for local,
>low latency connections. But over a complicated switch infrastructure,
>or heaven forbid a dark fiber long link, I guarantee it will cause a
>bottleneck. This isn't an issue except for operations that care, but it
>is certainly detectable. I would like to see if a pure RDMA Read stream
>can fully utilize a typical IB fabric, and how much headroom an IRD of 4
>provides. Not much, I predict.
>
>Closing the connection if IRD is "insufficient to meet goals" isn't a good
>answer, IMO. How does that benefit interoperability?
>
>Thanks for the opportunity to spout off again. Comments welcome!
>
>Tom.
>
>At 12:43 PM 6/6/2006, Rimmer, Todd wrote:
> >
> >> Talpey, Thomas
> >> Sent: Tuesday, June 06, 2006 10:49 AM
> >>
> >> At 10:40 AM 6/6/2006, Roland Dreier wrote:
> >> > Thomas> This is the difference between "may" and "must". The value
> >> > Thomas> is provided, but I don't see anything in the spec that
> >> > Thomas> makes a requirement on its enforcement. Table 107 says the
> >> > Thomas> consumer can query it, that's about as close as it
> >> > Thomas> comes. There's some discussion about CM exchange too.
> >> >
> >> >This seems like a very strained interpretation of the spec. For
> >>
> >> I don't see how strained has anything to do with it. It's not saying
> >> anything either way. So, a legal implementation can make either choice.
> >> We're talking about the spec!
> >>
> >> But, it really doesn't matter.
The point is, an upper layer should be paying
> >> attention to the number of RDMA Reads it posts, or else suffer either
> >> the queue-stalling or connection-failing consequences. Bad stuff either
> >> way.
> >>
> >> Tom.
> >
> >Somewhere beneath this discussion is a bug in the application or IB
> >stack. I'm not sure which "may" in the spec you are referring to, but
> >the "may"s I have found are all for cases where the responder might
> >support only 1 outstanding request. In all cases the negotiation
> >protocol must be followed and the requestor is not allowed to exceed the
> >negotiated limit.
> >
> >The mechanism should be:
> >the client queries its local HCA and determines responder resources (e.g.
> >the number of concurrent outstanding RDMA reads on the wire from the
> >remote end, where this end will respond with the read data) and initiator
> >depth (e.g. the number of concurrent outstanding RDMA reads which this
> >end can initiate as the requestor).
> >
> >The client puts the above information in the CM REQ.
> >
> >The server similarly gets its information from its local CA and negotiates
> >the values down to the MIN of each side (REP.InitiatorDepth =
> >MIN(REQ.ResponderResources, server's local CA's initiator depth);
> >REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's
> >responder resources)). If the server does not support RDMA Reads, it can
> >REJ.
> >
> >If the client decides the negotiated values are insufficient to meet its
> >goals, it can disconnect.
> >
> >Each side sets its QP parameters via modify QP appropriately.
> >Note they too will be mirror images of each other:
> >client:
> >QP.Max RDMA Reads as Initiator = REP.ResponderResources
> >QP.Max RDMA Reads as responder = REP.InitiatorDepth
> >
> >server:
> >QP.Max RDMA Reads as responder = REP.ResponderResources
> >QP.Max RDMA Reads as initiator = REP.InitiatorDepth
> >
> >We have done a lot of high-stress RDMA Read traffic with Mellanox HCAs
> >and, provided the above negotiation is followed, we have seen no issues.
> >Note however that by default a Mellanox HCA typically reports a large
> >InitiatorDepth (128) and a modest ResponderResources (4-8). Hence when
> >I hear that Responder Resources must be grown to 128 for some
> >application to reliably work, it implies the negotiation I outlined
> >above is not being followed.
> >
> >Note that the ordering rules in table 76 of IBTA 1.2 show how reads and
> >writes on a send queue are ordered. There are many cases where an op can
> >pass an outstanding RDMA read, hence it is not always bad to queue extra
> >RDMA reads. If needed, the Fence can be sent to force order.
> >
> >For many apps, it's going to be better to get the items onto the queue and
> >let the QP handle the outstanding-reads cases rather than have the app
> >add a level of queuing for this purpose. Letting the HCA do the queuing
> >will allow for a more rapid initiation of subsequent reads.
> >
> >Todd Rimmer
>
>_______________________________________________
>openib-general mailing list
>openib-general at openib.org
>http://openib.org/mailman/listinfo/openib-general
>
>To unsubscribe, please visit
>http://openib.org/mailman/listinfo/openib-general
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From ralphc at pathscale.com Thu Jun 15 11:31:20 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 11:31:20 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response Message-ID: <1150396280.32252.46.camel@brick.pathscale.com> I am working on a ipathverbs.so version of ibv_poll_cq(), ibv_post_recv(), and ibv_post_srq_recv() which mmaps the queue into user space. I found that I needed to modify the core libibverbs and kernel uverbs code in order to return the information I need from ib_ipath to the ipathverbs.so library. This patch adds those generic code changes. A subsequent patch will add the InfiniPath specific changes. Note that I didn't include matching changes to ehca since I don't have HW to test with but I can try to make a patch that allows it compile if requested to. Signed-off-by: Ralph Campbell Index: src/userspace/libibverbs/src/cmd.c =================================================================== --- src/userspace/libibverbs/src/cmd.c (revision 8021) +++ src/userspace/libibverbs/src/cmd.c (working copy) @@ -384,6 +384,23 @@ return 0; } +int ibv_cmd_resize_cq_resp(struct ibv_cq *cq, int cqe, + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size) +{ + + IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, resp, resp_size); + cmd->cq_handle = cq->handle; + cmd->cqe = cqe; + + if (write(cq->context->cmd_fd, cmd, cmd_size) != cmd_size) + return errno; + + cq->cqe = resp->cqe; + + return 0; +} + static int ibv_cmd_destroy_cq_v1(struct ibv_cq *cq) { struct ibv_destroy_cq_v1 cmd; Index: src/userspace/libibverbs/src/libibverbs.map =================================================================== --- src/userspace/libibverbs/src/libibverbs.map (revision 8021) +++ src/userspace/libibverbs/src/libibverbs.map (working copy) @@ -48,6 +48,7 @@ ibv_cmd_poll_cq; ibv_cmd_req_notify_cq; ibv_cmd_resize_cq; + ibv_cmd_resize_cq_resp; ibv_cmd_destroy_cq; 
ibv_cmd_create_srq; ibv_cmd_modify_srq; Index: src/userspace/libibverbs/include/infiniband/driver.h =================================================================== --- src/userspace/libibverbs/include/infiniband/driver.h (revision 8021) +++ src/userspace/libibverbs/include/infiniband/driver.h (working copy) @@ -96,6 +96,9 @@ int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, struct ibv_resize_cq *cmd, size_t cmd_size); +int ibv_cmd_resize_cq_resp(struct ibv_cq *cq, int cqe, + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size); int ibv_cmd_destroy_cq(struct ibv_cq *cq); int ibv_cmd_create_srq(struct ibv_pd *pd, Index: src/userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- src/userspace/libibverbs/include/infiniband/kern-abi.h (revision 8021) +++ src/userspace/libibverbs/include/infiniband/kern-abi.h (working copy) @@ -355,6 +355,8 @@ struct ibv_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ibv_destroy_cq { Index: src/linux-kernel/infiniband/core/uverbs_cmd.c =================================================================== --- src/linux-kernel/infiniband/core/uverbs_cmd.c (revision 8021) +++ src/linux-kernel/infiniband/core/uverbs_cmd.c (working copy) @@ -1258,6 +1258,7 @@ int out_len) { struct ib_uverbs_modify_qp cmd; + struct ib_udata udata; struct ib_qp *qp; struct ib_qp_attr *attr; int ret; @@ -1265,6 +1266,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) return -ENOMEM; @@ -1321,7 +1325,7 @@ attr->alt_ah_attr.ah_flags = cmd.alt_dest.is_global ? 
IB_AH_GRH : 0; attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; - ret = ib_modify_qp(qp, attr, cmd.attr_mask); + ret = qp->device->modify_qp(qp, attr, cmd.attr_mask, &udata); put_qp_read(qp); @@ -2031,6 +2035,7 @@ int out_len) { struct ib_uverbs_modify_srq cmd; + struct ib_udata udata; struct ib_srq *srq; struct ib_srq_attr attr; int ret; @@ -2038,6 +2043,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + srq = idr_read_srq(cmd.srq_handle, file->ucontext); if (!srq) return -EINVAL; @@ -2045,7 +2053,7 @@ attr.max_wr = cmd.max_wr; attr.srq_limit = cmd.srq_limit; - ret = ib_modify_srq(srq, &attr, cmd.attr_mask); + ret = srq->device->modify_srq(srq, &attr, cmd.attr_mask, &udata); put_srq_read(srq); Index: src/linux-kernel/infiniband/core/verbs.c =================================================================== --- src/linux-kernel/infiniband/core/verbs.c (revision 8021) +++ src/linux-kernel/infiniband/core/verbs.c (working copy) @@ -231,7 +231,7 @@ struct ib_srq_attr *srq_attr, enum ib_srq_attr_mask srq_attr_mask) { - return srq->device->modify_srq(srq, srq_attr, srq_attr_mask); + return srq->device->modify_srq(srq, srq_attr, srq_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_srq); @@ -547,7 +547,7 @@ struct ib_qp_attr *qp_attr, int qp_attr_mask) { - return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_qp); Index: src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h =================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -275,6 +275,8 @@ struct ib_uverbs_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ib_uverbs_poll_cq { Index: src/linux-kernel/infiniband/include/rdma/ib_verbs.h 
=================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/include/rdma/ib_verbs.h (working copy) @@ -911,7 +911,8 @@ struct ib_udata *udata); int (*modify_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr, - enum ib_srq_attr_mask srq_attr_mask); + enum ib_srq_attr_mask srq_attr_mask, + struct ib_udata *udata); int (*query_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int (*destroy_srq)(struct ib_srq *srq); @@ -923,7 +924,8 @@ struct ib_udata *udata); int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask); + int qp_attr_mask, + struct ib_udata *udata); int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, Index: src/linux-kernel/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -506,7 +506,7 @@ struct ib_srq_attr *attr, struct mthca_srq *srq); void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq); int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata); int mthca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int mthca_max_srq_sge(struct mthca_dev *dev); void mthca_srq_event(struct mthca_dev *dev, u32 srqn, @@ -521,7 +521,8 @@ enum ib_event_type event_type); int mthca_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata); int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr); int 
mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, Index: src/linux-kernel/infiniband/hw/mthca/mthca_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (working copy) @@ -522,7 +522,8 @@ return 0; } -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); Index: src/linux-kernel/infiniband/hw/mthca/mthca_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (working copy) @@ -357,7 +357,7 @@ } int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibsrq->device); struct mthca_srq *srq = to_msrq(ibsrq); Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (working copy) @@ -577,7 +577,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask); + int attr_mask, struct ib_udata *udata); int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); @@ -636,7 +636,8 @@ struct ib_udata *udata); int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata); int ipath_query_srq(struct ib_srq *ibsrq, struct 
ib_srq_attr *attr); Index: src/linux-kernel/infiniband/hw/ipath/ipath_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (working copy) @@ -425,11 +425,12 @@ * @ibqp: the queue pair who's attributes we're modifying * @attr: the new attributes * @attr_mask: the mask of attributes to modify + * @udata: not used by the InfiniPath verbs driver * * Returns 0 on success, otherwise returns an errno. */ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask) + int attr_mask, struct ib_udata *udata) { struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); Index: src/linux-kernel/infiniband/hw/ipath/ipath_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (working copy) @@ -187,9 +187,10 @@ * @ibsrq: the SRQ to modify * @attr: the new attributes of the SRQ * @attr_mask: indicates which attributes to modify + * @udata: not used by the InfiniPath verbs driver */ int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata) { struct ipath_srq *srq = to_isrq(ibsrq); unsigned long flags; -- Ralph Campbell From swise at opengridcomputing.com Thu Jun 15 13:11:57 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 15:11:57 -0500 Subject: [openib-general] [PATCH] backlog ignored when listening on all devs Message-ID: <1150402317.6612.8.camel@stevo-desktop> Sean, I think this is a bug, eh? If you listen on 0.0.0.0, then the backlog isn't passed down to the devices because its not stored in the id_priv struct before calling cma_listen_on_all(). See cma_list_on_dev() which uses id_priv->backlog... 
Signed-off-by: Steve Wise

----------

Index: cma.c
===================================================================
--- cma.c	(revision 7626)
+++ cma.c	(working copy)
@@ -1086,6 +1086,7 @@
 	if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN))
 		return -EINVAL;
 
+	id_priv->backlog = backlog;
 	if (id->device) {
 		switch (rdma_node_get_transport(id->device->node_type)) {
 		case RDMA_TRANSPORT_IB:
@@ -1100,9 +1101,9 @@
 	} else
 		cma_listen_on_all(id_priv);
 
-	id_priv->backlog = backlog;
 	return 0;
 err:
+	id_priv->backlog = 0;
 	cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND);
 	return ret;
 }

From rdreier at cisco.com Thu Jun 15 13:55:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 13:55:11 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: <1150396280.32252.46.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 15 Jun 2006 11:31:20 -0700") References: <1150396280.32252.46.camel@brick.pathscale.com> Message-ID: 

 > I am working on a ipathverbs.so version of ibv_poll_cq(),
 > ibv_post_recv(), and ibv_post_srq_recv() which mmaps the
 > queue into user space. I found that I needed to modify the
 > core libibverbs and kernel uverbs code in order to return
 > the information I need from ib_ipath to the ipathverbs.so
 > library.

That's weird, since libmthca operates on userspace queues and doesn't need this. I'd like to see the ipath changes before we merge this, because there may be a better way to do things.

 > Note that I didn't include matching changes to ehca since
 > I don't have HW to test with but I can try to make a patch
 > that allows it compile if requested to.

Yes, please include that. The changes should be the same obvious ones as for mthca. Also, this breaks the ABI, and I don't see any compatibility code for libibverbs.

 > ibv_cmd_resize_cq;
 > + ibv_cmd_resize_cq_resp;

I guess we need to include a new function if we want to support this in libibverbs 1.0.
But for libibverbs 1.1 I would prefer to just change the prototype for ibv_resize_cq.

 - R.

From ardavis at ichips.intel.com Thu Jun 15 14:31:01 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 15 Jun 2006 14:31:01 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> Message-ID: <4491D195.8030106@ichips.intel.com>

Woodruff, Robert J wrote:
> It appears that processes are not exiting cleanly on SVN7946 trunk
> backported to 2.6.9-34 EL.
>
> They seem to be stuck in a state of "DL" and I cannot even attach to them
> with gdb or kill them with a kill -9.
>
> [root at iclust-1 core]# ps -uax | grep IMB
> woody  4087  0.0  0.0 58500 3172 pts/3 T  14:45 0:00 gdb ./IMB-MPI1 -p 4067
> woody  4067  2.3  0.0 33108 2708 ?     DL 14:44 0:12 ./IMB-MPI1
> woody  4109  3.1  0.0 40148 2572 ?     DL 14:47 0:12 ./IMB-MPI1
> root   4156  0.0  0.0 51080  732 pts/3 S+ 14:53 0:00 grep IMB
>
> The last code I pulled, SVN7843, did not have this problem.
>
> Any ideas on what might be causing this?

I see the same thing running the uDAPL test (dapl/test/dtest). I am running a 2.6.16 kernel and svn8805, and it appears to be deadlocked (uninterruptible sleep) in the ibv_destroy_cq() call. This all worked fine on svn7843, my last update on these systems.
-arlin > woody > >------------------------------------------------------------------------ > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mamidala at cse.ohio-state.edu Thu Jun 15 14:24:38 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Thu, 15 Jun 2006 17:24:38 -0400 (EDT) Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: <1150219552.17394.23.camel@stevo-desktop> Message-ID: Hi, With the latest rping code (Revision: 8055) I am still able to see this race condition. server side: [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 server ping data: rdma-ping-0: ABCDEFGHIJKL server ping data: rdma-ping-1: BCDEFGHIJKLM server ping data: rdma-ping-2: CDEFGHIJKLMN server ping data: rdma-ping-3: DEFGHIJKLMNO server ping data: rdma-ping-4: EFGHIJKLMNOP server ping data: rdma-ping-5: FGHIJKLMNOPQ server ping data: rdma-ping-6: GHIJKLMNOPQR server ping data: rdma-ping-7: HIJKLMNOPQRS server ping data: rdma-ping-8: IJKLMNOPQRST server ping data: rdma-ping-9: JKLMNOPQRSTU server DISCONNECT EVENT... wait for RDMA_READ_ADV state 9 cq completion failed status 5 Client side: [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 ping data: rdma-ping-0: ABCDEFGHIJKL ping data: rdma-ping-1: BCDEFGHIJKLM ping data: rdma-ping-2: CDEFGHIJKLMN ping data: rdma-ping-3: DEFGHIJKLMNO ping data: rdma-ping-4: EFGHIJKLMNOP ping data: rdma-ping-5: FGHIJKLMNOPQ ping data: rdma-ping-6: GHIJKLMNOPQR ping data: rdma-ping-7: HIJKLMNOPQRS ping data: rdma-ping-8: IJKLMNOPQRST ping data: rdma-ping-9: JKLMNOPQRSTU cq completion failed status 5 client DISCONNECT EVENT... Thanks, Amith On Tue, 13 Jun 2006, Steve Wise wrote: > Thanks, applied. > > iwarp branch: r7964 > trunk: r7966 > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. 
Faulkner wrote: > > This patch resolves a race condition between the receipt of > > a connection established event and a receive completion from > > the client. The server no longer goes to connected state but > > merely waits for the READ_ADV state to begin its looping. This > > keeps the server from going back to CONNECTED from the later > > states if the connection established event comes in after the > > receive completion (i.e. the loop starts). > > > > Signed-off-by: Boyd Faulkner > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu Jun 15 14:33:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 14:33:41 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: <4491D195.8030106@ichips.intel.com> (Arlin Davis's message of "Thu, 15 Jun 2006 14:31:01 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> Message-ID: Arlin> I see the same thing running the uDAPL test Arlin> (dapl/test/dtest). I am running a 2.6.16 kernel and svn8805 Arlin> and it appears to be deadlocked (uninterruptible sleep) in Arlin> the ibv_destroy_cq() call. This all worked fine on Arlin> svn7843; my last update on these systems. Hmm, any further clue where in ibv_destroy_cq() it's stuck? Is it doing down_write() or something? This is probably fallout from my kill-ib_uverbs_idr_mutex change... - R. 
From rdreier at cisco.com Thu Jun 15 14:35:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 14:35:03 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: (Roland Dreier's message of "Thu, 15 Jun 2006 14:33:41 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> Message-ID: Roland> Hmm, any further clue where in ibv_destroy_cq() it's Roland> stuck? Is it doing down_write() or something? Can you send me full sysrq-t output when it gets stuck? Thanks... From ralphc at pathscale.com Thu Jun 15 14:41:44 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 14:41:44 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: References: <1150396280.32252.46.camel@brick.pathscale.com> Message-ID: <1150407704.32252.65.camel@brick.pathscale.com> On Thu, 2006-06-15 at 13:55 -0700, Roland Dreier wrote: > > I am working on a ipathverbs.so version of ibv_poll_cq(), > > ibv_post_recv(), and ibv_post_srq_recv() which mmaps the > > queue into user space. I found that I needed to modify the > > core libibverbs and kernel uverbs code in order to return > > the information I need from ib_ipath to the ipathverbs.so > > library. > > That's weird, since libmthca operates on userspace queues and doesn't > need this. I'd like to see the ipath changes before we merge this, > because there may be a better way to do things. libmthca uses a single shared page which is created at driver open time. I'm mmaping vmalloc memory created at ibv_create_cq(), qp, srq time so I need a way to return the offset to ipathverbs.so to then pass to mmap(). > > Note that I didn't include matching changes to ehca since > > I don't have HW to test with but I can try to make a patch > > that allows it compile if requested to. > > Yes, please include that. The changes should be the same obvious ones > as for mthca. OK. 
> Also, this breaks the ABI, and I don't see any compatibility code for > libibverbs. The new kernel drivers work with the old libibverbs and vice versa since only the cqe entry in struct ibv_resize_cq_resp is used. The reserved entry is only needed to avoid using "packed" structs if struct ibv_resize_cq_resp is included in another struct. > > ibv_cmd_resize_cq; > > + ibv_cmd_resize_cq_resp; > > I guess we need to include a new function if we want to support this > in libibverbs 1.0. But for libibverbs 1.1 I would prefer to just > change the prototype for ibv_resize_cq. I thought about this trade off too. Either way is OK with me. I will post the current HW specific changes soon. I have code for everything except resizing the QP's receive queue. > - R. -- Ralph Campbell From caitlinb at broadcom.com Thu Jun 15 14:45:57 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 15 Jun 2006 14:45:57 -0700 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. Message-ID: <54AD0F12E08D1541B826BE97C98F99F15767E3@NT-SJCA-0751.brcm.ad.broadcom.com> netdev-owner at vger.kernel.org wrote: > On Thu, 2006-06-15 at 08:41 -0500, Steve Wise wrote: >> On Wed, 2006-06-14 at 20:35 -0500, Bob Sharp wrote: >> >>>> +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) { >>>> + >> >> >> >>>> + case C2_RES_IND_EP:{ >>>> + >>>> + struct c2wr_ae_connection_request *req = >>>> + &wr->ae.ae_connection_request; >>>> + struct iw_cm_id *cm_id = >>>> + (struct iw_cm_id *)resource_user_context; >>>> + >>>> + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); >>>> + if (event_id != CCAE_CONNECTION_REQUEST) { >>>> + pr_debug("%s: Invalid event_id: %d\n", >>>> + __FUNCTION__, event_id); >>>> + break; >>>> + } >>>> + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; >>>> + cm_event.provider_data = (void*)(unsigned long)req->cr_handle; >>>> + cm_event.local_addr.sin_addr.s_addr = req->laddr; >>>> + cm_event.remote_addr.sin_addr.s_addr = req->raddr; >>>> + cm_event.local_addr.sin_port 
= req->lport; >>>> + cm_event.remote_addr.sin_port = req->rport; >>>> + cm_event.private_data_len = >>>> + be32_to_cpu(req->private_data_length); >>>> + >>>> + if (cm_event.private_data_len) { >>> >>> >>> It looks to me as if pdata is leaking here since it is not tracked >>> and the upper layers do not free it. Also, if pdata is freed after >>> the call to cm_id->event_handler returns, it exposes an issue in >>> user space where the private data is garbage. I suspect the iwarp >>> cm should be copying this data before it returns. >>> >> >> Good catch. >> >> Yes, I think the IWCM should copy the private data in the upcall. If >> it does, then the amso driver doesn't need to kmalloc()/copy at all. >> It can pass a ptr to its MQ entry directly... >> > > Now that I've looked more into this, I'm not sure there's a > simple way for the IWCM to copy the pdata on the upcall. > Currently, the IWCM's event upcall, cm_event_handler(), > simply queues the work for processing on a workqueue thread. > So there's no per-event logic at all there. > Lemme think on this more. Stay tuned. > > Either way, the amso driver has a memory leak... > Having the IWCM copy the pdata during the upcall also leaves the greatest flexibility for the driver on how/where the pdata is captured. The IWCM has to deal with user-mode, indefinite delays waiting for a response and user-mode processes that die while holding a connection request. So it makes sense for that layer to do the allocating and copying. 
From rdreier at cisco.com Thu Jun 15 14:56:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 14:56:48 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: <1150407704.32252.65.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 15 Jun 2006 14:41:44 -0700") References: <1150396280.32252.46.camel@brick.pathscale.com> <1150407704.32252.65.camel@brick.pathscale.com> Message-ID: Ralph> libmthca uses a single shared page which is created at Ralph> driver open time. I'm mmaping vmalloc memory created at Ralph> ibv_create_cq(), qp, srq time so I need a way to return the Ralph> offset to ipathverbs.so to then pass to mmap(). Hmm... it seems simpler to have userspace allocate the memory with mmap() before the resize_cq call, and then pass that new buffer into the resize_cq call. That way you don't have a window where the kernel is putting completions into a buffer that userspace doesn't know about. Ralph> The new kernel drivers work with the old libibverbs and Ralph> vice versa since only the cqe entry in struct Ralph> ibv_resize_cq_resp is used. The reserved entry is only Ralph> needed to avoid using "packed" structs if struct Ralph> ibv_resize_cq_resp is included in another struct. OK, I guess we're OK, since the kernel isn't checking the size of the response buffer. old libipathverbs does need to bail out on a new ipath kernel driver, though, or else you'll get corruption when responses go off the end of a buffer. - R. From swise at opengridcomputing.com Thu Jun 15 14:58:29 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 16:58:29 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. 
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F15767E3@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F15767E3@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1150408709.6612.16.camel@stevo-desktop> > > Now that I've looked more into this, I'm not sure there's a > > simple way for the IWCM to copy the pdata on the upcall. > > Currently, the IWCM's event upcall, cm_event_handler(), > > simply queues the work for processing on a workqueue thread. > > So there's no per-event logic at all there. > > Lemme think on this more. Stay tuned. > > > > Either way, the amso driver has a memory leak... > > > > Having the IWCM copy the pdata during the upcall also leaves > the greatest flexibility for the driver on how/where the pdata > is captured. The IWCM has to deal with user-mode, indefinite > delays waiting for a response and user-mode processes that die > while holding a connection request. So it makes sense for that > layer to do the allocating and copying. I've already coded and tested this. The IWCM will copy the pdata... Steve. From ardavis at ichips.intel.com Thu Jun 15 14:57:53 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 15 Jun 2006 14:57:53 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> Message-ID: <4491D7E1.5050504@ichips.intel.com> Roland Dreier wrote: > Roland> Hmm, any further clue where in ibv_destroy_cq() it's > Roland> stuck? Is it doing down_write() or something? > >Can you send me full sysrq-t output when it gets stuck? > >Thanks... > > > I just added ibv_destroy_cq() to ibv_rc_pingpong test. Here's the output....
open("/sys/class/infiniband_verbs/abi_version", O_RDONLY) = 3 read(3, "6\n", 8) = 2 close(3) = 0 open("/sys/class/infiniband_verbs", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 fstat(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 getdents64(3, /* 4 entries */, 4096) = 112 open("/sys/class/infiniband_verbs/uverbs0/abi_version", O_RDONLY) = 4 read(4, "1\n", 8) = 2 close(4) = 0 open("/sys/class/infiniband_verbs/uverbs0/ibdev", O_RDONLY) = 4 read(4, "mthca0\n", 64) = 7 close(4) = 0 open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 4 read(4, "0x15b3\n", 8) = 7 close(4) = 0 open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 4 read(4, "0x6278\n", 8) = 7 close(4) = 0 getdents64(3, /* 0 entries */, 4096) = 0 close(3) = 0 open("/dev/infiniband/uverbs0", O_RDWR) = 3 write(3, "\0\0\0\0\4\0\4\0\300\227\221\377\377\177\0\0", 16) = 16 mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, 3, 0) = 0x2b318fa6f000 write(3, "\3\0\0\0\4\0\3\0\200\227\221\377\377\177\0\0", 16) = 16 write(3, "\3\0\0\0\4\0\3\0\320\227\221\377\377\177\0\0", 16) = 16 write(3, "\t\0\0\0\f\0\3\0`\227\221\377\377\177\0\0\0pP\0\0\0\0\0"..., 48) = 48 write(3, "\t\0\0\0\f\0\3\0\240\226\221\377\377\177\0\0\0\240P\0\0"..., 48) = 48 write(3, "\22\0\0\0\22\0\4\0p\227\221\377\377\177\0\0\320nP\0\0\0"..., 72) = 72 write(3, "\t\0\0\0\f\0\3\0\240\226\221\377\377\177\0\0\0\360P\0\0"..., 48) = 48 write(3, "\30\0\0\0\30\0\10\0`\227\221\377\377\177\0\0p\221P\0\0"..., 96) = 96 write(3, "\32\0\0\0\36\0\0\0\250Y\1a9\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 120) = 120 write(3, "\2\0\0\0\6\0\n\0`\227\221\377\377\177\0\0\1lQ\0\0\0\0\0"..., 24) = 24 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 7), ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b318fa70000 write(1, " local address: LID 0x0004, QP"..., 57 local address: LID 0x0004, QPN 0x040407, PSN 0xce99bd ) = 57 socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 5 connect(5, {sa_family=AF_INET6, 
sin6_port=htons(18515), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 getsockname(5, {sa_family=AF_INET6, sin6_port=htons(32770), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [22635233564164124]) = 0 close(5) = 0 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 5 connect(5, {sa_family=AF_INET, sin_port=htons(18515), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 getsockname(5, {sa_family=AF_INET, sin_port=htons(32770), sin_addr=inet_addr("127.0.0.1")}, [22635233564164112]) = 0 close(5) = 0 socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP) = 5 setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [22635233564164097], 4) = 0 bind(5, {sa_family=AF_INET6, sin6_port=htons(18515), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 listen(5, 1) = 0 accept(5, 0, NULL) = 6 close(5) = 0 read(6, "0005:040407:abb228\0", 19) = 19 write(3, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 120) = 120 write(3, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 120) = 120 write(6, "0004:040407:ce99bd\0", 19) = 19 read(6, "done\0", 19) = 5 close(6) = 0 write(1, " remote address: LID 0x0005, QP"..., 57 remote address: LID 0x0005, QPN 0x040407, PSN 0xabb228 ) = 57 write(1, " calling destroy_cq\n", 20 calling destroy_cq ) = 20 write(3, "\24\0\0\0\6\0\2\0\250\227\221\377\377\177\0\0\7\0\0\0\0"..., 24 From swise at opengridcomputing.com Thu Jun 15 15:03:47 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 17:03:47 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: References: Message-ID: <1150409027.6612.20.camel@stevo-desktop> This is the normal output for rping... The status error on the completion is 5 (FLUSHED), which is normal. Steve. On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > Hi, > > With the latest rping code (Revision: 8055) I am still able to see this > race condition. 
> > server side: > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > server ping data: rdma-ping-0: ABCDEFGHIJKL > server ping data: rdma-ping-1: BCDEFGHIJKLM > server ping data: rdma-ping-2: CDEFGHIJKLMN > server ping data: rdma-ping-3: DEFGHIJKLMNO > server ping data: rdma-ping-4: EFGHIJKLMNOP > server ping data: rdma-ping-5: FGHIJKLMNOPQ > server ping data: rdma-ping-6: GHIJKLMNOPQR > server ping data: rdma-ping-7: HIJKLMNOPQRS > server ping data: rdma-ping-8: IJKLMNOPQRST > server ping data: rdma-ping-9: JKLMNOPQRSTU > server DISCONNECT EVENT... > wait for RDMA_READ_ADV state 9 > cq completion failed status 5 > > Client side: > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > ping data: rdma-ping-0: ABCDEFGHIJKL > ping data: rdma-ping-1: BCDEFGHIJKLM > ping data: rdma-ping-2: CDEFGHIJKLMN > ping data: rdma-ping-3: DEFGHIJKLMNO > ping data: rdma-ping-4: EFGHIJKLMNOP > ping data: rdma-ping-5: FGHIJKLMNOPQ > ping data: rdma-ping-6: GHIJKLMNOPQR > ping data: rdma-ping-7: HIJKLMNOPQRS > ping data: rdma-ping-8: IJKLMNOPQRST > ping data: rdma-ping-9: JKLMNOPQRSTU > cq completion failed status 5 > client DISCONNECT EVENT... > > > Thanks, > Amith > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > Thanks, applied. > > > > iwarp branch: r7964 > > trunk: r7966 > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > This patch resolves a race condition between the receipt of > > > a connection established event and a receive completion from > > > the client. The server no longer goes to connected state but > > > merely waits for the READ_ADV state to begin its looping. This > > > keeps the server from going back to CONNECTED from the later > > > states if the connection established event comes in after the > > > receive completion (i.e. the loop starts). 
> > > > > > Signed-off-by: Boyd Faulkner > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From rdreier at cisco.com Thu Jun 15 15:03:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 15:03:14 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: <4491D7E1.5050504@ichips.intel.com> (Arlin Davis's message of "Thu, 15 Jun 2006 14:57:53 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> <4491D7E1.5050504@ichips.intel.com> Message-ID: Thanks, reproduced it locally. From sean.hefty at intel.com Thu Jun 15 15:04:57 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 Jun 2006 15:04:57 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <449119AE.2010703@voltaire.com> Message-ID: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> >The cma/verbs consumer can't just ignore the event since its qp state is >still RTR which means an attempt to tx replying the rx would fail. In most cases, I would expect that the IB CM will eventually receive the RTU, which will generate an event to the RDMA CM to transition the QP into RTS. This is why I think that the event can safely be ignored. It does however mean that a user cannot send on the QP until the user sees RDMA_CM_EVENT_ESTABLISHED. >I suggest the following design: the CMA would replace the event handler >provided with the qp_init_attr struct with a callback of its own and >keep the original handler/context on a private structure. This sounds like it would work. I don't think that there are any events where the additional delay would matter. 
As an alternative, I don't think that there's any reason why the QP can't be transitioned to RTS when the CM REP is sent. A user just can't post to the send queue until either an RDMA_CM_EVENT_ESTABLISHED, IB_EVENT_COMM_EST, or a completion occurs on the QP. (This doesn't change the fact that the IB CM still needs to know that the connection has been established, or it risks putting the connection into an error state if an RTU is never received.) - Sean From rdreier at cisco.com Thu Jun 15 15:13:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 15:13:12 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: (Roland Dreier's message of "Thu, 15 Jun 2006 15:03:14 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> <4491D7E1.5050504@ichips.intel.com> Message-ID: OK, just a dumb oversight on my part. The change below (already checked in) fixes it for me: --- infiniband/core/uverbs_cmd.c (revision 8055) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -1123,6 +1123,12 @@ ssize_t ib_uverbs_create_qp(struct ib_uv goto err_copy; } + put_pd_read(pd); + put_cq_read(scq); + put_cq_read(rcq); + if (srq) + put_srq_read(srq); + mutex_lock(&file->mutex); list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); mutex_unlock(&file->mutex); From sean.hefty at intel.com Thu Jun 15 15:13:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 Jun 2006 15:13:01 -0700 Subject: [openib-general] [PATCH] backlog ignored when listening on all devs In-Reply-To: <1150402317.6612.8.camel@stevo-desktop> Message-ID: <000101c690c8$de077f10$62268686@amr.corp.intel.com> Roland, can you pick up this patch for 2.6.18? Thanks - committed in 8057. - Sean >If you listen on 0.0.0.0, then the backlog isn't passed down to the >devices because it's not stored in the id_priv struct before calling >cma_listen_on_all(). See cma_list_on_dev() which uses >id_priv->backlog...
> >Signed-off-by: Steve Wise > >---------- > >Index: cma.c >=================================================================== >--- cma.c (revision 7626) >+++ cma.c (working copy) >@@ -1086,6 +1086,7 @@ > if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) > return -EINVAL; > >+ id_priv->backlog = backlog; > if (id->device) { > switch (rdma_node_get_transport(id->device->node_type)) { > case RDMA_TRANSPORT_IB: >@@ -1100,9 +1101,9 @@ > } else > cma_listen_on_all(id_priv); > >- id_priv->backlog = backlog; > return 0; > err: >+ id_priv->backlog = 0; > cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND); > return ret; > } From robert.j.woodruff at intel.com Thu Jun 15 15:16:28 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 15 Jun 2006 15:16:28 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: Message-ID: <000001c690c9$5a03e9f0$010fa8c0@amr.corp.intel.com> Roland wrote, >OK, just a dumb oversight on my part. The change below (already >checked in) fixes it for me: Great thanks for the quick response, woody From rdreier at cisco.com Thu Jun 15 15:34:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 15:34:37 -0700 Subject: [openib-general] [PATCH] backlog ignored when listening on all devs In-Reply-To: <1150402317.6612.8.camel@stevo-desktop> (Steve Wise's message of "Thu, 15 Jun 2006 15:11:57 -0500") References: <1150402317.6612.8.camel@stevo-desktop> Message-ID: OK, I rolled this into the cma patch in for-2.6.18 branch. Unfortunately this means that all of the patches in that branch are rebased, so you'll have to repull my tree if you're tracking it. - R. 
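Steve's backlog patch above follows a common pattern worth making explicit: record the value in the private structure *before* dispatching (so the wildcard-listen path sees it), and roll it back on the error path. A minimal user-space sketch, with invented names (`fake_id_priv`, `fake_rdma_listen`) standing in for the rdma_cm structures:

```c
/* Hypothetical reduction of the rdma_listen() backlog fix. */
struct fake_id_priv {
	int backlog;
};

/* Stand-in for the transport-specific listen (or cma_listen_on_all());
 * 'fail' simulates the error path. */
static int dispatch_listen(struct fake_id_priv *id_priv, int fail)
{
	(void)id_priv;
	return fail ? -1 : 0;
}

int fake_rdma_listen(struct fake_id_priv *id_priv, int backlog, int fail)
{
	id_priv->backlog = backlog;	/* set before dispatch */
	if (dispatch_listen(id_priv, fail)) {
		id_priv->backlog = 0;	/* undo on error */
		return -1;
	}
	return 0;
}
```

The original bug was exactly the ordering: assigning `id_priv->backlog` after the dispatch meant `cma_listen_on_all()` read a stale zero when listening on 0.0.0.0.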
From ralphc at pathscale.com Thu Jun 15 15:34:58 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 15:34:58 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: References: <1150396280.32252.46.camel@brick.pathscale.com> <1150407704.32252.65.camel@brick.pathscale.com> Message-ID: <1150410898.32252.69.camel@brick.pathscale.com> On Thu, 2006-06-15 at 14:56 -0700, Roland Dreier wrote: > Ralph> libmthca uses a single shared page which is created at > Ralph> driver open time. I'm mmaping vmalloc memory created at > Ralph> ibv_create_cq(), qp, srq time so I need a way to return the > Ralph> offset to ipathverbs.so to then pass to mmap(). > > Hmm... it seems simpler to have userspace allocate the memory with > mmap() before the resize_cq call, and then pass that new buffer into > the resize_cq call. That way you don't have a window where the kernel > is putting completions into a buffer that userspace doesn't know about. Perhaps. But this way, the code is the same for kernel and user allocated queues. > Ralph> The new kernel drivers work with the old libibverbs and > Ralph> vice versa since only the cqe entry in struct > Ralph> ibv_resize_cq_resp is used. The reserved entry is only > Ralph> needed to avoid using "packed" structs if struct > Ralph> ibv_resize_cq_resp is included in another struct. > > OK, I guess we're OK, since the kernel isn't checking the size of the > response buffer. old libipathverbs does need to bail out on a new ipath > kernel driver, though, or else you'll get corruption when responses go > off the end of a buffer. Or the new kernel driver needs to handle the old way and the new way. > - R. 
-- Ralph Campbell From ralphc at pathscale.com Thu Jun 15 15:40:54 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 15:40:54 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [1 of 2] Message-ID: <1150411254.32252.76.camel@brick.pathscale.com> Here are the diffs Roland requested for the ipath driver changes to mmap the completion and receive queues into the user library. This isn't quite the final version though since I need to implement QP receive queue resizing and some version checking/handling. Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- src/userspace/libipathverbs/src/verbs.c (revision 8021) +++ src/userspace/libipathverbs/src/verbs.c (working copy) @@ -40,11 +40,14 @@ #include #include -#include +#include #include #include +#include +#include #include "ipathverbs.h" +#include "ipath-abi.h" int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr) @@ -83,11 +86,11 @@ struct ibv_pd *pd; pd = malloc(sizeof *pd); - if(!pd) + if (!pd) return NULL; - if(ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, - &resp, sizeof resp)) { + if (ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, + &resp, sizeof resp)) { free(pd); return NULL; } @@ -142,129 +145,396 @@ struct ibv_comp_channel *channel, int comp_vector) { - struct ibv_cq *cq; - struct ibv_create_cq cmd; - struct ibv_create_cq_resp resp; - int ret; + struct ipath_cq *cq; + struct ibv_create_cq cmd; + struct ipath_create_cq_resp resp; + int ret; + size_t size; cq = malloc(sizeof *cq); if (!cq) return NULL; - ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, cq, - &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, + &cq->ibv_cq, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(cq); return NULL; } - return cq; + size = sizeof(struct ipath_cq_wc) + sizeof(struct ipath_wc) * cqe; + cq->queue = 
mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + context->cmd_fd, resp.offset); + if ((void *) cq->queue == MAP_FAILED) { + free(cq); + return NULL; + } + + pthread_spin_init(&cq->lock, PTHREAD_PROCESS_PRIVATE); + return &cq->ibv_cq; } -int ipath_destroy_cq(struct ibv_cq *cq) +int ipath_resize_cq(struct ibv_cq *ibcq, int cqe) { + struct ipath_cq *cq = to_icq(ibcq); + struct ibv_resize_cq cmd; + struct ipath_resize_cq_resp resp; + size_t size; + int ret; + + pthread_spin_lock(&cq->lock); + /* Unmap the old queue so we can resize it. */ + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + (void) munmap(cq->queue, size); + ret = ibv_cmd_resize_cq_resp(ibcq, cqe, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) { + pthread_spin_unlock(&cq->lock); + return ret; + } + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + cq->queue = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + ibcq->context->cmd_fd, resp.offset); + ret = errno; + pthread_spin_unlock(&cq->lock); + if ((void *) cq->queue == MAP_FAILED) + return ret; + return 0; +} + +int ipath_destroy_cq(struct ibv_cq *ibcq) +{ + struct ipath_cq *cq = to_icq(ibcq); int ret; - ret = ibv_cmd_destroy_cq(cq); + ret = ibv_cmd_destroy_cq(ibcq); if (ret) return ret; + (void) munmap(cq->queue, sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe)); free(cq); return 0; } +int ipath_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +{ + struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *q; + int npolled; + uint32_t tail; + + pthread_spin_lock(&cq->lock); + q = cq->queue; + tail = q->tail; + for (npolled = 0; npolled < ne; ++npolled, ++wc) { + if (tail == q->head) + break; + memcpy(wc, &q->queue[tail], sizeof(*wc)); + if (tail == cq->ibv_cq.cqe) + tail = 0; + else + tail++; + } + q->tail = tail; + pthread_spin_unlock(&cq->lock); + + return npolled; +} + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct 
ibv_qp_init_attr *attr) { - struct ibv_create_qp cmd; - struct ibv_create_qp_resp resp; - struct ibv_qp *qp; - int ret; + struct ibv_create_qp cmd; + struct ipath_create_qp_resp resp; + struct ipath_qp *qp; + int ret; + size_t size; qp = malloc(sizeof *qp); if (!qp) return NULL; - ret = ibv_cmd_create_qp(pd, qp, attr, &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(qp); return NULL; } - return qp; + if (attr->srq) { + qp->rq.size = 0; + qp->rq.max_sge = 0; + qp->rq.rwq = NULL; + } else { + qp->rq.size = attr->cap.max_recv_wr + 1; + qp->rq.max_sge = attr->cap.max_recv_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + qp->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) qp->rq.rwq == MAP_FAILED) { + free(qp); + return NULL; + } + } + + pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &qp->ibv_qp; } -int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr) +{ + struct ibv_query_qp cmd; + + return ibv_cmd_query_qp(qp, attr, attr_mask, init_attr, + &cmd, sizeof cmd); +} + +int ipath_modify_qp(struct ibv_qp *ibqp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask) { - struct ibv_modify_qp cmd; + struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_modify_qp_cmd cmd; + __u64 offset; + size_t size; + int ret; - return ibv_cmd_modify_qp(qp, attr, attr_mask, &cmd, sizeof cmd); + if (attr_mask & IBV_QP_CAP) { + /* Can't resize receive queue if we haved a shared one. */ + if (ibqp->srq) + return EINVAL; + pthread_spin_lock(&qp->rq.lock); + /* Unmap the old queue so we can resize it. 
*/ + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + (void) munmap(qp->rq.rwq, size); + } + cmd.offset_addr = (__u64) &offset; + ret = ibv_cmd_modify_qp(ibqp, attr, attr_mask, + &cmd.ibv_cmd, sizeof cmd); + if (ret) { + if (attr_mask & IBV_QP_CAP) + pthread_spin_unlock(&qp->rq.lock); + return ret; + } + if (attr_mask & IBV_QP_CAP) { + qp->rq.size = attr->cap.max_recv_wr + 1; + qp->rq.max_sge = attr->cap.max_recv_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + qp->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + ibqp->context->cmd_fd, offset); + pthread_spin_unlock(&qp->rq.lock); + /* XXX Now we have no receive queue. */ + if ((void *) qp->rq.rwq == MAP_FAILED) + return errno; + } + return 0; } -int ipath_destroy_qp(struct ibv_qp *qp) +int ipath_destroy_qp(struct ibv_qp *ibqp) { + struct ipath_qp *qp = to_iqp(ibqp); int ret; - ret = ibv_cmd_destroy_qp(qp); + ret = ibv_cmd_destroy_qp(ibqp); if (ret) return ret; + if (qp->rq.rwq) { + size_t size; + + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + (void) munmap(qp->rq.rwq, size); + } free(qp); return 0; } +static int post_recv(struct ipath_rq *rq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ibv_recv_wr *i; + struct ipath_rwq *rwq; + struct ipath_rwqe *wqe; + uint32_t head; + int n, ret; + + pthread_spin_lock(&rq->lock); + rwq = rq->rwq; + head = rwq->head; + for (i = wr; i; i = i->next) { + if ((unsigned) i->num_sge > rq->max_sge) + goto bad; + wqe = get_rwqe_ptr(rq, head); + if (++head >= rq->size) + head = 0; + if (head == rwq->tail) + goto bad; + wqe->wr_id = i->wr_id; + wqe->num_sge = i->num_sge; + for (n = 0; n < wqe->num_sge; n++) + wqe->sg_list[n] = i->sg_list[n]; + rwq->head = head; + } + ret = 0; + goto done; + +bad: + ret = 
-ENOMEM; + if (bad_wr) + *bad_wr = i; +done: + pthread_spin_unlock(&rq->lock); + return ret; +} + +int ipath_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_qp *qp = to_iqp(ibqp); + + return post_recv(&qp->rq, wr, bad_wr); +} + struct ibv_srq *ipath_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr) { - struct ibv_srq *srq; + struct ipath_srq *srq; struct ibv_create_srq cmd; - struct ibv_create_srq_resp resp; + struct ipath_create_srq_resp resp; int ret; + size_t size; srq = malloc(sizeof *srq); - if(srq == NULL) + if (srq == NULL) return NULL; - ret = ibv_cmd_create_srq(pd, srq, attr, &cmd, sizeof cmd, - &resp, sizeof resp); + ret = ibv_cmd_create_srq(pd, &srq->ibv_srq, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(srq); return NULL; } - return srq; + srq->rq.size = attr->attr.max_wr + 1; + srq->rq.max_sge = attr->attr.max_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + srq->rq.rwq = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) srq->rq.rwq == MAP_FAILED) { + free(srq); + return NULL; + } + + pthread_spin_init(&srq->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &srq->ibv_srq; } -int ipath_modify_srq(struct ibv_srq *srq, +int ipath_modify_srq(struct ibv_srq *ibsrq, struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask) { - struct ibv_modify_srq cmd; + struct ipath_srq *srq = to_isrq(ibsrq); + struct ipath_modify_srq_cmd cmd; + __u64 offset; + size_t size; + int ret; - return ibv_cmd_modify_srq(srq, attr, attr_mask, &cmd, sizeof cmd); + if (attr_mask & IBV_SRQ_MAX_WR) { + pthread_spin_lock(&srq->rq.lock); + /* Unmap the old queue so we can resize it. 
*/ + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + (void) munmap(srq->rq.rwq, size); + } + cmd.offset_addr = (__u64) &offset; + ret = ibv_cmd_modify_srq(ibsrq, attr, attr_mask, + &cmd.ibv_cmd, sizeof cmd); + if (ret) { + if (attr_mask & IBV_SRQ_MAX_WR) + pthread_spin_unlock(&srq->rq.lock); + return ret; + } + if (attr_mask & IBV_SRQ_MAX_WR) { + srq->rq.size = attr->max_wr + 1; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + srq->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + ibsrq->context->cmd_fd, offset); + pthread_spin_unlock(&srq->rq.lock); + /* XXX Now we have no receive queue. */ + if ((void *) srq->rq.rwq == MAP_FAILED) + return errno; + } + return 0; } -int ipath_destroy_srq(struct ibv_srq *srq) +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr) { + struct ibv_query_srq cmd; + + return ibv_cmd_query_srq(srq, attr, &cmd, sizeof cmd); +} + +int ipath_destroy_srq(struct ibv_srq *ibsrq) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + size_t size; int ret; - ret = ibv_cmd_destroy_srq(srq); + ret = ibv_cmd_destroy_srq(ibsrq); if (ret) return ret; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + (void) munmap(srq->rq.rwq, size); free(srq); return 0; } +int ipath_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + + return post_recv(&srq->rq, wr, bad_wr); +} + struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) { struct ibv_ah *ah; ah = malloc(sizeof *ah); - if(ah == NULL) + if (ah == NULL) return NULL; - if(ibv_cmd_create_ah(pd, ah, attr)) { + if (ibv_cmd_create_ah(pd, ah, attr)) { free(ah); return NULL; } Index: src/userspace/libipathverbs/src/ipath-abi.h 
=================================================================== --- src/userspace/libipathverbs/src/ipath-abi.h (revision 0) +++ src/userspace/libipathverbs/src/ipath-abi.h (revision 0) @@ -0,0 +1,72 @@ +/* + * Copyright (c) 2006. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. 
+ */ + +#ifndef IPATH_ABI_H +#define IPATH_ABI_H + +#include + +struct ipath_create_cq_resp { + struct ibv_create_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_resize_cq_resp { + struct ibv_resize_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_create_qp_resp { + struct ibv_create_qp_resp ibv_resp; + __u64 offset; +}; + +struct ipath_modify_qp_cmd { + struct ibv_modify_qp ibv_cmd; + __u64 offset_addr; +}; + +struct ipath_create_srq_resp { + struct ibv_create_srq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_modify_srq_cmd { + struct ibv_modify_srq ibv_cmd; + __u64 offset_addr; +}; + +#endif /* IPATH_ABI_H */ Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -86,22 +86,25 @@ .dereg_mr = ipath_dereg_mr, .create_cq = ipath_create_cq, - .poll_cq = ibv_cmd_poll_cq, + .poll_cq = ipath_poll_cq, .req_notify_cq = ibv_cmd_req_notify_cq, .cq_event = NULL, + .resize_cq = ipath_resize_cq, .destroy_cq = ipath_destroy_cq, .create_srq = ipath_create_srq, .modify_srq = ipath_modify_srq, + .query_srq = ipath_query_srq, .destroy_srq = ipath_destroy_srq, - .post_srq_recv = ibv_cmd_post_srq_recv, + .post_srq_recv = ipath_post_srq_recv, .create_qp = ipath_create_qp, + .query_qp = ipath_query_qp, .modify_qp = ipath_modify_qp, .destroy_qp = ipath_destroy_qp, .post_send = ibv_cmd_post_send, - .post_recv = ibv_cmd_post_recv, + .post_recv = ipath_post_recv, .create_ah = ipath_create_ah, .destroy_ah = ipath_destroy_ah, Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.h (working copy) @@ -39,6 +39,7 @@ #include #include +#include #include #include @@ -64,6 +65,81 @@ struct ibv_context ibv_ctx; }; 
+/* + * This structure needs to have the same size and offsets as + * the kernel's ib_wc structure since it is memory mapped. + */ +struct ipath_wc { + uint64_t wr_id; + enum ibv_wc_status status; + enum ibv_wc_opcode opcode; + uint32_t vendor_err; + uint32_t byte_len; + uint32_t imm_data; /* in network byte order */ + uint32_t qp_num; + uint32_t src_qp; + enum ibv_wc_flags wc_flags; + uint16_t pkey_index; + uint16_t slid; + uint8_t sl; + uint8_t dlid_path_bits; + uint8_t port_num; +}; + +struct ipath_cq_wc { + uint32_t head; + uint32_t tail; + struct ipath_wc queue[1]; +}; + +struct ipath_cq { + struct ibv_cq ibv_cq; + struct ipath_cq_wc *queue; + pthread_spinlock_t lock; +}; + +/* + * Receive work request queue entry. + * The size of the sg_list is determined when the QP is created and stored + * in qp->r_max_sge. + */ +struct ipath_rwqe { + uint64_t wr_id; + uint8_t num_sge; + struct ibv_sge sg_list[0]; +}; + +/* + * This structure is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct ipath_rwq { + uint32_t head; /* new requests posted to the head */ + uint32_t tail; /* receives pull requests from here. 
*/ + struct ipath_rwqe wq[0]; +}; + +struct ipath_rq { + struct ipath_rwq *rwq; + pthread_spinlock_t lock; + uint32_t size; + uint32_t max_sge; +}; + +struct ipath_qp { + struct ibv_qp ibv_qp; + struct ipath_rq rq; +}; + +struct ipath_srq { + struct ibv_srq ibv_srq; + struct ipath_rq rq; +}; + #define to_ixxx(xxx, type) \ ((struct ipath_##type *) \ ((void *) ib##xxx - offsetof(struct ipath_##type, ibv_##xxx))) @@ -73,6 +149,34 @@ return to_ixxx(ctx, context); } +static inline struct ipath_cq *to_icq(struct ibv_cq *ibcq) +{ + return to_ixxx(cq, cq); +} + +static inline struct ipath_qp *to_iqp(struct ibv_qp *ibqp) +{ + return to_ixxx(qp, qp); +} + +static inline struct ipath_srq *to_isrq(struct ibv_srq *ibsrq) +{ + return to_ixxx(srq, srq); +} + +/* + * Since struct ipath_rwqe is not a fixed size, we can't simply index into + * struct ipath_rq.wq. This function does the array index computation. + */ +static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, + unsigned n) +{ + return (struct ipath_rwqe *) + ((char *) rq->rwq->wq + + (sizeof(struct ipath_rwqe) + + rq->max_sge * sizeof(struct ibv_sge)) * n); +} + extern int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr); @@ -92,11 +196,19 @@ struct ibv_comp_channel *channel, int comp_vector); +int ipath_resize_cq(struct ibv_cq *cq, int cqe); + int ipath_destroy_cq(struct ibv_cq *cq); +int ipath_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr); + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); @@ -115,8 +227,12 @@ struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask); +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr); + int ipath_destroy_srq(struct ibv_srq *srq); +int 
ipath_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); -- Ralph Campbell From ralphc at pathscale.com Thu Jun 15 15:42:12 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 15:42:12 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [2 of 2] Message-ID: <1150411332.32252.78.camel@brick.pathscale.com> Here are the kernel driver changes that go with the user library changes just posted. Index: src/linux-kernel/infiniband/hw/ipath/ipath_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (working copy) @@ -425,11 +425,12 @@ * @ibqp: the queue pair whose attributes we're modifying * @attr: the new attributes * @attr_mask: the mask of attributes to modify + * @udata: not used by the InfiniPath verbs driver * * Returns 0 on success, otherwise returns an errno. 
*/ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask) + int attr_mask, struct ib_udata *udata) { struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); Index: src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c (working copy) @@ -105,6 +105,54 @@ spin_unlock_irqrestore(&dev->pending_lock, flags); } +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + qp->r_len = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + &qp->r_sg_list[j], &wqe->sg_list[i], + IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + qp->r_len += wqe->sg_list[i].length; + j++; + } + qp->r_sge.sge = qp->r_sg_list[0]; + qp->r_sge.sg_list = qp->r_sg_list + 1; + qp->r_sge.num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. 
*/ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} + /** * ipath_get_rwqe - copy the next RWQE into the QP's RWQE * @qp: the QP @@ -118,73 +166,69 @@ { unsigned long flags; struct ipath_rq *rq; + struct ipath_rwq *wq; struct ipath_srq *srq; struct ipath_rwqe *wqe; + void (*handler)(struct ib_event *, void *); + u32 tail; int ret; - if (!qp->ibqp.srq) { + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; + rq = &srq->rq; + } else { + srq = NULL; + handler = NULL; rq = &qp->r_rq; - spin_lock_irqsave(&rq->lock, flags); + } - if (unlikely(rq->tail == rq->head)) { + spin_lock_irqsave(&rq->lock, flags); + wq = rq->wq; + tail = wq->tail; + do { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); ret = 0; goto bail; } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - goto done; - } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + } while (!wr_id_only && !init_sge(qp, wqe)); + qp->r_wr_id = wqe->wr_id; + wq->tail = tail; - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - spin_lock_irqsave(&rq->lock, flags); - - if (unlikely(rq->tail == rq->head)) { - ret = 0; - goto bail; - } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq->ibsrq.event_handler) { - struct ib_event ev; + ret = 1; + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { + struct ib_event ev; + srq->limit = 0; spin_unlock_irqrestore(&rq->lock, flags); ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); - spin_lock_irqsave(&rq->lock, flags); + handler(&ev, srq->ibsrq.srq_context); + goto bail; } } -done: - ret = 1; + spin_unlock_irqrestore(&rq->lock, flags); bail: - spin_unlock_irqrestore(&rq->lock, flags); return ret; } Index: src/linux-kernel/infiniband/hw/ipath/Makefile =================================================================== --- src/linux-kernel/infiniband/hw/ipath/Makefile (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/Makefile (working copy) @@ -25,6 +25,7 @@ ipath_cq.o \ ipath_keys.o \ ipath_mad.o \ + ipath_mmap.o \ ipath_mr.o \ ipath_qp.o \ ipath_rc.o \ Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c (working copy) @@ -280,11 +280,12 @@ struct ib_recv_wr **bad_wr) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_rwq *wq = qp->r_rq.wq; unsigned long flags; int ret; /* Check that state is OK to post receive. 
*/ - if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK) || !wq) { *bad_wr = wr; ret = -EINVAL; goto bail; @@ -293,59 +294,31 @@ for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; + int i; - if (wr->num_sge > qp->r_rq.max_sge) { + if ((unsigned) wr->num_sge > qp->r_rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&qp->r_rq.lock, flags); - next = qp->r_rq.head + 1; + next = wq->head + 1; if (next >= qp->r_rq.size) next = 0; - if (next == qp->r_rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&qp->r_rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head); + wqe = get_rwqe_ptr(&qp->r_rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(qp->ibqp.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok( - &to_idev(qp->ibqp.device)->lk_table, - &wqe->sg_list[j], &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - qp->r_rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&qp->r_rq.lock, flags); } ret = 0; @@ -694,7 +667,7 @@ ipath_layer_get_lastibcstat(dev->dd) & 0xf]; props->port_cap_flags = dev->port_cap_flags; props->gid_tbl_len = 1; - props->max_msg_sz = 4096; + props->max_msg_sz = 0x80000000; props->pkey_tbl_len = ipath_layer_get_npkeys(dev->dd); props->bad_pkey_cntr = 
ipath_layer_get_cr_errpkey(dev->dd) - dev->z_pkey_violations; @@ -871,7 +844,7 @@ goto bail; } - if (ah_attr->port_num != 1 || + if (ah_attr->port_num < 1 || ah_attr->port_num > pd->device->phys_port_cnt) { ret = ERR_PTR(-EINVAL); goto bail; @@ -883,6 +856,8 @@ goto bail; } + dev->n_ahs_allocated++; + /* ib_create_ah() will initialize ah->ibah. */ ah->attr = *ah_attr; @@ -1137,6 +1112,7 @@ dev->attach_mcast = ipath_multicast_attach; dev->detach_mcast = ipath_multicast_detach; dev->process_mad = ipath_process_mad; + dev->mmap = ipath_mmap; snprintf(dev->node_desc, sizeof(dev->node_desc), IPATH_IDSTR " %s kernel_SMA", system_utsname.nodename); Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (working copy) @@ -577,7 +577,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask); + int attr_mask, struct ib_udata *udata); int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); @@ -636,7 +636,8 @@ struct ib_udata *udata); int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata); int ipath_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr); Index: src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c (revision 0) +++ src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c (revision 0) @@ -0,0 +1,147 @@ +/* + * Copyright (c) 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include + +#include +#include +#include + +#include "ipath_verbs.h" + +/** + * ipath_release_mmap_info - free mmap info structure + * @ref: a pointer to the kref within struct ipath_mmap_info + */ +void ipath_release_mmap_info(struct kref *ref) +{ + struct ipath_mmap_info *ip = + container_of(ref, struct ipath_mmap_info, ref); + + vfree(ip->obj); + kfree(ip); +} + +/* + * open and close keep track of how many times the CQ is mapped, + * to avoid releasing it. 
+ */ +static void ipath_vma_open(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + kref_get(&ip->ref); + ip->mmap_cnt++; +} + +static void ipath_vma_close(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + ip->mmap_cnt--; + kref_put(&ip->ref, ipath_release_mmap_info); +} + +/* + * ipath_vma_nopage - handle a VMA page fault. + */ +static struct page *ipath_vma_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + unsigned long offset = address - vma->vm_start; + struct page *page = NOPAGE_SIGBUS; + void *pageptr; + + if (offset >= ip->size) + goto out; /* out of range */ + + /* + * Convert the vmalloc address into a struct page. + */ + pageptr = (void *)(offset + (vma->vm_pgoff << PAGE_SHIFT)); + page = vmalloc_to_page(pageptr); + + /* Increment the reference count. */ + get_page(page); + if (type) + *type = VM_FAULT_MINOR; +out: + return page; +} + +static struct vm_operations_struct ipath_vm_ops = { + .open = ipath_vma_open, + .close = ipath_vma_close, + .nopage = ipath_vma_nopage, +}; + +/** + * ipath_mmap - create a new mmap region + * @context: the IB user context of the process making the mmap() call + * @vma: the VMA to be initialized + * Return zero if the mmap is OK. Otherwise, return an errno. + */ +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + struct ipath_ibdev *dev = to_idev(context->device); + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; + unsigned long size = vma->vm_end - vma->vm_start; + struct ipath_mmap_info *ip, **pp; + + /* + * Search the device's list of objects waiting for a mmap call. + * Normally, this list is very short since a call to create a + * CQ, QP, or SRQ is soon followed by a call to mmap(). 
+ */ + spin_lock_irq(&dev->pending_lock); + for (pp = &dev->pending_mmaps; (ip = *pp); pp = &ip->next) { + /* Only the creator is allowed to mmap the object */ + if (context != ip->context || (void *) offset != ip->obj) + continue; + /* Don't allow a mmap larger than the object. */ + if (size > ip->size) + break; + + *pp = ip->next; + spin_unlock_irq(&dev->pending_lock); + + vma->vm_ops = &ipath_vm_ops; + vma->vm_flags |= VM_RESERVED; + vma->vm_private_data = ip; + ipath_vma_open(vma); + return 0; + } + spin_unlock_irq(&dev->pending_lock); + return -EINVAL; +} Index: src/linux-kernel/infiniband/hw/ipath/ipath_cq.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_cq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_cq.c (working copy) @@ -41,20 +41,28 @@ * @entry: work completion entry to add * @sig: true if @entry is a solicited entry * - * This may be called with one of the qp->s_lock or qp->r_rq.lock held. + * This may be called with qp->s_lock held. */ void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited) { + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; + u32 head; u32 next; spin_lock_irqsave(&cq->lock, flags); - if (cq->head == cq->ibcq.cqe) + /* + * Note that the head pointer might be writable by user processes. + * Take care to verify it is a sane value. 
+ */ + head = wc->head; + if (head >= (unsigned) cq->ibcq.cqe) { + head = cq->ibcq.cqe; next = 0; - else - next = cq->head + 1; - if (unlikely(next == cq->tail)) { + } else + next = head + 1; + if (unlikely(next == wc->tail)) { spin_unlock_irqrestore(&cq->lock, flags); if (cq->ibcq.event_handler) { struct ib_event ev; @@ -66,8 +74,8 @@ } return; } - cq->queue[cq->head] = *entry; - cq->head = next; + wc->queue[head] = *entry; + wc->head = next; if (cq->notify == IB_CQ_NEXT_COMP || (cq->notify == IB_CQ_SOLICITED && solicited)) { @@ -100,19 +108,20 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) { struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; int npolled; spin_lock_irqsave(&cq->lock, flags); for (npolled = 0; npolled < num_entries; ++npolled, ++entry) { - if (cq->tail == cq->head) + if (wc->tail == wc->head) break; - *entry = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + *entry = wc->queue[wc->tail]; + if (wc->tail >= cq->ibcq.cqe) + wc->tail = 0; else - cq->tail++; + wc->tail++; } spin_unlock_irqrestore(&cq->lock, flags); @@ -159,7 +168,7 @@ { struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_cq *cq; - struct ib_wc *wc; + struct ipath_cq_wc *wc; struct ib_cq *ret; if (entries > ib_ipath_max_cqes) { @@ -172,10 +181,7 @@ goto bail; } - /* - * Need to use vmalloc() if we want to support large #s of - * entries. - */ + /* Allocate the completion queue structure. */ cq = kmalloc(sizeof(*cq), GFP_KERNEL); if (!cq) { ret = ERR_PTR(-ENOMEM); @@ -183,15 +189,54 @@ } /* - * Need to use vmalloc() if we want to support large #s of entries. + * Allocate the completion queue entries and head/tail pointers. + * This is allocated separately so that it can be resized and + * also mapped into user space. + * We need to use vmalloc() in order to support mmap and large + * numbers of entries. 
*/ - wc = vmalloc(sizeof(*wc) * (entries + 1)); + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * entries); if (!wc) { - kfree(cq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_cq; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) wc; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_wc; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_wc; + } + cq->ip = ip; + ip->context = context; + ip->obj = wc; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * entries); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + cq->ip = NULL; + + /* * ib_create_cq() will initialize cq->ibcq except for cq->ibcq.cqe. * The number of entries should be >= the number requested or return * an error. @@ -201,14 +246,18 @@ cq->triggered = 0; spin_lock_init(&cq->lock); tasklet_init(&cq->comptask, send_complete, (unsigned long)cq); - cq->head = 0; - cq->tail = 0; + wc->head = 0; + wc->tail = 0; cq->queue = wc; ret = &cq->ibcq; - dev->n_cqs_allocated++; + goto bail; +free_wc: + vfree(wc); +free_cq: + kfree(cq); bail: return ret; } @@ -228,7 +277,10 @@ tasklet_kill(&cq->comptask); dev->n_cqs_allocated--; - vfree(cq->queue); + if (cq->ip) + kref_put(&cq->ip->ref, ipath_release_mmap_info); + else + vfree(cq->queue); kfree(cq); return 0; @@ -252,7 +304,7 @@ spin_lock_irqsave(&cq->lock, flags); /* * Don't change IB_CQ_NEXT_COMP to IB_CQ_SOLICITED but allow - * any other transitions. + * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2). 
*/ if (cq->notify != IB_CQ_NEXT_COMP) cq->notify = notify; @@ -263,46 +315,87 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); - struct ib_wc *wc, *old_wc; - u32 n; + struct ipath_cq_wc *old_wc = cq->queue; + struct ipath_cq_wc *wc; + u32 head, tail, n; int ret; + /* Don't allow resize if completion queue is mmapped. */ + if (cq->ip && cq->ip->mmap_cnt) { + ret = -EBUSY; + goto bail; + } + /* * Need to use vmalloc() if we want to support large #s of entries. */ - wc = vmalloc(sizeof(*wc) * (cqe + 1)); + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * cqe); if (!wc) { ret = -ENOMEM; goto bail; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + __u64 offset = (__u64) wc; + + ret = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (ret) + goto bail; + } + spin_lock_irq(&cq->lock); - if (cq->head < cq->tail) - n = cq->ibcq.cqe + 1 + cq->head - cq->tail; + /* + * Make sure head and tail are sane since they + * might be user writable. 
+ */ + head = old_wc->head; + if (head > (u32) cq->ibcq.cqe) + head = (u32) cq->ibcq.cqe; + tail = old_wc->tail; + if (tail > (u32) cq->ibcq.cqe) + tail = (u32) cq->ibcq.cqe; + if (head < tail) + n = cq->ibcq.cqe + 1 + head - tail; else - n = cq->head - cq->tail; + n = head - tail; if (unlikely((u32)cqe < n)) { spin_unlock_irq(&cq->lock); vfree(wc); ret = -EOVERFLOW; goto bail; } - for (n = 0; cq->tail != cq->head; n++) { - wc[n] = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + for (n = 0; tail != head; n++) { + wc->queue[n] = old_wc->queue[tail]; + if (tail == (u32) cq->ibcq.cqe) + tail = 0; else - cq->tail++; + tail++; } cq->ibcq.cqe = cqe; - cq->head = n; - cq->tail = 0; - old_wc = cq->queue; + wc->head = n; + wc->tail = 0; cq->queue = wc; spin_unlock_irq(&cq->lock); vfree(old_wc); + if (cq->ip) { + struct ipath_ibdev *dev = to_idev(ibcq->device); + struct ipath_mmap_info *ip = cq->ip; + + ip->obj = wc; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * cqe); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + ret = 0; bail: Index: src/linux-kernel/infiniband/hw/ipath/ipath_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (working copy) @@ -47,66 +47,38 @@ struct ib_recv_wr **bad_wr) { struct ipath_srq *srq = to_isrq(ibsrq); - struct ipath_ibdev *dev = to_idev(ibsrq->device); + struct ipath_rwq *wq; unsigned long flags; int ret; for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; + int i; - if (wr->num_sge > srq->rq.max_sge) { + if ((unsigned) wr->num_sge > srq->rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&srq->rq.lock, flags); - next = srq->rq.head + 1; + wq = srq->rq.wq; + next = wq->head + 1; if (next >= srq->rq.size) next = 0; - if (next 
== srq->rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&srq->rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&srq->rq, srq->rq.head); + wqe = get_rwqe_ptr(&srq->rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(srq->ibsrq.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok(&dev->lk_table, - &wqe->sg_list[j], - &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - srq->rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&srq->rq.lock, flags); } ret = 0; @@ -156,28 +128,67 @@ * Need to use vmalloc() if we want to support large #s of entries. */ srq->rq.size = srq_init_attr->attr.max_wr + 1; - sz = sizeof(struct ipath_sge) * srq_init_attr->attr.max_sge + + srq->rq.max_sge = srq_init_attr->attr.max_sge; + sz = sizeof(struct ib_sge) * srq->rq.max_sge + sizeof(struct ipath_rwqe); - srq->rq.wq = vmalloc(srq->rq.size * sz); + srq->rq.wq = vmalloc(sizeof(struct ipath_rwq) + srq->rq.size * sz); if (!srq->rq.wq) { - kfree(srq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_srq; } /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) srq->rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_rwq; + } + + /* Allocate info for ipath_mmap(). 
*/ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_rwq; + } + srq->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = srq->rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + srq->rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + srq->ip = NULL; + + /* * ib_create_srq() will initialize srq->ibsrq. */ spin_lock_init(&srq->rq.lock); - srq->rq.head = 0; - srq->rq.tail = 0; - srq->rq.max_sge = srq_init_attr->attr.max_sge; + srq->rq.wq->head = 0; + srq->rq.wq->tail = 0; srq->limit = srq_init_attr->attr.srq_limit; + dev->n_srqs_allocated++; + ret = &srq->ibsrq; + goto bail; - dev->n_srqs_allocated++; - +free_rwq: + vfree(srq->rq.wq); +free_srq: + kfree(srq); bail: return ret; } @@ -187,83 +198,143 @@ * @ibsrq: the SRQ to modify * @attr: the new attributes of the SRQ * @attr_mask: indicates which attributes to modify + * @udata: user data for ipathverbs.so */ int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata) { struct ipath_srq *srq = to_isrq(ibsrq); - unsigned long flags; - int ret; + int ret = 0; - if (attr_mask & IB_SRQ_MAX_WR) - if ((attr->max_wr > ib_ipath_max_srq_wrs) || - (attr->max_sge > srq->rq.max_sge)) { + if (attr_mask & IB_SRQ_MAX_WR) { + struct ipath_rwq *owq; + struct ipath_rwq *wq; + struct ipath_rwqe *p; + u32 sz, size, n, head, tail; + + /* Don't allow resize if mmapped */ + if (srq->ip && srq->ip->mmap_cnt) { ret = -EINVAL; goto bail; } - if (attr_mask & IB_SRQ_LIMIT) - if (attr->srq_limit >= srq->rq.size) { + /* + * Check that the requested sizes are below the limits + * and that user/kernel SRQs are only resized by the + * user/kernel. 
+	 */
+	if ((attr->max_wr > ib_ipath_max_srq_wrs) ||
+	    (!udata != !srq->ip) ||
+	    ((attr_mask & IB_SRQ_LIMIT) &&
+	     attr->srq_limit > attr->max_wr) ||
+	    (!(attr_mask & IB_SRQ_LIMIT) &&
+	     srq->limit > attr->max_wr)) {
 		ret = -EINVAL;
 		goto bail;
 	}
-	if (attr_mask & IB_SRQ_MAX_WR) {
-		struct ipath_rwqe *wq, *p;
-		u32 sz, size, n;
-
 		sz = sizeof(struct ipath_rwqe) +
-			attr->max_sge * sizeof(struct ipath_sge);
+			srq->rq.max_sge * sizeof(struct ib_sge);
 		size = attr->max_wr + 1;
-		wq = vmalloc(size * sz);
+		wq = vmalloc(sizeof(struct ipath_rwq) + size * sz);
 		if (!wq) {
 			ret = -ENOMEM;
 			goto bail;
 		}
-		spin_lock_irqsave(&srq->rq.lock, flags);
-		if (srq->rq.head < srq->rq.tail)
-			n = srq->rq.size + srq->rq.head - srq->rq.tail;
+		/*
+		 * Return the address of the RWQ as the offset to mmap.
+		 * See ipath_mmap() for details.
+		 */
+		if (udata) {
+			__u64 offset_addr;
+			__u64 offset = (__u64) wq;
+
+			ret = ib_copy_from_udata(&offset_addr, udata,
+						 sizeof(offset_addr));
+			if (ret) {
+				vfree(wq);
+				goto bail;
+			}
+			udata->outbuf = (void __user *) offset_addr;
+			ret = ib_copy_to_udata(udata, &offset,
+					       sizeof(offset));
+			if (ret) {
+				vfree(wq);
+				goto bail;
+			}
+		}
+
+		spin_lock_irq(&srq->rq.lock);
+		/*
+		 * validate head pointer value and compute
+		 * the number of remaining WQEs.
+		 */
+		owq = srq->rq.wq;
+		head = owq->head;
+		if (head >= srq->rq.size)
+			head = 0;
+		tail = owq->tail;
+		if (tail >= srq->rq.size)
+			tail = 0;
+		n = head;
+		if (n < tail)
+			n += srq->rq.size - tail;
 		else
-			n = srq->rq.head - srq->rq.tail;
-		if (size <= n || size <= srq->limit) {
-			spin_unlock_irqrestore(&srq->rq.lock, flags);
+			n -= tail;
+		if (size <= n) {
+			spin_unlock_irq(&srq->rq.lock);
 			vfree(wq);
 			ret = -EINVAL;
 			goto bail;
 		}
 		n = 0;
-		p = wq;
-		while (srq->rq.tail != srq->rq.head) {
+		p = wq->wq;
+		while (tail != head) {
 			struct ipath_rwqe *wqe;
 			int i;

-			wqe = get_rwqe_ptr(&srq->rq, srq->rq.tail);
+			wqe = get_rwqe_ptr(&srq->rq, tail);
 			p->wr_id = wqe->wr_id;
-			p->length = wqe->length;
 			p->num_sge = wqe->num_sge;
 			for (i = 0; i < wqe->num_sge; i++)
 				p->sg_list[i] = wqe->sg_list[i];
 			n++;
 			p = (struct ipath_rwqe *)((char *) p + sz);
-			if (++srq->rq.tail >= srq->rq.size)
-				srq->rq.tail = 0;
+			if (++tail >= srq->rq.size)
+				tail = 0;
 		}
-		vfree(srq->rq.wq);
 		srq->rq.wq = wq;
 		srq->rq.size = size;
-		srq->rq.head = n;
-		srq->rq.tail = 0;
-		srq->rq.max_sge = attr->max_sge;
-		spin_unlock_irqrestore(&srq->rq.lock, flags);
-	}
+		wq->head = n;
+		wq->tail = 0;
+		if (attr_mask & IB_SRQ_LIMIT)
+			srq->limit = attr->srq_limit;
+		spin_unlock_irq(&srq->rq.lock);

-	if (attr_mask & IB_SRQ_LIMIT) {
-		spin_lock_irqsave(&srq->rq.lock, flags);
-		srq->limit = attr->srq_limit;
-		spin_unlock_irqrestore(&srq->rq.lock, flags);
+		vfree(owq);
+
+		if (srq->ip) {
+			struct ipath_mmap_info *ip = srq->ip;
+			struct ipath_ibdev *dev = to_idev(srq->ibsrq.device);
+
+			ip->obj = wq;
+			ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) +
+					      size * sz);
+			spin_lock_irq(&dev->pending_lock);
+			ip->next = dev->pending_mmaps;
+			dev->pending_mmaps = ip;
+			spin_unlock_irq(&dev->pending_lock);
+		}
+	} else if (attr_mask & IB_SRQ_LIMIT) {
+		spin_lock_irq(&srq->rq.lock);
+		if (attr->srq_limit >= srq->rq.size)
+			ret = -EINVAL;
+		else
+			srq->limit = attr->srq_limit;
+		spin_unlock_irq(&srq->rq.lock);
 	}
-	ret = 0;
bail:
	return ret;
}

@@ -289,7 +360,10 @@
 	struct ipath_ibdev *dev = to_idev(ibsrq->device);

 	dev->n_srqs_allocated--;
-	vfree(srq->rq.wq);
+	if (srq->ip)
+		kref_put(&srq->ip->ref, ipath_release_mmap_info);
+	else
+		vfree(srq->rq.wq);
 	kfree(srq);

 	return 0;

Index: src/linux-kernel/infiniband/hw/ipath/ipath_ud.c
===================================================================
--- src/linux-kernel/infiniband/hw/ipath/ipath_ud.c	(revision 8021)
+++ src/linux-kernel/infiniband/hw/ipath/ipath_ud.c	(working copy)
@@ -35,6 +35,53 @@
 #include "ipath_verbs.h"
 #include "ips_common.h"

+static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe,
+		    u32 *lengthp, struct ipath_sge_state *ss)
+{
+	struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+	int user = to_ipd(qp->ibqp.pd)->user;
+	int i, j, ret;
+	struct ib_wc wc;
+
+	*lengthp = 0;
+	for (i = j = 0; i < wqe->num_sge; i++) {
+		if (wqe->sg_list[i].length == 0)
+			continue;
+		/* Check LKEY */
+		if ((user && wqe->sg_list[i].lkey == 0) ||
+		    !ipath_lkey_ok(&dev->lk_table,
+				   j ? &ss->sg_list[j - 1] : &ss->sge,
+				   &wqe->sg_list[i], IB_ACCESS_LOCAL_WRITE))
+			goto bad_lkey;
+		*lengthp += wqe->sg_list[i].length;
+		j++;
+	}
+	ss->num_sge = j;
+	ret = 1;
+	goto bail;
+
+bad_lkey:
+	wc.wr_id = wqe->wr_id;
+	wc.status = IB_WC_LOC_PROT_ERR;
+	wc.opcode = IB_WC_RECV;
+	wc.vendor_err = 0;
+	wc.byte_len = 0;
+	wc.imm_data = 0;
+	wc.qp_num = qp->ibqp.qp_num;
+	wc.src_qp = 0;
+	wc.wc_flags = 0;
+	wc.pkey_index = 0;
+	wc.slid = 0;
+	wc.sl = 0;
+	wc.dlid_path_bits = 0;
+	wc.port_num = 0;
+	/* Signal solicited completion event. */
+	ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1);
+	ret = 0;
+bail:
+	return ret;
+}
+
 /**
  * ipath_ud_loopback - handle send on loopback QPs
  * @sqp: the QP
@@ -45,6 +92,8 @@
 *
 * This is called from ipath_post_ud_send() to forward a WQE addressed
 * to the same HCA.
+ * Note that the receive interrupt handler may be calling ipath_ud_rcv()
+ * while this is being called.
 */
static void ipath_ud_loopback(struct ipath_qp *sqp,
			      struct ipath_sge_state *ss,
@@ -59,7 +108,11 @@
 	struct ipath_srq *srq;
 	struct ipath_sge_state rsge;
 	struct ipath_sge *sge;
+	struct ipath_rwq *wq;
 	struct ipath_rwqe *wqe;
+	void (*handler)(struct ib_event *, void *);
+	u32 tail;
+	u32 rlen;

 	qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn);
 	if (!qp)
@@ -93,6 +146,13 @@
 		wc->imm_data = 0;
 	}

+	if (wr->num_sge > 1) {
+		rsge.sg_list = kmalloc((wr->num_sge - 1) *
+				       sizeof(struct ipath_sge),
+				       GFP_ATOMIC);
+	} else
+		rsge.sg_list = NULL;
+
 	/*
 	 * Get the next work request entry to find where to put the data.
 	 * Note that it is safe to drop the lock after changing rq->tail
 	 */
 	if (qp->ibqp.srq) {
 		srq = to_isrq(qp->ibqp.srq);
+		handler = srq->ibsrq.event_handler;
 		rq = &srq->rq;
 	} else {
 		srq = NULL;
+		handler = NULL;
 		rq = &qp->r_rq;
 	}
+
 	spin_lock_irqsave(&rq->lock, flags);
-	if (rq->tail == rq->head) {
-		spin_unlock_irqrestore(&rq->lock, flags);
-		dev->n_pkt_drops++;
-		goto done;
+	wq = rq->wq;
+	tail = wq->tail;
+	while (1) {
+		if (unlikely(tail == wq->head)) {
+			spin_unlock_irqrestore(&rq->lock, flags);
+			dev->n_pkt_drops++;
+			goto free_sge;
+		}
+		wqe = get_rwqe_ptr(rq, tail);
+		if (++tail >= rq->size)
+			tail = 0;
+		if (init_sge(qp, wqe, &rlen, &rsge))
+			break;
+		wq->tail = tail;
 	}
 	/* Silently drop packets which are too big. */
-	wqe = get_rwqe_ptr(rq, rq->tail);
-	if (wc->byte_len > wqe->length) {
+	if (wc->byte_len > rlen) {
 		spin_unlock_irqrestore(&rq->lock, flags);
 		dev->n_pkt_drops++;
-		goto done;
+		goto free_sge;
 	}
+	wq->tail = tail;
 	wc->wr_id = wqe->wr_id;
-	rsge.sge = wqe->sg_list[0];
-	rsge.sg_list = wqe->sg_list + 1;
-	rsge.num_sge = wqe->num_sge;
-	if (++rq->tail >= rq->size)
-		rq->tail = 0;
-	if (srq && srq->ibsrq.event_handler) {
+	if (handler) {
 		u32 n;

-		if (rq->head < rq->tail)
-			n = rq->size + rq->head - rq->tail;
+		/*
+		 * validate head pointer value and compute
+		 * the number of remaining WQEs.
+		 */
+		n = wq->head;
+		if (n >= rq->size)
+			n = 0;
+		if (n < tail)
+			n += rq->size - tail;
 		else
-			n = rq->head - rq->tail;
+			n -= tail;
 		if (n < srq->limit) {
 			struct ib_event ev;

@@ -139,12 +214,12 @@
 			ev.device = qp->ibqp.device;
 			ev.element.srq = qp->ibqp.srq;
 			ev.event = IB_EVENT_SRQ_LIMIT_REACHED;
-			srq->ibsrq.event_handler(&ev,
-						 srq->ibsrq.srq_context);
+			handler(&ev, srq->ibsrq.srq_context);
 		} else
 			spin_unlock_irqrestore(&rq->lock, flags);
 	} else
 		spin_unlock_irqrestore(&rq->lock, flags);
+
 	ah_attr = &to_iah(wr->wr.ud.ah)->attr;
 	if (ah_attr->ah_flags & IB_AH_GRH) {
 		ipath_copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh));
@@ -195,6 +270,8 @@
 	ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc,
		       wr->send_flags & IB_SEND_SOLICITED);

+free_sge:
+	kfree(rsge.sg_list);
done:
 	if (atomic_dec_and_test(&qp->refcount))
 		wake_up(&qp->wait);
@@ -432,13 +509,9 @@
 	int opcode;
 	u32 hdrsize;
 	u32 pad;
-	unsigned long flags;
 	struct ib_wc wc;
 	u32 qkey;
 	u32 src_qp;
-	struct ipath_rq *rq;
-	struct ipath_srq *srq;
-	struct ipath_rwqe *wqe;
 	u16 dlid;
 	int header_in_data;
@@ -546,19 +619,10 @@
 	/*
 	 * Get the next work request entry to find where to put the data.
-	 * Note that it is safe to drop the lock after changing rq->tail
-	 * since ipath_post_receive() won't fill the empty slot.
 	 */
-	if (qp->ibqp.srq) {
-		srq = to_isrq(qp->ibqp.srq);
-		rq = &srq->rq;
-	} else {
-		srq = NULL;
-		rq = &qp->r_rq;
-	}
-	spin_lock_irqsave(&rq->lock, flags);
-	if (rq->tail == rq->head) {
-		spin_unlock_irqrestore(&rq->lock, flags);
+	if (qp->r_reuse_sge)
+		qp->r_reuse_sge = 0;
+	else if (!ipath_get_rwqe(qp, 0)) {
 		/*
 		 * Count VL15 packets dropped due to no receive buffer.
 		 * Otherwise, count them as buffer overruns since usually,
@@ -572,39 +636,11 @@
 		goto bail;
 	}
 	/* Silently drop packets which are too big.
 	 */
-	wqe = get_rwqe_ptr(rq, rq->tail);
-	if (wc.byte_len > wqe->length) {
-		spin_unlock_irqrestore(&rq->lock, flags);
+	if (wc.byte_len > qp->r_len) {
+		qp->r_reuse_sge = 1;
 		dev->n_pkt_drops++;
 		goto bail;
 	}
-	wc.wr_id = wqe->wr_id;
-	qp->r_sge.sge = wqe->sg_list[0];
-	qp->r_sge.sg_list = wqe->sg_list + 1;
-	qp->r_sge.num_sge = wqe->num_sge;
-	if (++rq->tail >= rq->size)
-		rq->tail = 0;
-	if (srq && srq->ibsrq.event_handler) {
-		u32 n;
-
-		if (rq->head < rq->tail)
-			n = rq->size + rq->head - rq->tail;
-		else
-			n = rq->head - rq->tail;
-		if (n < srq->limit) {
-			struct ib_event ev;
-
-			srq->limit = 0;
-			spin_unlock_irqrestore(&rq->lock, flags);
-			ev.device = qp->ibqp.device;
-			ev.element.srq = qp->ibqp.srq;
-			ev.event = IB_EVENT_SRQ_LIMIT_REACHED;
-			srq->ibsrq.event_handler(&ev,
-						 srq->ibsrq.srq_context);
-		} else
-			spin_unlock_irqrestore(&rq->lock, flags);
-	} else
-		spin_unlock_irqrestore(&rq->lock, flags);
 	if (has_grh) {
 		ipath_copy_sge(&qp->r_sge, &hdr->u.l.grh,
			       sizeof(struct ib_grh));
@@ -613,6 +649,7 @@
 		ipath_skip_sge(&qp->r_sge, sizeof(struct ib_grh));
 	ipath_copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh));
+	wc.wr_id = qp->r_wr_id;
 	wc.status = IB_WC_SUCCESS;
 	wc.opcode = IB_WC_RECV;
 	wc.vendor_err = 0;

-- 
Ralph Campbell

From ardavis at ichips.intel.com  Thu Jun 15 16:04:40 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Thu, 15 Jun 2006 16:04:40 -0700
Subject: [openib-general] Processes not exiting on SVN7946
In-Reply-To:
References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408>
	<4491D195.8030106@ichips.intel.com> <4491D7E1.5050504@ichips.intel.com>
Message-ID: <4491E788.3000609@ichips.intel.com>

Roland Dreier wrote:

>OK, just a dumb oversight on my part.  The change below (already
The change below (already >checked in) fixes it for me: > >--- infiniband/core/uverbs_cmd.c (revision 8055) >+++ infiniband/core/uverbs_cmd.c (working copy) >@@ -1123,6 +1123,12 @@ ssize_t ib_uverbs_create_qp(struct ib_uv > goto err_copy; > } > >+ put_pd_read(pd); >+ put_cq_read(scq); >+ put_cq_read(rcq); >+ if (srq) >+ put_srq_read(srq); >+ > mutex_lock(&file->mutex); > list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); > mutex_unlock(&file->mutex); > > > Works for me too. Thanks! -arlin From tziporet at mellanox.co.il Fri Jun 16 01:54:37 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Fri, 16 Jun 2006 11:54:37 +0300 Subject: [openib-general] OFED 1.0 - Official Release Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA724C@mtlexch01.mtl.com> I am happy to announce that OFED 1.0 Official Release is now available. The release can be found under: https://openib.org/svn/gen2/branches/1.0/ofed/releases/ And later today it will be on the OpenFabrics download page: http://www.openfabrics.org/downloads.html. This is the first release that was done in a joint effort of the following companies: * Cisco * SilverStorm * Voltaire * QLogic * Intel * Mellanox Technologies I wish to thank all who contributed to the success of this release. Tziporet ======================================================================== ======= Release summary: The OFED software package is composed of several software modules intended for use on a computer cluster constructed as an InfiniBand network. 
The OFED package contains the following components:

o OpenFabrics core and ULPs:
	- HCA drivers (mthca, ipath)
	- core
	- Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Host,
	  RDS and uDAPL
o OpenFabrics utilities:
	- OpenSM: InfiniBand Subnet Manager
	- Diagnostic tools
	- Performance tests
o MPI:
	- OSU MPI stack supporting the InfiniBand interface
	- Open MPI stack supporting the InfiniBand interface
	- MPI benchmark tests (OSU BW/LAT, Pallas, Presta)
o Sources of all software modules (under conditions mentioned in the
  modules' LICENSE files)
o Documentation

Notes:
1. SDP and RDS are in technology preview state.
2. The SRP Initiator and Open MPI are in beta state.
3. All other OFED components are in production state.

Supported Platforms and Operating Systems

CPU architectures:
* x86_64
* x86
* ia64
* ppc64

Linux Operating Systems:
* RedHat EL4 up2:	2.6.9-22.ELsmp
* RedHat EL4 up3:	2.6.9-34.ELsmp
* Fedora C4:		2.6.11-1.1369_FC4
* SLES10 RC2:		2.6.16.16-1.6-smp (or RC 2.5 2.6.16.14-6-smp)
* SLES10 RC1:		2.6.16.14-6-smp
* SUSE 10 Pro:		2.6.13-15-smp
* kernel.org:		2.6.16.x

HCAs Supported

Mellanox HCAs:
- InfiniHost
- InfiniHost III Ex (both modes: with memory and MemFree)
- InfiniHost III Lx

Both SDR and DDR mode of the InfiniHost III family are supported.
For official FW versions please see:
http://www.mellanox.com/support/firmware_table.php

Qlogic HCAs:
- QHT6040 (PathScale InfiniPath HT-460)
- QHT6140 (PathScale InfiniPath HT-465)
- QLE6140 (PathScale InfiniPath PE-880)

Switches Supported

This release was tested with switches and gateways provided by the
following companies:
- Cisco
- Voltaire
- SilverStorm
- Flextronics

Attached are the release notes.

Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: OFED_release_notes.txt
URL: 

From zhushisongzhu at yahoo.com  Fri Jun 16 03:05:47 2006
From: zhushisongzhu at yahoo.com (zhu shi song)
Date: Fri, 16 Jun 2006 03:05:47 -0700 (PDT)
Subject: [openib-general] OFED 1.0 - Official Release (Tziporet Koren)
In-Reply-To:
Message-ID: <20060616100547.13864.qmail@web36915.mail.mud.yahoo.com>

> Notes:
>
> 1. SDP and RDS are in technology preview state.
>
> 2. The SRP Initiator and Open MPI are in beta state.
>
> 3. All other OFED components are in production state.

I'm sorry to see that SDP is not in production state. SDP is very
important for our application, and we are waiting for it to become
mature enough to use in our product. Do you have a schedule for when
SDP will work reliably (in particular, supporting many large numbers
of concurrent connections, just like TCP)? I would very much
appreciate being able to test the new SDP before the end of June.
tks
zhu

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

From or.gerlitz at gmail.com  Fri Jun 16 03:51:39 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 16 Jun 2006 12:51:39 +0200
Subject: [openib-general] design for communication established affiliated
	asynchronous event handling
In-Reply-To:
References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com>
Message-ID: <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com>

On 6/15/06, James Lentini wrote:
> ib_cm_establish() doesn't emulate an RTU reception. It generates an
> IB_CM_USER_ESTABLISHED event (not an IB_CM_RTU_RECEIVED event). The
> CMA's cma_ib_handler() doesn't recognize a IB_CM_USER_ESTABLISHED
> event. The QP's state will not be moved to RTS.

This is what I was suspecting; Sean, can you confirm that? If it does
not emulate RTU reception, then what does it do?

> Consumers don't actually have to queue the completions, they have to
> defer posting sends (either in response to the recvs or otherwise)
> until the QP moves to RTS.
> Could the implementations queue up the
> requests for the consumers?

No. The CM/CMA are not in charge of the consumer CQ, so there is no way
for them to queue those completions. In any case, I think it is wrong
for a lower layer to queue completions. This "race" exists by IB's
nature (the RTU goes to QP1 while the data goes to the user's QP, and
the two QPs are totally unrelated), so if you want production quality
with IB you need to handle this case in your code, as others do.

> Strictly speaking, IB requires an error to be generated (C10-29 in the
> IBTA spec. vol 1, page 456). Still, it would be nice if consumers
> didn't have to be worry about this issue.

What do you mean by an error? This async event happens all the time;
you can't fail the establishment just because it happened. I don't have
access to the spec right now, so I can't say what I understand from the
section you have pointed to.

Or.

From or.gerlitz at gmail.com  Fri Jun 16 03:54:37 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 16 Jun 2006 12:54:37 +0200
Subject: [openib-general] design for communication established affiliated
	asynchronous event handling
In-Reply-To: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com>
References: <449119AE.2010703@voltaire.com>
	<000001c690c7$bd5231d0$62268686@amr.corp.intel.com>
Message-ID: <15ddcffd0606160354q516ffdccj14c721bcb60d254a@mail.gmail.com>

On 6/16/06, Sean Hefty wrote:
>>The cma/verbs consumer can't just ignore the event since its qp state is
>>still RTR which means an attempt to tx replying the rx would fail.

> In most cases, I would expect that the IB CM will eventually receive the RTU,
> which will generate an event to the RDMA CM to transition the QP into RTS.

But we want an IB stack and a set of ULPs that work in production, so
they also need to handle irregular cases... e.g. when the RTU is lost
over and over.
Or

From swise at opengridcomputing.com  Fri Jun 16 06:42:35 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 16 Jun 2006 08:42:35 -0500
Subject: [openib-general] ucma into kernel.org
Message-ID: <1150465355.29508.4.camel@stevo-desktop>

Hey Roland/Sean,

Will the ucma make it into 2.6.18? I notice it's not in Roland's
for-2.6.18 tree right now.

Thanks,

Steve.

From halr at voltaire.com  Fri Jun 16 07:12:10 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 Jun 2006 10:12:10 -0400
Subject: [openib-general] [PATCH] osmtest/osmtest.c: Start enabling link
	records
Message-ID: <1150467129.4506.97759.camel@hal.voltaire.com>

osmtest/osmtest.c: Start enabling link records

Enable obtaining SA LinkRecords, writing them to the database, and
parsing them, while ignoring their contents for the time being.

Signed-off-by: Hal Rosenstock

Index: osmtest/osmtest.c
===================================================================
--- osmtest/osmtest.c	(revision 8064)
+++ osmtest/osmtest.c	(working copy)
@@ -138,6 +138,10 @@ typedef enum _osmtest_token_val
   OSMTEST_TOKEN_RESP_TIME_VAL,
   OSMTEST_TOKEN_ERR_THRESHOLD,
   OSMTEST_TOKEN_MTU,
+  OSMTEST_TOKEN_FROMLID,
+  OSMTEST_TOKEN_FROMPORTNUM,
+  OSMTEST_TOKEN_TOPORTNUM,
+  OSMTEST_TOKEN_TOLID,
   OSMTEST_TOKEN_UNKNOWN
} osmtest_token_val_t;

@@ -213,6 +217,10 @@ const osmtest_token_t token_array[] = {
   {OSMTEST_TOKEN_RESP_TIME_VAL, 15, "resp_time_value"},
   {OSMTEST_TOKEN_ERR_THRESHOLD, 15, "error_threshold"},
   {OSMTEST_TOKEN_MTU, 3, "MTU"},	/* must be after the other mtu... tokens. */
+  {OSMTEST_TOKEN_FROMLID, 8, "from_lid"},
+  {OSMTEST_TOKEN_FROMPORTNUM, 13, "from_port_num"},
+  {OSMTEST_TOKEN_TOPORTNUM, 11, "to_port_num"},
+  {OSMTEST_TOKEN_TOLID, 6, "to_lid"},
   {OSMTEST_TOKEN_UNKNOWN, 0, ""}	/* must be last entry */
};

@@ -1962,9 +1970,6 @@ osmtest_write_node_info( IN osmtest_t *
   return ( status );
}

-#if 0
-/* HACK: we do not support link records for now. */
-
 /**********************************************************************
 **********************************************************************/
static ib_api_status_t
@@ -2076,7 +2081,6 @@ osmtest_write_all_link_recs( IN osmtest_
   OSM_LOG_EXIT( &p_osmt->log );
   return ( status );
}
-#endif

 /**********************************************************************
 **********************************************************************/

@@ -2727,11 +2731,9 @@ osmtest_create_inventory_file( IN osmtes
     goto Exit;
   }

-#if 0
   status = osmtest_write_all_link_recs( p_osmt, fh );
   if( status != IB_SUCCESS )
     goto Exit;
-#endif

   fclose( fh );

@@ -6114,6 +6116,94 @@ osmtest_parse_path( IN osmtest_t * const
 /**********************************************************************
 **********************************************************************/
static ib_api_status_t
+osmtest_parse_link( IN osmtest_t * const p_osmt,
+                    IN FILE * const fh,
+                    IN OUT uint32_t * const p_line_num )
+{
+  ib_api_status_t status = IB_SUCCESS;
+  uint32_t offset;
+  char line[OSMTEST_MAX_LINE_LEN];
+  boolean_t done = FALSE;
+  const osmtest_token_t *p_tok;
+
+  OSM_LOG_ENTER( &p_osmt->log, osmtest_parse_link);
+
+  /*
+   * Parse the inventory file and create the database.
+   */
+  while( !done )
+  {
+    if( fgets( line, OSMTEST_MAX_LINE_LEN, fh ) == NULL )
+    {
+      /*
+       * End of file in the middle of a definition.
+       */
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_parse_link: ERR 012A: "
+               "Unexpected end of file\n" );
+      status = IB_ERROR;
+      goto Exit;
+    }
+
+    ++*p_line_num;
+
+    /*
+     * Skip whitespace
+     */
+    offset = 0;
+    if( !str_skip_white( line, &offset ) )
+      continue;                 /* whole line was whitespace */
+
+    p_tok = str_get_token( &line[offset] );
+    if( p_tok == NULL )
+    {
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_parse_link: ERR 012B: "
+               "Ignoring line %u with unknown token: %s\n",
+               *p_line_num, &line[offset] );
+      continue;
+    }
+
+    if( osm_log_is_active( &p_osmt->log, OSM_LOG_DEBUG ) )
+    {
+      osm_log( &p_osmt->log, OSM_LOG_DEBUG,
+               "osmtest_parse_link: "
+               "Found '%s' (line %u)\n", p_tok->str, *p_line_num );
+    }
+
+    str_skip_token( line, &offset );
+
+    switch ( p_tok->val )
+    {
+    case OSMTEST_TOKEN_FROMLID:
+    case OSMTEST_TOKEN_FROMPORTNUM:
+    case OSMTEST_TOKEN_TOPORTNUM:
+    case OSMTEST_TOKEN_TOLID:
+      /* For now */
+      break;
+
+    case OSMTEST_TOKEN_END:
+      done = TRUE;
+      break;
+
+    default:
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_parse_link: ERR 012C: "
+               "Ignoring line %u with unknown token: %s\n",
+               *p_line_num, &line[offset] );
+      break;
+    }
+  }
+
+ Exit:
+  OSM_LOG_EXIT( &p_osmt->log );
+  return ( status );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static ib_api_status_t
 osmtest_create_db( IN osmtest_t * const p_osmt )
{
   FILE *fh;
@@ -6182,6 +6272,10 @@ osmtest_create_db( IN osmtest_t * const
       status = osmtest_parse_path( p_osmt, fh, &line_num );
       break;

+    case OSMTEST_TOKEN_DEFINE_LINK:
+      status = osmtest_parse_link( p_osmt, fh, &line_num );
+      break;
+
     default:
       osm_log( &p_osmt->log, OSM_LOG_ERROR,
                "osmtest_create_db: ERR 0132: "

From swise at opengridcomputing.com  Fri Jun 16 07:20:31 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 16 Jun 2006 09:20:31 -0500
Subject: [openib-general] [PATCH v2 4/7] AMSO1100 Memory
Management. In-Reply-To: <1150128349.22704.20.camel@trinity.ogc.int> References: <20060607200646.9259.24588.stgit@stevo-desktop> <20060607200655.9259.90768.stgit@stevo-desktop> <20060608011744.1a66e85a.akpm@osdl.org> <1150128349.22704.20.camel@trinity.ogc.int> Message-ID: <1150467631.29508.11.camel@stevo-desktop> On Mon, 2006-06-12 at 11:05 -0500, Tom Tucker wrote: > On Thu, 2006-06-08 at 01:17 -0700, Andrew Morton wrote: > > On Wed, 07 Jun 2006 15:06:55 -0500 > > Steve Wise wrote: > > > > > > > > +void c2_free(struct c2_alloc *alloc, u32 obj) > > > +{ > > > + spin_lock(&alloc->lock); > > > + clear_bit(obj, alloc->table); > > > + spin_unlock(&alloc->lock); > > > +} > > > > The spinlock is unneeded here. > > Good point. > > > > > > > What does all the code in this file do, anyway? It looks totally generic > > (and hence inappropriate for drivers/infiniband/hw/amso1100/) and somewhat > > similar to idr trees, perhaps. > > > > We mimicked the mthca driver. It may be code that should be replaced > with Linux core services for new drivers. We'll investigate. > The code in this file implements 2 sets of services: 1) allocating unique qp identifiers (type integer). This is the c2_alloc struct and functions. 2) maintaining a sparsely allocated array of ptrs indexed by the qp identifier. This allows for quick mapping to the qp structure ptr given the qp identifier. This is the c2_array struct and functions. I believe I can use an IDR tree to provide both of these services. Steve. 
From jlentini at netapp.com Fri Jun 16 08:15:46 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 16 Jun 2006 11:15:46 -0400 (EDT) Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: On Thu, 15 Jun 2006, Sean Hefty wrote: > >The cma/verbs consumer can't just ignore the event since its qp state is > >still RTR which means an attempt to tx replying the rx would fail. > > In most cases, I would expect that the IB CM will eventually receive the RTU, > which will generate an event to the RDMA CM to transition the QP into RTS. This > is why I think that the event can safely be ignored. It does however mean that > a user cannot send on the QP until the user sees RDMA_CM_EVENT_ESTABLISHED. > > >I suggest the following design: the CMA would replace the event handler > >provided with the qp_init_attr struct with a callback of its own and > >keep the original handler/context on a private structure. > > This sounds like it would work. I don't think that there are any events where > the additional delay would matter. > > As an alternative, I don't think that there's any reason why the QP > can't be transition to RTS when the CM REP is sent. I like this idea. It simplifies how ULPs handle this issue. Are there any spec. compliance issues with this? > A user just can't post to the send queue until either an > RDMA_CM_EVENT_ESTABLISHED, IB_EVENT_COMM_EST, or a completion occurs > on the QP. (This doesn't change the fact that the IB CM still needs > to know that the connection has been established, or it risks > putting the connection into an error state if an RTU is never > received.) If the passive side CM doesn't receive an RTU, the passive side CM should retransmit the REP. At least that is how I read 12.9.8.6 "Timeouts and Retries" in the IBTA spec. 
I can't find where this happens in the code. Did I miss it? From swise at opengridcomputing.com Fri Jun 16 08:23:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 Jun 2006 10:23:31 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: References: Message-ID: <1150471411.29508.17.camel@stevo-desktop> On Fri, 2006-06-16 at 11:20 -0400, amith rajith mamidala wrote: > Hi Steve, > > The rping also doesn't exit after printing these error messages. Is this > expected? > It should exit! :-( Maybe rping is not acking all the CM or Async events? Or we've got a bug in our refcnts on the iw_cm_ids in the kernel. Can you get a gdb stack trace when its stalled? And if you kdb, a kernel mode stack trace of the same thread would be nice too... What systems/distros/etc are you running this on? Thanks, Stevo. > Thanks, > Amith > > On Thu, 15 Jun 2006, Steve Wise wrote: > > > This is the normal output for rping... > > > > The status error on the completion is 5 (FLUSHED), which is normal. > > > > Steve. > > > > > > On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > > > Hi, > > > > > > With the latest rping code (Revision: 8055) I am still able to see this > > > race condition. > > > > > > server side: > > > > > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > > > server ping data: rdma-ping-0: ABCDEFGHIJKL > > > server ping data: rdma-ping-1: BCDEFGHIJKLM > > > server ping data: rdma-ping-2: CDEFGHIJKLMN > > > server ping data: rdma-ping-3: DEFGHIJKLMNO > > > server ping data: rdma-ping-4: EFGHIJKLMNOP > > > server ping data: rdma-ping-5: FGHIJKLMNOPQ > > > server ping data: rdma-ping-6: GHIJKLMNOPQR > > > server ping data: rdma-ping-7: HIJKLMNOPQRS > > > server ping data: rdma-ping-8: IJKLMNOPQRST > > > server ping data: rdma-ping-9: JKLMNOPQRSTU > > > server DISCONNECT EVENT... 
> > > wait for RDMA_READ_ADV state 9 > > > cq completion failed status 5 > > > > > > Client side: > > > > > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > > > ping data: rdma-ping-0: ABCDEFGHIJKL > > > ping data: rdma-ping-1: BCDEFGHIJKLM > > > ping data: rdma-ping-2: CDEFGHIJKLMN > > > ping data: rdma-ping-3: DEFGHIJKLMNO > > > ping data: rdma-ping-4: EFGHIJKLMNOP > > > ping data: rdma-ping-5: FGHIJKLMNOPQ > > > ping data: rdma-ping-6: GHIJKLMNOPQR > > > ping data: rdma-ping-7: HIJKLMNOPQRS > > > ping data: rdma-ping-8: IJKLMNOPQRST > > > ping data: rdma-ping-9: JKLMNOPQRSTU > > > cq completion failed status 5 > > > client DISCONNECT EVENT... > > > > > > > > > Thanks, > > > Amith > > > > > > > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > > > > > Thanks, applied. > > > > > > > > iwarp branch: r7964 > > > > trunk: r7966 > > > > > > > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > > > This patch resolves a race condition between the receipt of > > > > > a connection established event and a receive completion from > > > > > the client. The server no longer goes to connected state but > > > > > merely waits for the READ_ADV state to begin its looping. This > > > > > keeps the server from going back to CONNECTED from the later > > > > > states if the connection established event comes in after the > > > > > receive completion (i.e. the loop starts). 
> > > > > > > > > > Signed-off-by: Boyd Faulkner > > > > > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > From jlentini at netapp.com Fri Jun 16 08:25:06 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 16 Jun 2006 11:25:06 -0400 (EDT) Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com> References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com> <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com> Message-ID: On Fri, 16 Jun 2006, Or Gerlitz wrote: > On 6/15/06, James Lentini wrote: > > ib_cm_establish() doesn't emulate an RTU reception. It generates an > > IB_CM_USER_ESTABLISHED event (not an IB_CM_RTU_RECEIVED event). The > > CMA's cma_ib_handler() doesn't recognize a IB_CM_USER_ESTABLISHED > > event. The QP's state will not be moved to RTS. > > This is what i was suspecting, Sean can you confirm that? if it does > not emulate RTU > reception, than what it does do? > > > Consumers don't actually have to queue the completions, they have to > > defer posting sends (either in response to the recvs or otherwise) > > until the QP moves to RTS. Could the implementations queue up the > > requests for the consumers? > > nope the CM/CMA are not in charge of the consumer CQ, so there is no > way for them to queue those completions and anyway, i think its I was refering to requests, not completions. In any event, I like Sean's idea of moving the QP to RTS when a REP is sent better. 
> wrong for lower layer to queue completions, this "race" exists by > IB's nature (since the RTU goes to QP1 and the data to the user's QP > and the two QPs are totally unrelated) so if you want to have > production with IB you need to handle this case in your code, as > others do. Agreed. > > Strictly speaking, IB requires an error to be generated (C10-29 in > > the IBTA spec. vol 1, page 456). Still, it would be nice if > > consumers didn't have to be worry about this issue. > > What do you mean by error, this async event happens all the time, > you can't error the establishment just b/c it happend. I don't have > access now to the spec, so i can't say what i understand from the > section you have pointed to. Again, I was refering to requests, not completions. From mamidala at cse.ohio-state.edu Fri Jun 16 08:20:29 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Fri, 16 Jun 2006 11:20:29 -0400 (EDT) Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: <1150409027.6612.20.camel@stevo-desktop> Message-ID: Hi Steve, The rping also doesn't exit after printing these error messages. Is this expected? Thanks, Amith On Thu, 15 Jun 2006, Steve Wise wrote: > This is the normal output for rping... > > The status error on the completion is 5 (FLUSHED), which is normal. > > Steve. > > > On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > > Hi, > > > > With the latest rping code (Revision: 8055) I am still able to see this > > race condition. 
> > > > server side: > > > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > > server ping data: rdma-ping-0: ABCDEFGHIJKL > > server ping data: rdma-ping-1: BCDEFGHIJKLM > > server ping data: rdma-ping-2: CDEFGHIJKLMN > > server ping data: rdma-ping-3: DEFGHIJKLMNO > > server ping data: rdma-ping-4: EFGHIJKLMNOP > > server ping data: rdma-ping-5: FGHIJKLMNOPQ > > server ping data: rdma-ping-6: GHIJKLMNOPQR > > server ping data: rdma-ping-7: HIJKLMNOPQRS > > server ping data: rdma-ping-8: IJKLMNOPQRST > > server ping data: rdma-ping-9: JKLMNOPQRSTU > > server DISCONNECT EVENT... > > wait for RDMA_READ_ADV state 9 > > cq completion failed status 5 > > > > Client side: > > > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > > ping data: rdma-ping-0: ABCDEFGHIJKL > > ping data: rdma-ping-1: BCDEFGHIJKLM > > ping data: rdma-ping-2: CDEFGHIJKLMN > > ping data: rdma-ping-3: DEFGHIJKLMNO > > ping data: rdma-ping-4: EFGHIJKLMNOP > > ping data: rdma-ping-5: FGHIJKLMNOPQ > > ping data: rdma-ping-6: GHIJKLMNOPQR > > ping data: rdma-ping-7: HIJKLMNOPQRS > > ping data: rdma-ping-8: IJKLMNOPQRST > > ping data: rdma-ping-9: JKLMNOPQRSTU > > cq completion failed status 5 > > client DISCONNECT EVENT... > > > > > > Thanks, > > Amith > > > > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > > > Thanks, applied. > > > > > > iwarp branch: r7964 > > > trunk: r7966 > > > > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > > This patch resolves a race condition between the receipt of > > > > a connection established event and a receive completion from > > > > the client. The server no longer goes to connected state but > > > > merely waits for the READ_ADV state to begin its looping. This > > > > keeps the server from going back to CONNECTED from the later > > > > states if the connection established event comes in after the > > > > receive completion (i.e. the loop starts). 
> > > > > > > > Signed-off-by: Boyd Faulkner > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From Thomas.Talpey at netapp.com Fri Jun 16 08:29:27 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 16 Jun 2006 11:29:27 -0400 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <6.2.0.14.2.20060615104459.06451e28@esmail.cup.hp.com> References: <7.0.1.0.2.20060606131933.04267008@netapp.com> <6.2.0.14.2.20060615104459.06451e28@esmail.cup.hp.com> Message-ID: <7.0.1.0.2.20060616112445.042523e0@netapp.com> Mike, I am not arguing to change the standard. I am simply saying I do not want to be a victim of the default. It is my belief that very few upper layer programmers are aware of this, btw. The Linux NFS/RDMA upper layer implementation already deals with the issue, as I mentioned. It would certainly welcome a higher available IRD on Mellanox hardware, however. Thanks for your comments. Tom. At 01:55 PM 6/15/2006, Michael Krause wrote: >As one of the authors of IB and iWARP, I can say that both Roland's and Todd's responses are correct and the intent of the specifications. The number of outstanding RDMA Reads is bounded, and that bound is communicated during session establishment. The ULP can choose to be aware of this requirement (certainly when we wrote iSER and DA we were well aware of the requirement, and we documented it as such in the ULP specs) and track it from above so that it does not see a stall, or it can stay ignorant and deal with the stall as a result. This is a ULP choice and has been intentionally done that way so that the hardware can be kept as simple and as low-cost as possible while meeting the breadth of ULP needs that were used to develop these technologies. 
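The bound described above is exchanged in the CM REQ/REP handshake: each side advertises its initiator depth and responder resources, and the server clamps the client's values against its own CA limits by taking the MIN of each pair. A minimal sketch of that arithmetic follows; the struct and helper names are illustrative, not a real CM API:

```c
#include <assert.h>

/* Per-side limits a CA advertises for RDMA Reads (illustrative). */
struct rd_limits {
    int initiator_depth;     /* reads this side may have in flight as requestor */
    int responder_resources; /* reads this side can service as responder        */
};

static int min_i(int a, int b) { return a < b ? a : b; }

/* Server-side negotiation: fold the client's REQ values together with
 * the server's local CA limits to produce the REP values. */
static struct rd_limits negotiate_rep(struct rd_limits req,
                                      struct rd_limits local)
{
    struct rd_limits rep;
    rep.initiator_depth     = min_i(req.responder_resources,
                                    local.initiator_depth);
    rep.responder_resources = min_i(req.initiator_depth,
                                    local.responder_resources);
    return rep;
}
```

With the defaults quoted later in this thread for a Mellanox HCA (InitiatorDepth 128, ResponderResources 4), two such HCAs negotiate down to 4 outstanding reads in each direction; each side then mirrors the REP values into its QP attributes via modify QP.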
> >Tom, you raised this issue during iWARP's definition, and the debate was conducted at least several times. The outcome of these debates is reflected in iWARP and remains aligned with IB. So, unless you really want to have the IETF and IBTA go and modify their specs, I believe you'll have to deal with the issue just as other ULPs are doing today: be aware of the constraint and write the software accordingly. The open source community isn't really the right forum to change the iWARP and IB specifications at the end of the day. Build a case in the IETF and IBTA and let those bodies determine whether it is appropriate to modify their specs or not. And yes, it is a modification of the specs, and therefore of the hardware implementations as well, to address any interoperability requirements that would result (the change proposed could fragment the hardware offerings, as there are many thousands of devices in the market that would not necessarily support this change). > >Mike > > > > >At 12:07 PM 6/6/2006, Talpey, Thomas wrote: >>Todd, thanks for the set-up. I'm really glad we're having this discussion! >> >>Let me give an NFS/RDMA example to illustrate why this upper layer, >>at least, doesn't want the HCA doing its flow control, or resource >>management. >> >>NFS/RDMA is a credit-based protocol which allows many operations in >>progress at the server. Let's say the client is currently running with >>an RPC slot table of 100 requests (a typical value). >> >>Of these requests, some workload-specific percentage will be reads, >>writes, or metadata. All NFS operations consist of one send from >>client to server, some number of RDMA writes (for NFS reads) or >>RDMA reads (for NFS writes), then terminated with one send from >>server to client. >> >>The number of RDMA read or write operations per NFS op depends >>on the amount of data being read or written, and also on the memory >>registration strategy in use on the client. 
The highest-performing >>such strategy is an all-physical one, which results in one RDMA-able >>segment per physical page. NFS r/w requests are, by default, 32KB, >>or 8 pages typically. So, typically 8 RDMA requests (read or write) are >>the result. >> >>To illustrate, let's say the client is processing a multi-threaded >>workload, with (say) 50% reads, 20% writes, and 30% metadata >>such as lookup and getattr. A kernel build, for example. Therefore, >>of our 100 active operations, 50 are reads of 32KB each, 20 are >>writes of 32KB, and 30 are metadata (non-RDMA). >> >>To the server, this results in 100 requests, 100 replies, 400 RDMA >>writes, and 160 RDMA Reads. Of course, these overlap heavily due >>to the widely differing latency of each op and the highly distributed >>arrival times. But, for the example, this is a snapshot of current load. >> >>The latency of the metadata operations is quite low, because lookup >>and getattr are acting on what is effectively cached data. The reads >>and writes, however, are much longer, because they reference the >>filesystem. When disk queues are deep, they can take many ms. >> >>Imagine what happens if the client's IRD is 4 and the server ignores >>its local ORD. As soon as a write begins execution, the server posts >>8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads >>are sent; the fifth stalls, and stalls the send queue! Even when three >>RDMA Reads complete, the queue remains stalled; it doesn't unblock >>until the fourth is done and all the RDMA Reads have been initiated. >> >>But, what just happened to all the other server send traffic? All those >>metadata replies, and other reads which completed? They're stuck, >>waiting for that one write request. In my example, these number 99 NFS >>ops, i.e. 654 WRs! All for one NFS write! The client operation stream >>effectively became single-threaded. What good is the "rapid initiation >>of RDMA Reads" you describe in the face of this? 
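The traffic mix in the example above can be tallied mechanically. A quick sketch of the counting, purely illustrative, reproducing the stated numbers (one client-to-server send and one reply per NFS op, plus 8 RDMA segments per 32KB data op under all-physical registration):

```c
#include <assert.h>

struct server_load {
    int recvs;       /* one send from client per NFS op    */
    int replies;     /* one send back to client per NFS op */
    int rdma_writes; /* server pushes NFS read data        */
    int rdma_reads;  /* server pulls NFS write data        */
};

/* segs_per_op: RDMA segments per data op -- 8 for 32KB requests
 * with one RDMA-able segment per physical page. */
static struct server_load tally(int nfs_reads, int nfs_writes,
                                int nfs_meta, int segs_per_op)
{
    struct server_load l;
    int ops = nfs_reads + nfs_writes + nfs_meta;
    l.recvs       = ops;
    l.replies     = ops;
    l.rdma_writes = nfs_reads  * segs_per_op;
    l.rdma_reads  = nfs_writes * segs_per_op;
    return l;
}
```

For the 50/20/30 split of 100 active ops this yields 100 requests, 100 replies, 400 RDMA Writes, and 160 RDMA Reads at the server, matching the snapshot described above.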
>> >>Yes, there are many arcane and resource-intensive ways around it. >>But the simplest by far is to count the RDMA Reads outstanding, and >>for the *upper layer* to honor ORD, not the HCA. Then, the send queue >>never blocks, and the operation stream never loses parallelism. This >>is what our NFS server does. >> >>As to the depth of IRD, this is a different calculation; it's a delay-bandwidth product >>of the RDMA Read stream. 4 is good for local, low-latency connections. >>But over a complicated switch infrastructure, or heaven forbid a dark fiber >>long link, I guarantee it will cause a bottleneck. This isn't an issue except >>for operations that care, but it is certainly detectable. I would like to see >>if a pure RDMA Read stream can fully utilize a typical IB fabric, and how >>much headroom an IRD of 4 provides. Not much, I predict. >> >>Closing the connection if IRD is "insufficient to meet goals" isn't a good >>answer, IMO. How does that benefit interoperability? >> >>Thanks for the opportunity to spout off again. Comments welcome! >> >>Tom. >> >>At 12:43 PM 6/6/2006, Rimmer, Todd wrote: >>> >>> >>>> Talpey, Thomas >>>> Sent: Tuesday, June 06, 2006 10:49 AM >>>> >>>> At 10:40 AM 6/6/2006, Roland Dreier wrote: >>>> > Thomas> This is the difference between "may" and "must". The >>>value >>>> > Thomas> is provided, but I don't see anything in the spec that >>>> > Thomas> makes a requirement on its enforcement. Table 107 says >>>the >>>> > Thomas> consumer can query it, that's about as close as it >>>> > Thomas> comes. There's some discussion about CM exchange too. >>>> > >>>> >This seems like a very strained interpretation of the spec. For >>>> >>>> I don't see how strained has anything to do with it. It's not saying >>>> anything >>>> either way. So, a legal implementation can make either choice. We're >>>> talking about the spec! >>>> >>>> But, it really doesn't matter. 
The point is, an upper layer should be >>>> paying >>>> attention to the number of RDMA Reads it posts, or else suffer either >>>the >>>> queue-stalling or connection-failing consequences. Bad stuff either >>>way. >>>> >>>> Tom. >>> >>>Somewhere beneath this discussion is a bug in the application or IB >>>stack. I'm not sure which "may" in the spec you are referring to, but >>>the "may"s I have found all are for cases where the responder might >>>support only 1 outstanding request. In all cases the negotiation >>>protocol must be followed, and the requestor is not allowed to exceed the >>>negotiated limit. >>> >>>The mechanism should be: >>>the client queries its local HCA and determines responder resources (e.g. >>>the number of concurrent outstanding RDMA reads on the wire from the remote >>>end where this end will respond with the read data) and initiator depth >>>(e.g. the number of concurrent outstanding RDMA reads which this end can >>>initiate as the requestor). >>> >>>The client puts the above information in the CM REQ. >>> >>>The server similarly gets its information from its local CA and negotiates >>>down the values to the MIN of each side (REP.InitiatorDepth = >>>MIN(REQ.ResponderResources, server's local CA's initiator depth); >>>REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's >>>responder resources)). If the server does not support RDMA Reads, it can >>>REJ. >>> >>>If the client decides the negotiated values are insufficient to meet its >>>goals, it can disconnect. >>> >>>Each side sets its QP parameters via modify QP appropriately. 
Note they >>>too will be mirror images of each other: >>>client: >>>QP.Max RDMA Reads as Initiator = REP.ResponderResources >>>QP.Max RDMA reads as responder = REP.InitiatorDepth >>> >>>server: >>>QP.Max RDMA Reads as responder = REP.ResponderResources >>>QP.Max RDMA reads as initiator = REP.InitiatorDepth >>> >>>We have done a lot of high-stress RDMA Read traffic with Mellanox HCAs, >>>and provided the above negotiation is followed, we have seen no issues. >>>Note however that by default a Mellanox HCA typically reports a large >>>InitiatorDepth (128) and a modest ResponderResources (4-8). Hence when >>>I hear that Responder Resources must be grown to 128 for some >>>application to reliably work, it implies the negotiation I outlined >>>above is not being followed. >>> >>>Note that the ordering rules in table 76 of IBTA 1.2 show how reads and >>>writes on a send queue are ordered. There are many cases where an op can >>>pass an outstanding RDMA read, hence it is not always bad to queue extra >>>RDMA reads. If needed, the Fence can be used to force ordering. >>> >>>For many apps, it's going to be better to get the items onto the queue and >>>let the QP handle the outstanding-reads case rather than have the app >>>add a level of queuing for this purpose. Letting the HCA do the queuing >>>will allow for a more rapid initiation of subsequent reads. >>> >>>Todd Rimmer >> >> >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sweitzen at cisco.com Fri Jun 16 08:58:04 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 16 Jun 2006 08:58:04 -0700 Subject: [openib-general] OFED 1.0 - Official Release Message-ID: Tziporet, I see a few C code changes from pre1 in the form of patches. What are these and why were they added after pre1? 
$ diff -r OFED-1.0-pre1/SOURCES/openib-1.0/patches/ OFED-1.0/SOURCES/openib-1.0/patches/ 2>&1 | less ... Only in OFED-1.0-pre1/SOURCES/openib-1.0/patches/fixes: handle_reconnect_of_offline_host.patch Only in OFED-1.0/SOURCES/openib-1.0/patches/fixes: sdp_fix.patch Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Tziporet Koren Sent: Friday, June 16, 2006 1:55 AM To: OpenFabricsEWG; openib Subject: [openib-general] OFED 1.0 - Official Release I am happy to announce that the OFED 1.0 Official Release is now available. The release can be found under: https://openib.org/svn/gen2/branches/1.0/ofed/releases/ And later today it will be on the OpenFabrics download page: http://www.openfabrics.org/downloads.html. This is the first release that was done in a joint effort of the following companies: * Cisco * SilverStorm * Voltaire * QLogic * Intel * Mellanox Technologies I wish to thank all who contributed to the success of this release. Tziporet =============================================================================== Release summary: The OFED software package is composed of several software modules intended for use on a computer cluster constructed as an InfiniBand network. The OFED package contains the following components: o OpenFabrics core and ULPs: - HCA drivers (mthca, ipath) - core - Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Host, RDS and uDAPL o OpenFabrics utilities: - OpenSM: InfiniBand Subnet Manager - Diagnostic tools - Performance tests o MPI: - OSU MPI stack supporting the InfiniBand interface - Open MPI stack supporting the InfiniBand interface - MPI benchmark tests (OSU BW/LAT, Pallas, Presta) o Sources of all software modules (under conditions mentioned in the modules' LICENSE files) o Documentation Notes: 1. SDP and RDS are in technology preview state. 2. 
The SRP Initiator and Open MPI are in beta state. 3. All other OFED components are in production state. Supported Platforms and Operating Systems CPU architectures: * x86_64 * x86 * ia64 * ppc64 Linux Operating Systems: * RedHat EL4 up2: 2.6.9-22.ELsmp * RedHat EL4 up3: 2.6.9-34.ELsmp * Fedora C4: 2.6.11-1.1369_FC4 * SLES10 RC2: 2.6.16.16-1.6-smp (or RC 2.5 2.6.16.14-6-smp) * SLES10 RC1: 2.6.16.14-6-smp * SUSE 10 Pro: 2.6.13-15-smp * kernel.org: 2.6.16.x HCAs Supported Mellanox HCAs: - InfiniHost - InfiniHost III Ex (both modes: with memory and MemFree) - InfiniHost III Lx Both SDR and DDR modes of the InfiniHost III family are supported. For official FW versions please see: http://www.mellanox.com/support/firmware_table.php QLogic HCAs: - QHT6040 (PathScale InfiniPath HT-460) - QHT6140 (PathScale InfiniPath HT-465) - QLE6140 (PathScale InfiniPath PE-880) Switches Supported This release was tested with switches and gateways provided by the following companies: - Cisco - Voltaire - SilverStorm - Flextronics Attached are the release notes. Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Fri Jun 16 09:06:30 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:06:30 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <1150465355.29508.4.camel@stevo-desktop> References: <1150465355.29508.4.camel@stevo-desktop> Message-ID: <4492D706.4060106@ichips.intel.com> Steve Wise wrote: > Will the ucma make it into 2.6.18? I notice it's not in Roland's > for-2.6.18 tree right now. The plan is to allow the userspace interface to mature some before trying to merge it upstream. This is why it is not included in 2.6.18. 
- Sean From mshefty at ichips.intel.com Fri Jun 16 09:13:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:13:34 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: <4492D8AE.2010104@ichips.intel.com> James Lentini wrote: >>As an alternative, I don't think that there's any reason why the QP >>can't be transition to RTS when the CM REP is sent. > > I like this idea. It simplifies how ULPs handle this issue. Are there > any spec. compliance issues with this? There's no spec compliance issues that I can readily find. I will make a note to fix this, as well as handle the connection established event as Or suggested, but it will be a couple of weeks before I get to this. (I will be attending the workshop next week.) > If the passive side CM doesn't receive an RTU, the passive side CM > should retransmit the REP. At least that is how I read 12.9.8.6 > "Timeouts and Retries" in the IBTA spec. I can't find where this > happens in the code. Did I miss it? The MAD layer retries the CM messages, typically until the CM cancels the operation. - Sean From halr at voltaire.com Fri Jun 16 09:08:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Jun 2006 12:08:36 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: <1150474113.4506.102460.camel@hal.voltaire.com> On Fri, 2006-06-16 at 11:15, James Lentini wrote: [snip...] > > As an alternative, I don't think that there's any reason why the QP > > can't be transition to RTS when the CM REP is sent. > > I like this idea. It simplifies how ULPs handle this issue. Are there > any spec. compliance issues with this? 
IMO, it would violate the CM state machine and the passive CM transition specification in 12.9.7.2 and have the effect of circumventing the retransmission of REP on lost RTU. Data can't fly until either the RTU or the first data message is received from the other direction. -- Hal From mshefty at ichips.intel.com Fri Jun 16 09:20:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:20:07 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com> References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com> <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com> Message-ID: <4492DA37.4040402@ichips.intel.com> Or Gerlitz wrote: > This is what i was suspecting, Sean can you confirm that? if it does > not emulate RTU > reception, than what it does do? Both receiving an RTU and getting a connection established event move the connection into the established state. They generate different events to the user of the IB CM because RTUs carry private data. - Sean From rdreier at cisco.com Fri Jun 16 09:22:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 09:22:29 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: <1150410898.32252.69.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 15 Jun 2006 15:34:58 -0700") References: <1150396280.32252.46.camel@brick.pathscale.com> <1150407704.32252.65.camel@brick.pathscale.com> <1150410898.32252.69.camel@brick.pathscale.com> Message-ID: Roland> Hmm... it seems simpler to have userspace allocate the Roland> memory with mmap() before the resize_cq call, and then Roland> pass that new buffer into the resize_cq call. That way Roland> you don't have a window where the kernel is putting Roland> completions into a buffer that userspace doesn't know Roland> about. Ralph> Perhaps. 
But this way, the code is the same for kernel and Ralph> user allocated queues. I guess there is some benefit there. Ralph> Or the new kernel driver needs to handle the old way and Ralph> the new way. Yeah. From rdreier at cisco.com Fri Jun 16 09:25:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 09:25:38 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [1 of 2] In-Reply-To: <1150411254.32252.76.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 15 Jun 2006 15:40:54 -0700") References: <1150411254.32252.76.camel@brick.pathscale.com> Message-ID: > + /* Unmap the old queue so we can resize it. */ > + size = sizeof(struct ipath_cq_wc) + > + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); > + (void) munmap(cq->queue, size); > + ret = ibv_cmd_resize_cq_resp(ibcq, cqe, &cmd, sizeof cmd, > + &resp.ibv_resp, sizeof resp); > + if (ret) { > + pthread_spin_unlock(&cq->lock); > + return ret; > + } It seems that this method of returning a new buffer address to mmap from the resize operation leads to some really nasty error handling though. If the resize operation fails (either because of bad userspace values or because the kernel is out of memory and can't allocate a new buffer) then the old CQ is gone, possibly beyond recovery. mthca avoids all this by allocating a resize buffer in advance. - R. 
From mshefty at ichips.intel.com Fri Jun 16 09:31:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:31:36 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <1150474113.4506.102460.camel@hal.voltaire.com> References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> <1150474113.4506.102460.camel@hal.voltaire.com> Message-ID: <4492DCE8.2000408@ichips.intel.com> Hal Rosenstock wrote: > IMO, it would violate the CM state machine and the passive CM transition > specification in 12.9.7.2 and have the effect of circumventing the > retransmission of REP on lost RTU. Data can't fly until either the RTU > or the first data message is received from the other direction. This moves the QP state to RTS, as opposed to the CEP state to connected. So I don't believe that it violates the spec. A drawback to moving the QP to RTS is that the communication established event will not be generated. This forces us to wait for the RTU to move the CEP to connected, or we need to do it upon receiving the first completion. The RDMA CM has no knowledge when the latter occurs, so would need user input. - Sean From rdreier at cisco.com Fri Jun 16 09:36:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 09:36:51 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> (Sean Hefty's message of "Thu, 15 Jun 2006 15:04:57 -0700") References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: >I suggest the following design: the CMA would replace the event handler >provided with the qp_init_attr struct with a callback of its own and >keep the original handler/context on a private structure. This is probably fine. There is one further situation where the connection needs to be established, beyond RTU and the communication established async event. 
Namely, if a receive completion is polled. Since async events are, well, asynchronous, there's no guarantee that the communication established event will be reported any time soon... From halr at voltaire.com Fri Jun 16 09:37:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Jun 2006 12:37:39 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <4492DCE8.2000408@ichips.intel.com> References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> <1150474113.4506.102460.camel@hal.voltaire.com> <4492DCE8.2000408@ichips.intel.com> Message-ID: <1150475858.4506.103650.camel@hal.voltaire.com> On Fri, 2006-06-16 at 12:31, Sean Hefty wrote: > Hal Rosenstock wrote: > > IMO, it would violate the CM state machine and the passive CM transition > > specification in 12.9.7.2 and have the effect of circumventing the > > retransmission of REP on lost RTU. Data can't fly until either the RTU > > or the first data message is received from the other direction. > > This moves the QP state to RTS, as opposed to the CEP state to connected. So I > don't believe that it violates the spec. Isn't the CEP the QP (see p. 689 line 7) ? > A drawback to moving the QP to RTS is that the communication established event > will not be generated. This forces us to wait for the RTU to move the CEP to > connected, or we need to do it upon receiving the first completion. > The RDMA CM has no knowledge when the latter occurs, so would need user input. It sounds like I may have been looking at the wrong state but nonetheless the CEP/QP states are defined there and this would be different from what is in the spec. I wasn't saying it couldn't be made to work though. I haven't looked at it enough to know. If it does work, maybe the spec should get updated to cover this option too. 
-- Hal > - Sean From johann.george at qlogic.com Fri Jun 16 09:59:16 2006 From: johann.george at qlogic.com (Johann George) Date: Fri, 16 Jun 2006 09:59:16 -0700 Subject: [openib-general] OFED 1.0 - Official Release In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA724C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA724C@mtlexch01.mtl.com> Message-ID: <20060616165916.GA1866@cuprite.pathscale.com> > I am happy to announce that OFED 1.0 Official Release is now available. Congratulations to everyone involved; and especially to you, Tziporet. You have done a fabulous job in pulling this all together. Johann From mshefty at ichips.intel.com Fri Jun 16 09:58:50 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:58:50 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <1150475858.4506.103650.camel@hal.voltaire.com> References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> <1150474113.4506.102460.camel@hal.voltaire.com> <4492DCE8.2000408@ichips.intel.com> <1150475858.4506.103650.camel@hal.voltaire.com> Message-ID: <4492E34A.30209@ichips.intel.com> Hal Rosenstock wrote: >>This moves the QP state to RTS, as opposed to the CEP state to connected. So I >>don't believe that it violates the spec. > > > Isn't the CEP the QP (see p. 689 line 7) ? Hmm... I was viewing the CEP as moving through the states described in 12.9.5 and 12.9.6. (Idle, REQ sent, REP wait, etc.) I see what you're saying now. > It sounds like I may have been looking at the wrong state but > nonetheless the CEP/QP states are defined there and this would be > different from what is in the spec. I wasn't saying it couldn't be made > to work though. I haven't looked at it enough to know. If it does work, > maybe the spec should get updated to cover this option too. What I'd like to find is a way that a user, upon receiving a message, can send a response. 
Today, a user cannot send the response until after they get a connection established event from the IB CM, and then the RDMA CM. So, it sounds like even the RDMA CM needs some sort of rdma_establish() call to finish connecting a QP. I don't think that iWarp would run into this issue. - Sean From ralphc at pathscale.com Fri Jun 16 10:06:34 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 10:06:34 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [1 of 2] In-Reply-To: References: <1150411254.32252.76.camel@brick.pathscale.com> Message-ID: <1150477594.32252.89.camel@brick.pathscale.com> On Fri, 2006-06-16 at 09:25 -0700, Roland Dreier wrote: > > + /* Unmap the old queue so we can resize it. */ > > + size = sizeof(struct ipath_cq_wc) + > > + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); > > + (void) munmap(cq->queue, size); > > + ret = ibv_cmd_resize_cq_resp(ibcq, cqe, &cmd, sizeof cmd, > > + &resp.ibv_resp, sizeof resp); > > + if (ret) { > > + pthread_spin_unlock(&cq->lock); > > + return ret; > > + } > > It seems that this method of returning a new buffer address to mmap > from the resize operation leads to some really nasty error handling > though. If the resize operation fails (either because of bad > userspace values or because the kernel is out of memory and can't > allocate a new buffer) then the old CQ is gone, possibly beyond recovery. > > mthca avoids all this by allocating a resize buffer in advance. > > - R. I agree. The kernel driver is careful to allocate the new queue and copy the old contents to the new one atomically. The issue is making sure the old queue isn't still being used by some other thread. I guess if the semantics for resize are that on error the old mmap is still valid, and on success the old mmap is invalid, then there isn't an error-recovery issue. All user-level threads lock before using the queue address, so the change of address is protected. 
-- Ralph Campbell From rdreier at cisco.com Fri Jun 16 10:13:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 10:13:57 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [1 of 2] In-Reply-To: <1150477594.32252.89.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 16 Jun 2006 10:06:34 -0700") References: <1150411254.32252.76.camel@brick.pathscale.com> <1150477594.32252.89.camel@brick.pathscale.com> Message-ID: Ralph> I agree. The kernel driver is careful to allocate the new Ralph> queue and copy the old contents to the new one Ralph> atomically. The issue is making sure the old queue isn't Ralph> still being used by some other thread. I guess if the Ralph> semantics for resize are that on error the old mmap is Ralph> still valid, and on success the old mmap is invalid, then Ralph> there isn't an error-recovery issue. All user-level Ralph> threads lock before using the queue address, so the Ralph> change of address is protected. Those seem like the only sane semantics -- if the operation fails, then the state of the CQ shouldn't change. - R. From mamidala at cse.ohio-state.edu Fri Jun 16 10:40:38 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Fri, 16 Jun 2006 13:40:38 -0400 (EDT) Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: <1150471411.29508.17.camel@stevo-desktop> Message-ID: Hi, I tried using gdb, but it also hangs at the end. The system used is an IA32 platform using Red Hat Enterprise Linux AS release 4 (Nahant Update 3). 
Or we've got a > bug in our refcnts on the iw_cm_ids in the kernel. Can you get a gdb > stack trace when its stalled? And if you kdb, a kernel mode stack > trace of the same thread would be nice too... > > What systems/distros/etc are you running this on? > > Thanks, > > Stevo. > > > > > Thanks, > > Amith > > > > On Thu, 15 Jun 2006, Steve Wise wrote: > > > > > This is the normal output for rping... > > > > > > The status error on the completion is 5 (FLUSHED), which is normal. > > > > > > Steve. > > > > > > > > > On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > > > > Hi, > > > > > > > > With the latest rping code (Revision: 8055) I am still able to see this > > > > race condition. > > > > > > > > server side: > > > > > > > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > > > > server ping data: rdma-ping-0: ABCDEFGHIJKL > > > > server ping data: rdma-ping-1: BCDEFGHIJKLM > > > > server ping data: rdma-ping-2: CDEFGHIJKLMN > > > > server ping data: rdma-ping-3: DEFGHIJKLMNO > > > > server ping data: rdma-ping-4: EFGHIJKLMNOP > > > > server ping data: rdma-ping-5: FGHIJKLMNOPQ > > > > server ping data: rdma-ping-6: GHIJKLMNOPQR > > > > server ping data: rdma-ping-7: HIJKLMNOPQRS > > > > server ping data: rdma-ping-8: IJKLMNOPQRST > > > > server ping data: rdma-ping-9: JKLMNOPQRSTU > > > > server DISCONNECT EVENT... 
> > > > wait for RDMA_READ_ADV state 9 > > > > cq completion failed status 5 > > > > > > > > Client side: > > > > > > > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > > > > ping data: rdma-ping-0: ABCDEFGHIJKL > > > > ping data: rdma-ping-1: BCDEFGHIJKLM > > > > ping data: rdma-ping-2: CDEFGHIJKLMN > > > > ping data: rdma-ping-3: DEFGHIJKLMNO > > > > ping data: rdma-ping-4: EFGHIJKLMNOP > > > > ping data: rdma-ping-5: FGHIJKLMNOPQ > > > > ping data: rdma-ping-6: GHIJKLMNOPQR > > > > ping data: rdma-ping-7: HIJKLMNOPQRS > > > > ping data: rdma-ping-8: IJKLMNOPQRST > > > > ping data: rdma-ping-9: JKLMNOPQRSTU > > > > cq completion failed status 5 > > > > client DISCONNECT EVENT... > > > > > > > > > > > > Thanks, > > > > Amith > > > > > > > > > > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > > > > > > > Thanks, applied. > > > > > > > > > > iwarp branch: r7964 > > > > > trunk: r7966 > > > > > > > > > > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > > > > This patch resolves a race condition between the receipt of > > > > > > a connection established event and a receive completion from > > > > > > the client. The server no longer goes to connected state but > > > > > > merely waits for the READ_ADV state to begin its looping. This > > > > > > keeps the server from going back to CONNECTED from the later > > > > > > states if the connection established event comes in after the > > > > > > receive completion (i.e. the loop starts). 
> > > > > > > > > > > > Signed-off-by: Boyd Faulkner > > > > > > > > > > > > > > > _______________________________________________ > > > > > openib-general mailing list > > > > > openib-general at openib.org > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > From Sujal at Mellanox.com Fri Jun 16 10:54:18 2006 From: Sujal at Mellanox.com (Sujal Das) Date: Fri, 16 Jun 2006 10:54:18 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0 - Official Release Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F1A8F76@mtiexch01.mti.com> Yes, this is a great achievement. Congrats! -----Original Message----- From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Johann George Sent: Friday, June 16, 2006 9:59 AM To: Tziporet Koren Cc: OpenFabricsEWG; openib Subject: Re: [openfabrics-ewg] [openib-general] OFED 1.0 - Official Release > I am happy to announce that OFED 1.0 Official Release is now available. Congratulations to everyone involved; and especially to you, Tziporet. You have done a fabulous job in pulling this all together. Johann _______________________________________________ openfabrics-ewg mailing list openfabrics-ewg at openib.org http://openib.org/mailman/listinfo/openfabrics-ewg From swise at opengridcomputing.com Fri Jun 16 11:43:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 Jun 2006 13:43:53 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: References: Message-ID: <1150483433.29508.28.camel@stevo-desktop> On Fri, 2006-06-16 at 13:40 -0400, amith rajith mamidala wrote: > Hi, > > I tried using gdb but it also hangs at the end. The system used is a IA32 > platform using Red Hat Enterprise Linux AS release 4 (Nahant Update 3). 
> kernel info:Linux k63-oib 2.6.16.20 #2 SMP Wed Jun 14 15:02:47 EDT 2006 > i686 i686 i386 GNU/Linux, > Try breaking in rdma_destroy_id() and see if it ever returns from that function... STevo. > Thanks, > Amith > > > On Fri, 16 Jun 2006, Steve Wise wrote: > > > On Fri, 2006-06-16 at 11:20 -0400, amith rajith mamidala wrote: > > > Hi Steve, > > > > > > The rping also doesn't exit after printing these error messages. Is this > > > expected? > > > > > > > It should exit! :-( > > > > Maybe rping is not acking all the CM or Async events? Or we've got a > > bug in our refcnts on the iw_cm_ids in the kernel. Can you get a gdb > > stack trace when its stalled? And if you kdb, a kernel mode stack > > trace of the same thread would be nice too... > > > > What systems/distros/etc are you running this on? > > > > Thanks, > > > > Stevo. > > > > > > > > > Thanks, > > > Amith > > > > > > On Thu, 15 Jun 2006, Steve Wise wrote: > > > > > > > This is the normal output for rping... > > > > > > > > The status error on the completion is 5 (FLUSHED), which is normal. > > > > > > > > Steve. > > > > > > > > > > > > On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > > > > > Hi, > > > > > > > > > > With the latest rping code (Revision: 8055) I am still able to see this > > > > > race condition. 
> > > > > > > > > > server side: > > > > > > > > > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > > > > > server ping data: rdma-ping-0: ABCDEFGHIJKL > > > > > server ping data: rdma-ping-1: BCDEFGHIJKLM > > > > > server ping data: rdma-ping-2: CDEFGHIJKLMN > > > > > server ping data: rdma-ping-3: DEFGHIJKLMNO > > > > > server ping data: rdma-ping-4: EFGHIJKLMNOP > > > > > server ping data: rdma-ping-5: FGHIJKLMNOPQ > > > > > server ping data: rdma-ping-6: GHIJKLMNOPQR > > > > > server ping data: rdma-ping-7: HIJKLMNOPQRS > > > > > server ping data: rdma-ping-8: IJKLMNOPQRST > > > > > server ping data: rdma-ping-9: JKLMNOPQRSTU > > > > > server DISCONNECT EVENT... > > > > > wait for RDMA_READ_ADV state 9 > > > > > cq completion failed status 5 > > > > > > > > > > Client side: > > > > > > > > > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > > > > > ping data: rdma-ping-0: ABCDEFGHIJKL > > > > > ping data: rdma-ping-1: BCDEFGHIJKLM > > > > > ping data: rdma-ping-2: CDEFGHIJKLMN > > > > > ping data: rdma-ping-3: DEFGHIJKLMNO > > > > > ping data: rdma-ping-4: EFGHIJKLMNOP > > > > > ping data: rdma-ping-5: FGHIJKLMNOPQ > > > > > ping data: rdma-ping-6: GHIJKLMNOPQR > > > > > ping data: rdma-ping-7: HIJKLMNOPQRS > > > > > ping data: rdma-ping-8: IJKLMNOPQRST > > > > > ping data: rdma-ping-9: JKLMNOPQRSTU > > > > > cq completion failed status 5 > > > > > client DISCONNECT EVENT... > > > > > > > > > > > > > > > Thanks, > > > > > Amith > > > > > > > > > > > > > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > > > > > > > > > Thanks, applied. > > > > > > > > > > > > iwarp branch: r7964 > > > > > > trunk: r7966 > > > > > > > > > > > > > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > > > > > This patch resolves a race condition between the receipt of > > > > > > > a connection established event and a receive completion from > > > > > > > the client. 
The server no longer goes to connected state but > > > > > > > merely waits for the READ_ADV state to begin its looping. This > > > > > > > keeps the server from going back to CONNECTED from the later > > > > > > > states if the connection established event comes in after the > > > > > > > receive completion (i.e. the loop starts). > > > > > > > > > > > > > > Signed-off-by: Boyd Faulkner > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > openib-general mailing list > > > > > > openib-general at openib.org > > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > From mshefty at ichips.intel.com Fri Jun 16 11:52:24 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 11:52:24 -0700 Subject: [openib-general] [PATCH 1/5] ib_addr: retrieve MGID from device address In-Reply-To: <000301c68dd5$5f569ca0$68fc070a@amr.corp.intel.com> References: <000301c68dd5$5f569ca0$68fc070a@amr.corp.intel.com> Message-ID: <4492FDE8.6080404@ichips.intel.com> Sean Hefty wrote: >>dev_addr->broadcast + 4/dev_addr->src_dev_addr + 4 may not be naturally >>aligned, >>so casting this pointer to structure type may cause compiler to generate >>incorrect code. > > Thanks - I'll update this. An update for this ends up working out better as a separate patch. Fixes are needed in the existing cma and multicast code. 
- Sean From johnip at sgi.com Fri Jun 16 12:51:06 2006 From: johnip at sgi.com (John Partridge) Date: Fri, 16 Jun 2006 14:51:06 -0500 Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 Message-ID: <44930BAA.6030300@sgi.com> I am trying to run the example from MPI_README.txt (and other MPI apps like pallas), but I keep getting a Couldn't modify SRQ limit error message :- mig129:~/OFED-1.0-pre1 # /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/bin/mpirun_rsh -rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/tests/osutests-1.0/bw 1000 16 [1] Abort: Couldn't modify SRQ limit at line 995 in file viainit.c mpirun_rsh: Abort signaled from [1] [0] Abort: [mig125:0] Got completion with error, code=12 at line 2143 in file viacheck.c done. I am using OFED-1.0-pre1 (kernel modules are from OFED-1.0-pre1 also) OS is SLES10 SUSE Linux Enterprise Server 10 (ia64) VERSION = 10 HW is SGI Altix ia64 Can anyone help please ? Thanks John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From trimmer at silverstorm.com Fri Jun 16 13:02:47 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 16 Jun 2006 16:02:47 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling Message-ID: > -----Original Message----- > From: Or Gerlitz; openib-general > > In most cases, I would expect that the IB CM will eventually receive the > RTU, > > which will generate an event to the RDMA CM to transition the QP into > RTS. > > But we want an IB stack and set of ULPs which would work in production so > they > need to handle also irregular cases... eg when the RTU is lost over and > over. Agreed. The missing RTU case must be handled for a few reasons: 1. The RTU could honestly be lost (GSI QPs are UD, they could overflow, the fabric could lose the packet, etc.) 2. 
The RC send could beat the processing of the RTU (packets on wire may be out of order if there are different SLs/VLs involved with GSI vs application QP). Also, it's possible the CM is slower getting to its queue of packets (such as when bombarded by many connections) while the application/ULP gets its RC send quickly. [I have observed this situation in various real-world stress tests.] This problem is quite simple to handle (I did it a few years ago in the SilverStorm stack) and the IB spec completely covers this issue: CM - have a hook so the CM can get the Async Events for all CAs. On getting the Async Event for the first packet received while in RTR (Communication established), the CM should treat this exactly like an RTU (with no private data). The CM will need to cross-reference the CA/QP this event was reported for to identify the applicable connection endpoint. If you check the IBTA spec and the CM state machines you will see the CM is supposed to handle this event. Also, if the RTU does arrive later, the CM state machine handles that correctly by discarding the RTU as if it were a duplicate. Note: this is why applications should not depend on private data in the RTU. ULPs - all ULPs should be written so they are fully ready to process inbound data before they tell the CM to send the REP. It is very likely the ULP will get a CQ completion for the inbound RQ data before the CM has completed its processing. In general, IB allows for this situation quite nicely. The ULP can process the inbound data normally and queue it to the Send Q. Putting data on a Send Q is permitted in RTR, but the QP will not initiate sending until moved to RTS. As such, the ULP can let the CM RTU processing (which will race with the RQ data completion) do its normal thing and move the QP to RTS. 
Todd Rimmer From boris at mellanox.com Fri Jun 16 13:23:28 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Fri, 16 Jun 2006 13:23:28 -0700 Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 Message-ID: <1E3DCD1C63492545881FACB6063A57C1324280@mtiexch01.mti.com> Hi John, Most probably you need to upgrade the FW on your HCAs. See the following section from the MVAPICH 0.9.7 User Guide: 7.2.5 Couldn't modify SRQ limit This means that your HCA card doesn't support the ibv_modify_srq feature. Please upgrade the firmware version and OpenIB Gen2 libraries on your cluster. You can obtain the latest Mellanox firmware images from this webpage. If, even after updating your firmware and OpenIB Gen2 libraries, you continue to experience this problem, please edit make.mvapich.gcc and replace -DMEMORY_SCALE with -DADAPTIVE_RDMA_FAST_PATH. After making this change you need to re-build the MVAPICH library. Note that you should first try to update your firmware and OpenIB Gen2 libraries before taking this measure. If you believe that your HCA supports this feature, yet you are experiencing this problem, please contact the MVAPICH community at mvapich-discuss at cse.ohio-state.edu. Regards, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 
2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of John Partridge Sent: Friday, June 16, 2006 12:51 PM To: openib-general at openib.org Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 I am trying to run the example from MPI_README.txt (and other MPI apps like pallas), but I keep getting a Couldn't modify SRQ limit error message :- mig129:~/OFED-1.0-pre1 # /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/bin/mpirun_rsh -rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/tests/osutests-1.0/bw 1000 16 [1] Abort: Couldn't modify SRQ limit at line 995 in file viainit.c mpirun_rsh: Abort signaled from [1] [0] Abort: [mig125:0] Got completion with error, code=12 at line 2143 in file viacheck.c done. I am using OFED-1.0-pre1 (kernel modules are from OFED-1.0-pre1 also) OS is SLES10 SUSE Linux Enterprise Server 10 (ia64) VERSION = 10 HW is SGI Altix ia64 Can anyone help please ? Thanks John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ralphc at pathscale.com Fri Jun 16 13:30:31 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 13:30:31 -0700 Subject: [openib-general] [PATCH] ib_uverbs_create_ah() doesn't initialize ib_uobject.object pointer Message-ID: <1150489831.32252.102.camel@brick.pathscale.com> I get a NULL pointer panic when trying to use the current trunk SVN (rev 8088). I traced it down to ib_uverbs_create_ah() failing to initialize the ib_uobject.object pointer. 
Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/core/uverbs_cmd.c =================================================================== --- src/linux-kernel/infiniband/core/uverbs_cmd.c (revision 8088) +++ src/linux-kernel/infiniband/core/uverbs_cmd.c (working copy) @@ -1779,6 +1779,7 @@ } ah->uobject = uobj; + uobj->object = ah; ret = idr_add_uobj(&ib_uverbs_ah_idr, uobj); if (ret) goto err_destroy; -- Ralph Campbell From rdreier at cisco.com Fri Jun 16 13:38:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 13:38:16 -0700 Subject: [openib-general] [PATCH] ib_uverbs_create_ah() doesn't initialize ib_uobject.object pointer In-Reply-To: <1150489831.32252.102.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 16 Jun 2006 13:30:31 -0700") References: <1150489831.32252.102.camel@brick.pathscale.com> Message-ID: Thanks, applied. From johnip at sgi.com Fri Jun 16 13:51:07 2006 From: johnip at sgi.com (John Partridge) Date: Fri, 16 Jun 2006 15:51:07 -0500 Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 In-Reply-To: <1E3DCD1C63492545881FACB6063A57C1324280@mtiexch01.mti.com> References: <1E3DCD1C63492545881FACB6063A57C1324280@mtiexch01.mti.com> Message-ID: <449319BB.3070702@sgi.com> Thank You Boris that seems to have fixed it. Regards John Boris Shpolyansky wrote: > Hi John, > > Most probably you need to upgrade the FW on your HCAs. > See the following section from MVAPICH 0.9.7 User Guide: > > 7.2.5 Couldn't modify SRQ limit > > This means that your HCA card doesn't support the ibv_modify_srq > feature. Please upgrade > the firmware version and OpenIB Gen2 libraries on your cluster. You can > obtain the latest > Mellanox firmware images from this webpage. > Even after updating your firmware and OpenIB Gen2 libraries, you > continue to experience > this problem, please edit make.mvapich.gcc and replace -DMEMORY_SCALE > with > -DADAPTIVE_RDMA_FAST_PATH. 
After making this change you need to re-build > the MVAPICH > library. Note that you should first try to update your firmware and > OpenIB Gen2 libraries > before taking this measure. > If you believe that your HCA supports this feature, yet you are > experiencing this problem, > please contact the MVAPICH community at > mvapich-discuss at cse.ohio-state.edu. > > Regards, > Boris Shpolyansky > Application Engineer > Mellanox Technologies Inc. > 2900 Stender Way > Santa Clara, CA 95054 > Tel.: (408) 916 0014 > Fax: (408) 970 3403 > Cell: (408) 834 9365 > www.mellanox.com > > > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of John Partridge > Sent: Friday, June 16, 2006 12:51 PM > To: openib-general at openib.org > Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 > > I am trying to run the example from MPI_README.txt (and other MPI apps > like pallas), but I keep getting a Couldn't modify SRQ limit error > message :- > > mig129:~/OFED-1.0-pre1 # > /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/bin/mpirun_rsh -rsh -np 2 > -hostfile /root/cluster > /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/tests/osutests-1.0/bw > 1000 16 [1] Abort: Couldn't modify SRQ limit > at line 995 in file viainit.c > mpirun_rsh: Abort signaled from [1] > [0] Abort: [mig125:0] Got completion with error, code=12 > at line 2143 in file viacheck.c > done. > > I am using OFED-1.0-pre1 (kernel modules are from OFED-1.0-pre1 also) OS > is SLES10 SUSE Linux Enterprise Server 10 (ia64) VERSION = 10 > > HW is SGI Altix ia64 > > Can anyone help please ? 
> > Thanks > John > > -- > John Partridge > > Silicon Graphics Inc > Tel: 651-683-3428 > Vnet: 233-3428 > E-Mail: johnip at sgi.com > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From mshefty at ichips.intel.com Fri Jun 16 14:24:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 14:24:19 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: Message-ID: <44932183.8040107@ichips.intel.com> Rimmer, Todd wrote: > CM - have a hook so the CM can get the Async Events for all CAs. On > getting the Async Event for packet first packet received while in RTR > (Communication established), the CM should treat this exactly like an > RTU (with no private data). The CM will need to cross reference the > CA/QP this event was reported for to identify the applicable connection > endpoint. If you check the IBTA spec and the CM state machines you will > see the CM is supposed to handle this event. Also if the RTU does > arrive later, the CM state machine also handles that correctly by > discarding the RTU as if it was a duplicate. Note: this is why > applications should not depend on private data in the RTU. The IB CM has this capability, and behaves as indicated. The missing piece is for the RDMA CM to handle this situation. I believe that Or's approach of replacing the user's QP handler with the CMA's will fix this. > has completed its processing. In general IB allows for this situation > quite nicely. The ULP can process the inbound data normally and queue > it to the Send Q. 
Putting data on a Send Q is permitted in RTR, but the This is a good point, which indicates to me that nothing more is needed than handling the communication established event by the RDMA CM. - Sean From rdreier at cisco.com Fri Jun 16 15:05:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 15:05:54 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: (Roland Dreier's message of "Tue, 13 Jun 2006 10:55:57 -0700") References: <20060613051149.GE4621@mellanox.co.il> Message-ID: OK, I think that the modify_qp, modify_srq and resize_cq calls all need to be serialized. Unfortunately the only good way I could see to serialize these calls is to add a mutex to mthca's CQ, QP and SRQ structures (which bloats the structures somewhat). The patch I committed is below -- with this change I think we're OK even if userspace does crazy multithreaded stuff. IB/mthca: Make all device methods truly reentrant Documentation/infiniband/core_locking.txt says: All of the methods in struct ib_device exported by a low-level driver must be fully reentrant. The low-level driver is required to perform all synchronization necessary to maintain consistency, even if multiple function calls using the same object are run simultaneously. However, mthca's modify_qp, modify_srq and resize_cq methods are currently not reentrant. Add a mutex to the QP, SRQ and CQ structures so that these calls can be properly serialized. 
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 205854e..f20a5b6 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -813,6 +813,7 @@ int mthca_init_cq(struct mthca_dev *dev, spin_lock_init(&cq->lock); cq->refcount = 1; init_waitqueue_head(&cq->wait); + mutex_init(&cq->mutex); memset(cq_context, 0, sizeof *cq_context); cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 8f89ba7..230ae21 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -793,18 +793,24 @@ static int mthca_resize_cq(struct ib_cq if (entries < 1 || entries > dev->limits.max_cqes) return -EINVAL; + mutex_lock(&cq->mutex); + entries = roundup_pow_of_two(entries + 1); - if (entries == ibcq->cqe + 1) - return 0; + if (entries == ibcq->cqe + 1) { + ret = 0; + goto out; + } if (cq->is_kernel) { ret = mthca_alloc_resize_buf(dev, cq, entries); if (ret) - return ret; + goto out; lkey = cq->resize_buf->buf.mr.ibmr.lkey; } else { - if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) - return -EFAULT; + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + ret = -EFAULT; + goto out; + } lkey = ucmd.lkey; } @@ -821,7 +827,7 @@ static int mthca_resize_cq(struct ib_cq cq->resize_buf = NULL; spin_unlock_irq(&cq->lock); } - return ret; + goto out; } if (cq->is_kernel) { @@ -848,7 +854,10 @@ static int mthca_resize_cq(struct ib_cq } else ibcq->cqe = entries - 1; - return 0; +out: + mutex_unlock(&cq->mutex); + + return ret; } static int mthca_destroy_cq(struct ib_cq *cq) diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 322bc32..16c387d 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -536,6 +536,8 @@ int 
mthca_modify_qp(struct ib_qp *ibqp, u8 status; int err = -EINVAL; + mutex_lock(&qp->mutex); + if (attr_mask & IB_QP_CUR_STATE) { cur_state = attr->cur_qp_state; } else { @@ -553,39 +555,41 @@ int mthca_modify_qp(struct ib_qp *ibqp, "%d->%d with attr 0x%08x\n", qp->transport, cur_state, new_state, attr_mask); - return -EINVAL; + goto out; } if ((attr_mask & IB_QP_PKEY_INDEX) && attr->pkey_index >= dev->limits.pkey_table_len) { mthca_dbg(dev, "P_Key index (%u) too large. max is %d\n", attr->pkey_index, dev->limits.pkey_table_len-1); - return -EINVAL; + goto out; } if ((attr_mask & IB_QP_PORT) && (attr->port_num == 0 || attr->port_num > dev->limits.num_ports)) { mthca_dbg(dev, "Port number (%u) is invalid\n", attr->port_num); - return -EINVAL; + goto out; } if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && attr->max_rd_atomic > dev->limits.max_qp_init_rdma) { mthca_dbg(dev, "Max rdma_atomic as initiator %u too large (max is %d)\n", attr->max_rd_atomic, dev->limits.max_qp_init_rdma); - return -EINVAL; + goto out; } if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC && attr->max_dest_rd_atomic > 1 << dev->qp_table.rdb_shift) { mthca_dbg(dev, "Max rdma_atomic as responder %u too large (max %d)\n", attr->max_dest_rd_atomic, 1 << dev->qp_table.rdb_shift); - return -EINVAL; + goto out; } mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); - if (IS_ERR(mailbox)) - return PTR_ERR(mailbox); + if (IS_ERR(mailbox)) { + err = PTR_ERR(mailbox); + goto out; + } qp_param = mailbox->buf; qp_context = &qp_param->context; memset(qp_param, 0, sizeof *qp_param); @@ -618,7 +622,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (attr->path_mtu < IB_MTU_256 || attr->path_mtu > IB_MTU_2048) { mthca_dbg(dev, "path MTU (%u) is invalid\n", attr->path_mtu); - goto out; + goto out_mailbox; } qp_context->mtu_msgmax = (attr->path_mtu << 5) | 31; } @@ -672,7 +676,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (attr_mask & IB_QP_AV) { if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path, attr_mask & IB_QP_PORT ? 
attr->port_num : qp->port)) - goto out; + goto out_mailbox; qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); } @@ -686,18 +690,18 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (attr->alt_pkey_index >= dev->limits.pkey_table_len) { mthca_dbg(dev, "Alternate P_Key index (%u) too large. max is %d\n", attr->alt_pkey_index, dev->limits.pkey_table_len-1); - goto out; + goto out_mailbox; } if (attr->alt_port_num == 0 || attr->alt_port_num > dev->limits.num_ports) { mthca_dbg(dev, "Alternate port number (%u) is invalid\n", attr->alt_port_num); - goto out; + goto out_mailbox; } if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path, attr->alt_ah_attr.port_num)) - goto out; + goto out_mailbox; qp_context->alt_path.port_pkey |= cpu_to_be32(attr->alt_pkey_index | attr->alt_port_num << 24); @@ -793,12 +797,12 @@ int mthca_modify_qp(struct ib_qp *ibqp, err = mthca_MODIFY_QP(dev, cur_state, new_state, qp->qpn, 0, mailbox, sqd_event, &status); if (err) - goto out; + goto out_mailbox; if (status) { mthca_warn(dev, "modify QP %d->%d returned status %02x.\n", cur_state, new_state, status); err = -EINVAL; - goto out; + goto out_mailbox; } qp->state = new_state; @@ -853,8 +857,11 @@ int mthca_modify_qp(struct ib_qp *ibqp, } } -out: +out_mailbox: mthca_free_mailbox(dev, mailbox); + +out: + mutex_unlock(&qp->mutex); return err; } @@ -1100,6 +1107,7 @@ static int mthca_alloc_qp_common(struct qp->refcount = 1; init_waitqueue_head(&qp->wait); + mutex_init(&qp->mutex); qp->state = IB_QPS_RESET; qp->atomic_rd_en = 0; qp->resp_depth = 0; diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index b292fef..fab417c 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -243,6 +243,7 @@ int mthca_alloc_srq(struct mthca_dev *de spin_lock_init(&srq->lock); srq->refcount = 1; init_waitqueue_head(&srq->wait); + mutex_init(&srq->mutex); if (mthca_is_memfree(dev)) 
mthca_arbel_init_srq_context(dev, pd, srq, mailbox->buf); @@ -371,7 +372,11 @@ int mthca_modify_srq(struct ib_srq *ibsr if (attr_mask & IB_SRQ_LIMIT) { if (attr->srq_limit > srq->max) return -EINVAL; + + mutex_lock(&srq->mutex); ret = mthca_ARM_SRQ(dev, srq->srqn, attr->srq_limit, &status); + mutex_unlock(&srq->mutex); + if (ret) return ret; if (status) From rdreier at cisco.com Fri Jun 16 15:07:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 15:07:12 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: <1150223140.11881.2.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Tue, 13 Jun 2006 11:25:39 -0700") References: <20060613051149.GE4621@mellanox.co.il> <1150223140.11881.2.camel@hematite.internal.keyresearch.com> Message-ID: Robert, can you confirm that the new uverbs locking scheme helps the performance problems you're having? I'm planning on queueing the patch below for 2.6.18 (which has all fixes rolled up in it): IB/uverbs: Don't serialize with ib_uverbs_idr_mutex Currently, all userspace verbs operations that call into the kernel are serialized by ib_uverbs_idr_mutex. This can be a scalability issue for some workloads, especially for devices driven by the ipath driver, which needs to call into the kernel even for datapath operations. Fix this by adding reference counts to the userspace objects, and then converting ib_uverbs_idr_mutex into a spinlock that only protects the idrs long enough to take a reference on the object being looked up. Because remove operations may fail, we have to do a slightly funky two-step deletion, which is described in the comments at the top of uverbs_cmd.c. This also still leaves ib_uverbs_idr_lock as a single lock that is possibly subject to contention. However, the lock hold time will only be a single idr operation, so multiple threads should still be able to make progress, even if ib_uverbs_idr_lock is being ping-ponged. 
Surprisingly, these changes even shrink the object code: add/remove: 23/5 grow/shrink: 4/21 up/down: 633/-693 (-60) Signed-off-by: Roland Dreier --- diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 3372d67..bb9bee5 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -132,7 +132,7 @@ struct ib_ucq_object { u32 async_events_reported; }; -extern struct mutex ib_uverbs_idr_mutex; +extern spinlock_t ib_uverbs_idr_lock; extern struct idr ib_uverbs_pd_idr; extern struct idr ib_uverbs_mr_idr; extern struct idr ib_uverbs_mw_idr; @@ -141,6 +141,8 @@ extern struct idr ib_uverbs_cq_idr; extern struct idr ib_uverbs_qp_idr; extern struct idr ib_uverbs_srq_idr; +void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj); + struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file, int is_async, int *fd); void ib_uverbs_release_event_file(struct kref *ref); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 403dd81..76bf61e 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -50,7 +50,64 @@ #define INIT_UDATA(udata, ibuf, obuf, il (udata)->outlen = (olen); \ } while (0) -static int idr_add_uobj(struct idr *idr, void *obj, struct ib_uobject *uobj) +/* + * The ib_uobject locking scheme is as follows: + * + * - ib_uverbs_idr_lock protects the uverbs idrs themselves, so it + * needs to be held during all idr operations. When an object is + * looked up, a reference must be taken on the object's kref before + * dropping this lock. + * + * - Each object also has an rwsem. This rwsem must be held for + * reading while an operation that uses the object is performed. + * For example, while registering an MR, the associated PD's + * uobject.mutex must be held for reading. The rwsem must be held + * for writing while initializing or destroying an object. + * + * - In addition, each object has a "live" flag. 
If this flag is not + * set, then lookups of the object will fail even if it is found in + * the idr. This handles a reader that blocks and does not acquire + * the rwsem until after the object is destroyed. The destroy + * operation will set the live flag to 0 and then drop the rwsem; + * this will allow the reader to acquire the rwsem, see that the + * live flag is 0, and then drop the rwsem and its reference to + * object. The underlying storage will not be freed until the last + * reference to the object is dropped. + */ + +static void init_uobj(struct ib_uobject *uobj, u64 user_handle, + struct ib_ucontext *context) +{ + uobj->user_handle = user_handle; + uobj->context = context; + kref_init(&uobj->ref); + init_rwsem(&uobj->mutex); + uobj->live = 0; +} + +static void release_uobj(struct kref *kref) +{ + kfree(container_of(kref, struct ib_uobject, ref)); +} + +static void put_uobj(struct ib_uobject *uobj) +{ + kref_put(&uobj->ref, release_uobj); +} + +static void put_uobj_read(struct ib_uobject *uobj) +{ + up_read(&uobj->mutex); + put_uobj(uobj); +} + +static void put_uobj_write(struct ib_uobject *uobj) +{ + up_write(&uobj->mutex); + put_uobj(uobj); +} + +static int idr_add_uobj(struct idr *idr, struct ib_uobject *uobj) { int ret; @@ -58,7 +115,9 @@ retry: if (!idr_pre_get(idr, GFP_KERNEL)) return -ENOMEM; + spin_lock(&ib_uverbs_idr_lock); ret = idr_get_new(idr, uobj, &uobj->id); + spin_unlock(&ib_uverbs_idr_lock); if (ret == -EAGAIN) goto retry; @@ -66,6 +125,121 @@ retry: return ret; } +void idr_remove_uobj(struct idr *idr, struct ib_uobject *uobj) +{ + spin_lock(&ib_uverbs_idr_lock); + idr_remove(idr, uobj->id); + spin_unlock(&ib_uverbs_idr_lock); +} + +static struct ib_uobject *__idr_get_uobj(struct idr *idr, int id, + struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + spin_lock(&ib_uverbs_idr_lock); + uobj = idr_find(idr, id); + if (uobj) + kref_get(&uobj->ref); + spin_unlock(&ib_uverbs_idr_lock); + + return uobj; +} + +static struct ib_uobject 
*idr_read_uobj(struct idr *idr, int id,
+					struct ib_ucontext *context)
+{
+	struct ib_uobject *uobj;
+
+	uobj = __idr_get_uobj(idr, id, context);
+	if (!uobj)
+		return NULL;
+
+	down_read(&uobj->mutex);
+	if (!uobj->live) {
+		put_uobj_read(uobj);
+		return NULL;
+	}
+
+	return uobj;
+}
+
+static struct ib_uobject *idr_write_uobj(struct idr *idr, int id,
+					 struct ib_ucontext *context)
+{
+	struct ib_uobject *uobj;
+
+	uobj = __idr_get_uobj(idr, id, context);
+	if (!uobj)
+		return NULL;
+
+	down_write(&uobj->mutex);
+	if (!uobj->live) {
+		put_uobj_write(uobj);
+		return NULL;
+	}
+
+	return uobj;
+}
+
+static void *idr_read_obj(struct idr *idr, int id, struct ib_ucontext *context)
+{
+	struct ib_uobject *uobj;
+
+	uobj = idr_read_uobj(idr, id, context);
+	return uobj ? uobj->object : NULL;
+}
+
+static struct ib_pd *idr_read_pd(int pd_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_pd_idr, pd_handle, context);
+}
+
+static void put_pd_read(struct ib_pd *pd)
+{
+	put_uobj_read(pd->uobject);
+}
+
+static struct ib_cq *idr_read_cq(int cq_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_cq_idr, cq_handle, context);
+}
+
+static void put_cq_read(struct ib_cq *cq)
+{
+	put_uobj_read(cq->uobject);
+}
+
+static struct ib_ah *idr_read_ah(int ah_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_ah_idr, ah_handle, context);
+}
+
+static void put_ah_read(struct ib_ah *ah)
+{
+	put_uobj_read(ah->uobject);
+}
+
+static struct ib_qp *idr_read_qp(int qp_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_qp_idr, qp_handle, context);
+}
+
+static void put_qp_read(struct ib_qp *qp)
+{
+	put_uobj_read(qp->uobject);
+}
+
+static struct ib_srq *idr_read_srq(int srq_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_srq_idr, srq_handle, context);
+}
+
+static void put_srq_read(struct ib_srq *srq)
+{
+	put_uobj_read(srq->uobject);
+}
+
 ssize_t ib_uverbs_get_context(struct
ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -296,7 +470,8 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve if (!uobj) return -ENOMEM; - uobj->context = file->ucontext; + init_uobj(uobj, 0, file->ucontext); + down_write(&uobj->mutex); pd = file->device->ib_dev->alloc_pd(file->device->ib_dev, file->ucontext, &udata); @@ -309,11 +484,10 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve pd->uobject = uobj; atomic_set(&pd->usecnt, 0); - mutex_lock(&ib_uverbs_idr_mutex); - - ret = idr_add_uobj(&ib_uverbs_pd_idr, pd, uobj); + uobj->object = pd; + ret = idr_add_uobj(&ib_uverbs_pd_idr, uobj); if (ret) - goto err_up; + goto err_idr; memset(&resp, 0, sizeof resp); resp.pd_handle = uobj->id; @@ -321,26 +495,27 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); list_add_tail(&uobj->list, &file->ucontext->pd_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + uobj->live = 1; + + up_write(&uobj->mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_pd_idr, uobj->id); +err_copy: + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_idr: ib_dealloc_pd(pd); err: - kfree(uobj); + put_uobj_write(uobj); return ret; } @@ -349,37 +524,34 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_u int in_len, int out_len) { struct ib_uverbs_dealloc_pd cmd; - struct ib_pd *pd; struct ib_uobject *uobj; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_pd_idr, cmd.pd_handle, file->ucontext); + if (!uobj) + return -EINVAL; - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) - goto out; + ret = ib_dealloc_pd(uobj->object); + if (!ret) + uobj->live = 0; - uobj = pd->uobject; + put_uobj_write(uobj); - ret = 
ib_dealloc_pd(pd); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_pd_idr, cmd.pd_handle); + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); mutex_lock(&file->mutex); list_del(&uobj->list); mutex_unlock(&file->mutex); - kfree(uobj); + put_uobj(uobj); -out: - mutex_unlock(&ib_uverbs_idr_mutex); - - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, @@ -419,7 +591,8 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb if (!obj) return -ENOMEM; - obj->uobject.context = file->ucontext; + init_uobj(&obj->uobject, 0, file->ucontext); + down_write(&obj->uobject.mutex); /* * We ask for writable memory if any access flags other than @@ -436,23 +609,14 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb obj->umem.virt_base = cmd.hca_va; - mutex_lock(&ib_uverbs_idr_mutex); - - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) { - ret = -EINVAL; - goto err_up; - } - - if (!pd->device->reg_user_mr) { - ret = -ENOSYS; - goto err_up; - } + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) + goto err_release; mr = pd->device->reg_user_mr(pd, &obj->umem, cmd.access_flags, &udata); if (IS_ERR(mr)) { ret = PTR_ERR(mr); - goto err_up; + goto err_put; } mr->device = pd->device; @@ -461,43 +625,48 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb atomic_inc(&pd->usecnt); atomic_set(&mr->usecnt, 0); - memset(&resp, 0, sizeof resp); - resp.lkey = mr->lkey; - resp.rkey = mr->rkey; - - ret = idr_add_uobj(&ib_uverbs_mr_idr, mr, &obj->uobject); + obj->uobject.object = mr; + ret = idr_add_uobj(&ib_uverbs_mr_idr, &obj->uobject); if (ret) goto err_unreg; + memset(&resp, 0, sizeof resp); + resp.lkey = mr->lkey; + resp.rkey = mr->rkey; resp.mr_handle = obj->uobject.id; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); list_add_tail(&obj->uobject.list, 
&file->ucontext->mr_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_mr_idr, obj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_mr_idr, &obj->uobject); err_unreg: ib_dereg_mr(mr); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_put: + put_pd_read(pd); +err_release: ib_umem_release(file->device->ib_dev, &obj->umem); err_free: - kfree(obj); + put_uobj_write(&obj->uobject); return ret; } @@ -507,37 +676,40 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uve { struct ib_uverbs_dereg_mr cmd; struct ib_mr *mr; + struct ib_uobject *uobj; struct ib_umem_object *memobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - mr = idr_find(&ib_uverbs_mr_idr, cmd.mr_handle); - if (!mr || mr->uobject->context != file->ucontext) - goto out; + uobj = idr_write_uobj(&ib_uverbs_mr_idr, cmd.mr_handle, file->ucontext); + if (!uobj) + return -EINVAL; - memobj = container_of(mr->uobject, struct ib_umem_object, uobject); + memobj = container_of(uobj, struct ib_umem_object, uobject); + mr = uobj->object; ret = ib_dereg_mr(mr); + if (!ret) + uobj->live = 0; + + put_uobj_write(uobj); + if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_mr_idr, cmd.mr_handle); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); mutex_lock(&file->mutex); - list_del(&memobj->uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); ib_umem_release(file->device->ib_dev, &memobj->umem); - kfree(memobj); -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_uobj(uobj); - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file, @@ -576,7 +748,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uv struct ib_uverbs_create_cq cmd; struct ib_uverbs_create_cq_resp resp; struct ib_udata udata; - struct ib_ucq_object *uobj; + struct ib_ucq_object *obj; struct ib_uverbs_event_file *ev_file = NULL; struct ib_cq *cq; int ret; @@ -594,10 +766,13 @@ ssize_t ib_uverbs_create_cq(struct ib_uv if (cmd.comp_vector >= file->device->num_comp_vectors) return -EINVAL; - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; + init_uobj(&obj->uobject, cmd.user_handle, file->ucontext); + down_write(&obj->uobject.mutex); + if (cmd.comp_channel >= 0) { ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); if (!ev_file) { @@ -606,63 +781,64 @@ ssize_t ib_uverbs_create_cq(struct ib_uv } } - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->uverbs_file = file; - uobj->comp_events_reported = 0; - uobj->async_events_reported = 0; - INIT_LIST_HEAD(&uobj->comp_list); - INIT_LIST_HEAD(&uobj->async_list); + obj->uverbs_file = file; + obj->comp_events_reported = 0; + obj->async_events_reported = 0; + INIT_LIST_HEAD(&obj->comp_list); + INIT_LIST_HEAD(&obj->async_list); cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe, file->ucontext, &udata); if (IS_ERR(cq)) { ret = PTR_ERR(cq); - goto err; + goto err_file; } cq->device = file->device->ib_dev; - cq->uobject = &uobj->uobject; + cq->uobject = &obj->uobject; cq->comp_handler = ib_uverbs_comp_handler; cq->event_handler = ib_uverbs_cq_event_handler; cq->cq_context = ev_file; atomic_set(&cq->usecnt, 0); - mutex_lock(&ib_uverbs_idr_mutex); - - ret = idr_add_uobj(&ib_uverbs_cq_idr, cq, &uobj->uobject); + obj->uobject.object = cq; + ret = idr_add_uobj(&ib_uverbs_cq_idr, &obj->uobject); if (ret) - goto err_up; + goto err_free; memset(&resp, 0, sizeof resp); - 
resp.cq_handle = uobj->uobject.id; + resp.cq_handle = obj->uobject.id; resp.cqe = cq->cqe; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->cq_list); + list_add_tail(&obj->uobject.list, &file->ucontext->cq_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_cq_idr, uobj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_cq_idr, &obj->uobject); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); + +err_free: ib_destroy_cq(cq); -err: +err_file: if (ev_file) - ib_uverbs_release_ucq(file, ev_file, uobj); - kfree(uobj); + ib_uverbs_release_ucq(file, ev_file, obj); + +err: + put_uobj_write(&obj->uobject); return ret; } @@ -683,11 +859,9 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - mutex_lock(&ib_uverbs_idr_mutex); - - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext || !cq->device->resize_cq) - goto out; + cq = idr_read_cq(cmd.cq_handle, file->ucontext); + if (!cq) + return -EINVAL; ret = cq->device->resize_cq(cq, cmd.cqe, &udata); if (ret) @@ -701,7 +875,7 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_cq_read(cq); return ret ? 
ret : in_len; } @@ -712,6 +886,7 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver { struct ib_uverbs_poll_cq cmd; struct ib_uverbs_poll_cq_resp *resp; + struct ib_uobject *uobj; struct ib_cq *cq; struct ib_wc *wc; int ret = 0; @@ -732,15 +907,17 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver goto out_wc; } - mutex_lock(&ib_uverbs_idr_mutex); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext) { + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) { ret = -EINVAL; goto out; } + cq = uobj->object; resp->count = ib_poll_cq(cq, cmd.ne, wc); + put_uobj_read(uobj); + for (i = 0; i < resp->count; i++) { resp->wc[i].wr_id = wc[i].wr_id; resp->wc[i].status = wc[i].status; @@ -762,7 +939,6 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); kfree(resp); out_wc: @@ -775,22 +951,23 @@ ssize_t ib_uverbs_req_notify_cq(struct i int out_len) { struct ib_uverbs_req_notify_cq cmd; + struct ib_uobject *uobj; struct ib_cq *cq; - int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (cq && cq->uobject->context == file->ucontext) { - ib_req_notify_cq(cq, cmd.solicited_only ? - IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); - ret = in_len; - } - mutex_unlock(&ib_uverbs_idr_mutex); + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; - return ret; + ib_req_notify_cq(cq, cmd.solicited_only ? 
+ IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + + put_uobj_read(uobj); + + return in_len; } ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file, @@ -799,52 +976,50 @@ ssize_t ib_uverbs_destroy_cq(struct ib_u { struct ib_uverbs_destroy_cq cmd; struct ib_uverbs_destroy_cq_resp resp; + struct ib_uobject *uobj; struct ib_cq *cq; - struct ib_ucq_object *uobj; + struct ib_ucq_object *obj; struct ib_uverbs_event_file *ev_file; - u64 user_handle; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - memset(&resp, 0, sizeof resp); - - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; + ev_file = cq->cq_context; + obj = container_of(cq->uobject, struct ib_ucq_object, uobject); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_cq(cq); + if (!ret) + uobj->live = 0; - user_handle = cq->uobject->user_handle; - uobj = container_of(cq->uobject, struct ib_ucq_object, uobject); - ev_file = cq->cq_context; + put_uobj_write(uobj); - ret = ib_destroy_cq(cq); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_cq_idr, cmd.cq_handle); + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_uverbs_release_ucq(file, ev_file, uobj); + ib_uverbs_release_ucq(file, ev_file, obj); - resp.comp_events_reported = uobj->comp_events_reported; - resp.async_events_reported = uobj->async_events_reported; + memset(&resp, 0, sizeof resp); + resp.comp_events_reported = obj->comp_events_reported; + resp.async_events_reported = obj->async_events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + return -EFAULT; - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file, @@ -854,7 +1029,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uverbs_create_qp cmd; struct ib_uverbs_create_qp_resp resp; struct ib_udata udata; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_pd *pd; struct ib_cq *scq, *rcq; struct ib_srq *srq; @@ -872,23 +1047,21 @@ ssize_t ib_uverbs_create_qp(struct ib_uv (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); + init_uobj(&obj->uevent.uobject, cmd.user_handle, file->ucontext); + down_write(&obj->uevent.uobject.mutex); - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - scq = idr_find(&ib_uverbs_cq_idr, cmd.send_cq_handle); - rcq = idr_find(&ib_uverbs_cq_idr, cmd.recv_cq_handle); - srq = cmd.is_srq ? idr_find(&ib_uverbs_srq_idr, cmd.srq_handle) : NULL; + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + scq = idr_read_cq(cmd.send_cq_handle, file->ucontext); + rcq = idr_read_cq(cmd.recv_cq_handle, file->ucontext); + srq = cmd.is_srq ? 
idr_read_srq(cmd.srq_handle, file->ucontext) : NULL; - if (!pd || pd->uobject->context != file->ucontext || - !scq || scq->uobject->context != file->ucontext || - !rcq || rcq->uobject->context != file->ucontext || - (cmd.is_srq && (!srq || srq->uobject->context != file->ucontext))) { + if (!pd || !scq || !rcq || (cmd.is_srq && !srq)) { ret = -EINVAL; - goto err_up; + goto err_put; } attr.event_handler = ib_uverbs_qp_event_handler; @@ -905,16 +1078,14 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.cap.max_recv_sge = cmd.max_recv_sge; attr.cap.max_inline_data = cmd.max_inline_data; - uobj->uevent.uobject.user_handle = cmd.user_handle; - uobj->uevent.uobject.context = file->ucontext; - uobj->uevent.events_reported = 0; - INIT_LIST_HEAD(&uobj->uevent.event_list); - INIT_LIST_HEAD(&uobj->mcast_list); + obj->uevent.events_reported = 0; + INIT_LIST_HEAD(&obj->uevent.event_list); + INIT_LIST_HEAD(&obj->mcast_list); qp = pd->device->create_qp(pd, &attr, &udata); if (IS_ERR(qp)) { ret = PTR_ERR(qp); - goto err_up; + goto err_put; } qp->device = pd->device; @@ -922,7 +1093,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv qp->send_cq = attr.send_cq; qp->recv_cq = attr.recv_cq; qp->srq = attr.srq; - qp->uobject = &uobj->uevent.uobject; + qp->uobject = &obj->uevent.uobject; qp->event_handler = attr.event_handler; qp->qp_context = attr.qp_context; qp->qp_type = attr.qp_type; @@ -932,14 +1103,14 @@ ssize_t ib_uverbs_create_qp(struct ib_uv if (attr.srq) atomic_inc(&attr.srq->usecnt); - memset(&resp, 0, sizeof resp); - resp.qpn = qp->qp_num; - - ret = idr_add_uobj(&ib_uverbs_qp_idr, qp, &uobj->uevent.uobject); + obj->uevent.uobject.object = qp; + ret = idr_add_uobj(&ib_uverbs_qp_idr, &obj->uevent.uobject); if (ret) goto err_destroy; - resp.qp_handle = uobj->uevent.uobject.id; + memset(&resp, 0, sizeof resp); + resp.qpn = qp->qp_num; + resp.qp_handle = obj->uevent.uobject.id; resp.max_recv_sge = attr.cap.max_recv_sge; resp.max_send_sge = attr.cap.max_send_sge; resp.max_recv_wr = 
attr.cap.max_recv_wr; @@ -949,27 +1120,42 @@ ssize_t ib_uverbs_create_qp(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + put_cq_read(scq); + put_cq_read(rcq); + if (srq) + put_srq_read(srq); + mutex_lock(&file->mutex); - list_add_tail(&uobj->uevent.uobject.list, &file->ucontext->qp_list); + list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uevent.uobject.live = 1; + + up_write(&obj->uevent.uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_qp_idr, uobj->uevent.uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_qp_idr, &obj->uevent.uobject); err_destroy: ib_destroy_qp(qp); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err_put: + if (pd) + put_pd_read(pd); + if (scq) + put_cq_read(scq); + if (rcq) + put_cq_read(rcq); + if (srq) + put_srq_read(srq); + + put_uobj_write(&obj->uevent.uobject); return ret; } @@ -994,15 +1180,15 @@ ssize_t ib_uverbs_query_qp(struct ib_uve goto out; } - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_query_qp(qp, attr, cmd.attr_mask, init_attr); - else + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) { ret = -EINVAL; + goto out; + } + + ret = ib_query_qp(qp, attr, cmd.attr_mask, init_attr); - mutex_unlock(&ib_uverbs_idr_mutex); + put_qp_read(qp); if (ret) goto out; @@ -1089,10 +1275,8 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv if (!attr) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) { + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) { ret = -EINVAL; goto out; } @@ -1144,13 +1328,15 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv attr->alt_ah_attr.port_num = 
cmd.alt_dest.port_num; ret = ib_modify_qp(qp, attr, cmd.attr_mask); + + put_qp_read(qp); + if (ret) goto out; ret = in_len; out: - mutex_unlock(&ib_uverbs_idr_mutex); kfree(attr); return ret; @@ -1162,8 +1348,9 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u { struct ib_uverbs_destroy_qp cmd; struct ib_uverbs_destroy_qp_resp resp; + struct ib_uobject *uobj; struct ib_qp *qp; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1171,43 +1358,43 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u memset(&resp, 0, sizeof resp); - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; - - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + uobj = idr_write_uobj(&ib_uverbs_qp_idr, cmd.qp_handle, file->ucontext); + if (!uobj) + return -EINVAL; + qp = uobj->object; + obj = container_of(uobj, struct ib_uqp_object, uevent.uobject); - if (!list_empty(&uobj->mcast_list)) { - ret = -EBUSY; - goto out; + if (!list_empty(&obj->mcast_list)) { + put_uobj_write(uobj); + return -EBUSY; } ret = ib_destroy_qp(qp); + if (!ret) + uobj->live = 0; + + put_uobj_write(uobj); + if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle); + idr_remove_uobj(&ib_uverbs_qp_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uevent.uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_uverbs_release_uevent(file, &uobj->uevent); + ib_uverbs_release_uevent(file, &obj->uevent); - resp.events_reported = uobj->uevent.events_reported; + resp.events_reported = obj->uevent.events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + return -EFAULT; - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file, @@ -1220,6 +1407,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv struct ib_send_wr *wr = NULL, *last, *next, *bad_wr; struct ib_qp *qp; int i, sg_ind; + int is_ud; ssize_t ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1236,12 +1424,11 @@ ssize_t ib_uverbs_post_send(struct ib_uv if (!user_wr) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) goto out; + is_ud = qp->qp_type == IB_QPT_UD; sg_ind = 0; last = NULL; for (i = 0; i < cmd.wr_count; ++i) { @@ -1249,12 +1436,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv buf + sizeof cmd + i * cmd.wqe_size, cmd.wqe_size)) { ret = -EFAULT; - goto out; + goto out_put; } if (user_wr->num_sge + sg_ind > cmd.sge_count) { ret = -EINVAL; - goto out; + goto out_put; } next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + @@ -1262,7 +1449,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv GFP_KERNEL); if (!next) { ret = -ENOMEM; - goto out; + goto out_put; } if (!last) @@ -1278,12 +1465,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv next->send_flags = user_wr->send_flags; next->imm_data = (__be32 __force) user_wr->imm_data; - if (qp->qp_type == IB_QPT_UD) { - next->wr.ud.ah = idr_find(&ib_uverbs_ah_idr, - user_wr->wr.ud.ah); + if (is_ud) { + next->wr.ud.ah = idr_read_ah(user_wr->wr.ud.ah, + file->ucontext); if (!next->wr.ud.ah) { ret = -EINVAL; - goto out; + goto out_put; } next->wr.ud.remote_qpn = user_wr->wr.ud.remote_qpn; next->wr.ud.remote_qkey = user_wr->wr.ud.remote_qkey; @@ -1320,7 +1507,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv sg_ind * sizeof (struct ib_sge), next->num_sge * sizeof (struct ib_sge))) { ret = -EFAULT; - goto out; + goto out_put; } sg_ind += next->num_sge; } else @@ -1340,10 +1527,13 @@ ssize_t ib_uverbs_post_send(struct ib_uv &resp, 
sizeof resp)) ret = -EFAULT; -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); +out: while (wr) { + if (is_ud && wr->wr.ud.ah) + put_ah_read(wr->wr.ud.ah); next = wr->next; kfree(wr); wr = next; @@ -1458,14 +1648,15 @@ ssize_t ib_uverbs_post_recv(struct ib_uv if (IS_ERR(wr)) return PTR_ERR(wr); - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) goto out; resp.bad_wr = 0; ret = qp->device->post_recv(qp, wr, &bad_wr); + + put_qp_read(qp); + if (ret) for (next = wr; next; next = next->next) { ++resp.bad_wr; @@ -1479,8 +1670,6 @@ ssize_t ib_uverbs_post_recv(struct ib_uv ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); - while (wr) { next = wr->next; kfree(wr); @@ -1509,14 +1698,15 @@ ssize_t ib_uverbs_post_srq_recv(struct i if (IS_ERR(wr)) return PTR_ERR(wr); - mutex_lock(&ib_uverbs_idr_mutex); - - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) goto out; resp.bad_wr = 0; ret = srq->device->post_srq_recv(srq, wr, &bad_wr); + + put_srq_read(srq); + if (ret) for (next = wr; next; next = next->next) { ++resp.bad_wr; @@ -1530,8 +1720,6 @@ ssize_t ib_uverbs_post_srq_recv(struct i ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); - while (wr) { next = wr->next; kfree(wr); @@ -1563,17 +1751,15 @@ ssize_t ib_uverbs_create_ah(struct ib_uv if (!uobj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); + init_uobj(uobj, cmd.user_handle, file->ucontext); + down_write(&uobj->mutex); - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) { + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) { ret = -EINVAL; - goto err_up; + goto err; } - uobj->user_handle = cmd.user_handle; - uobj->context = file->ucontext; - 
attr.dlid = cmd.attr.dlid; attr.sl = cmd.attr.sl; attr.src_path_bits = cmd.attr.src_path_bits; @@ -1589,12 +1775,13 @@ ssize_t ib_uverbs_create_ah(struct ib_uv ah = ib_create_ah(pd, &attr); if (IS_ERR(ah)) { ret = PTR_ERR(ah); - goto err_up; + goto err; } - ah->uobject = uobj; + ah->uobject = uobj; + uobj->object = ah; - ret = idr_add_uobj(&ib_uverbs_ah_idr, ah, uobj); + ret = idr_add_uobj(&ib_uverbs_ah_idr, uobj); if (ret) goto err_destroy; @@ -1603,27 +1790,29 @@ ssize_t ib_uverbs_create_ah(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); list_add_tail(&uobj->list, &file->ucontext->ah_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + uobj->live = 1; + + up_write(&uobj->mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_ah_idr, uobj->id); +err_copy: + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); err_destroy: ib_destroy_ah(ah); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err: + put_uobj_write(uobj); return ret; } @@ -1633,35 +1822,34 @@ ssize_t ib_uverbs_destroy_ah(struct ib_u struct ib_uverbs_destroy_ah cmd; struct ib_ah *ah; struct ib_uobject *uobj; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_ah_idr, cmd.ah_handle, file->ucontext); + if (!uobj) + return -EINVAL; + ah = uobj->object; - ah = idr_find(&ib_uverbs_ah_idr, cmd.ah_handle); - if (!ah || ah->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_ah(ah); + if (!ret) + uobj->live = 0; - uobj = ah->uobject; + put_uobj_write(uobj); - ret = ib_destroy_ah(ah); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_ah_idr, cmd.ah_handle); + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); mutex_lock(&file->mutex); list_del(&uobj->list); mutex_unlock(&file->mutex); - kfree(uobj); + 
put_uobj(uobj); -out: - mutex_unlock(&ib_uverbs_idr_mutex); - - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_attach_mcast(struct ib_uverbs_file *file, @@ -1670,47 +1858,43 @@ ssize_t ib_uverbs_attach_mcast(struct ib { struct ib_uverbs_attach_mcast cmd; struct ib_qp *qp; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_uverbs_mcast_entry *mcast; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) + return -EINVAL; - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + obj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); - list_for_each_entry(mcast, &uobj->mcast_list, list) + list_for_each_entry(mcast, &obj->mcast_list, list) if (cmd.mlid == mcast->lid && !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { ret = 0; - goto out; + goto out_put; } mcast = kmalloc(sizeof *mcast, GFP_KERNEL); if (!mcast) { ret = -ENOMEM; - goto out; + goto out_put; } mcast->lid = cmd.mlid; memcpy(mcast->gid.raw, cmd.gid, sizeof mcast->gid.raw); ret = ib_attach_mcast(qp, &mcast->gid, cmd.mlid); - if (!ret) { - uobj = container_of(qp->uobject, struct ib_uqp_object, - uevent.uobject); - list_add_tail(&mcast->list, &uobj->mcast_list); - } else + if (!ret) + list_add_tail(&mcast->list, &obj->mcast_list); + else kfree(mcast); -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); return ret ? 
ret : in_len; } @@ -1720,7 +1904,7 @@ ssize_t ib_uverbs_detach_mcast(struct ib int out_len) { struct ib_uverbs_detach_mcast cmd; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_qp *qp; struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; @@ -1728,19 +1912,17 @@ ssize_t ib_uverbs_detach_mcast(struct ib if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) + return -EINVAL; ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); if (ret) - goto out; + goto out_put; - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + obj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); - list_for_each_entry(mcast, &uobj->mcast_list, list) + list_for_each_entry(mcast, &obj->mcast_list, list) if (cmd.mlid == mcast->lid && !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { list_del(&mcast->list); @@ -1748,8 +1930,8 @@ ssize_t ib_uverbs_detach_mcast(struct ib break; } -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); return ret ? 
ret : in_len; } @@ -1761,7 +1943,7 @@ ssize_t ib_uverbs_create_srq(struct ib_u struct ib_uverbs_create_srq cmd; struct ib_uverbs_create_srq_resp resp; struct ib_udata udata; - struct ib_uevent_object *uobj; + struct ib_uevent_object *obj; struct ib_pd *pd; struct ib_srq *srq; struct ib_srq_init_attr attr; @@ -1777,17 +1959,17 @@ ssize_t ib_uverbs_create_srq(struct ib_u (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + init_uobj(&obj->uobject, 0, file->ucontext); + down_write(&obj->uobject.mutex); - if (!pd || pd->uobject->context != file->ucontext) { + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) { ret = -EINVAL; - goto err_up; + goto err; } attr.event_handler = ib_uverbs_srq_event_handler; @@ -1796,59 +1978,59 @@ ssize_t ib_uverbs_create_srq(struct ib_u attr.attr.max_sge = cmd.max_sge; attr.attr.srq_limit = cmd.srq_limit; - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->events_reported = 0; - INIT_LIST_HEAD(&uobj->event_list); + obj->events_reported = 0; + INIT_LIST_HEAD(&obj->event_list); srq = pd->device->create_srq(pd, &attr, &udata); if (IS_ERR(srq)) { ret = PTR_ERR(srq); - goto err_up; + goto err; } srq->device = pd->device; srq->pd = pd; - srq->uobject = &uobj->uobject; + srq->uobject = &obj->uobject; srq->event_handler = attr.event_handler; srq->srq_context = attr.srq_context; atomic_inc(&pd->usecnt); atomic_set(&srq->usecnt, 0); - memset(&resp, 0, sizeof resp); - - ret = idr_add_uobj(&ib_uverbs_srq_idr, srq, &uobj->uobject); + obj->uobject.object = srq; + ret = idr_add_uobj(&ib_uverbs_srq_idr, &obj->uobject); if (ret) goto err_destroy; - resp.srq_handle = uobj->uobject.id; + memset(&resp, 0, sizeof resp); + resp.srq_handle = obj->uobject.id; 
resp.max_wr = attr.attr.max_wr; resp.max_sge = attr.attr.max_sge; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->srq_list); + list_add_tail(&obj->uobject.list, &file->ucontext->srq_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_srq_idr, uobj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_srq_idr, &obj->uobject); err_destroy: ib_destroy_srq(srq); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err: + put_uobj_write(&obj->uobject); return ret; } @@ -1864,21 +2046,16 @@ ssize_t ib_uverbs_modify_srq(struct ib_u if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) { - ret = -EINVAL; - goto out; - } + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) + return -EINVAL; attr.max_wr = cmd.max_wr; attr.srq_limit = cmd.srq_limit; ret = ib_modify_srq(srq, &attr, cmd.attr_mask); -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_srq_read(srq); return ret ? 
ret : in_len; } @@ -1899,18 +2076,16 @@ ssize_t ib_uverbs_query_srq(struct ib_uv if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) + return -EINVAL; - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (srq && srq->uobject->context == file->ucontext) - ret = ib_query_srq(srq, &attr); - else - ret = -EINVAL; + ret = ib_query_srq(srq, &attr); - mutex_unlock(&ib_uverbs_idr_mutex); + put_srq_read(srq); if (ret) - goto out; + return ret; memset(&resp, 0, sizeof resp); @@ -1920,10 +2095,9 @@ ssize_t ib_uverbs_query_srq(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; + return -EFAULT; -out: - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file, @@ -1932,45 +2106,45 @@ ssize_t ib_uverbs_destroy_srq(struct ib_ { struct ib_uverbs_destroy_srq cmd; struct ib_uverbs_destroy_srq_resp resp; + struct ib_uobject *uobj; struct ib_srq *srq; - struct ib_uevent_object *uobj; + struct ib_uevent_object *obj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - memset(&resp, 0, sizeof resp); + uobj = idr_write_uobj(&ib_uverbs_srq_idr, cmd.srq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + srq = uobj->object; + obj = container_of(uobj, struct ib_uevent_object, uobject); - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_srq(srq); + if (!ret) + uobj->live = 0; - uobj = container_of(srq->uobject, struct ib_uevent_object, uobject); + put_uobj_write(uobj); - ret = ib_destroy_srq(srq); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_srq_idr, cmd.srq_handle); + idr_remove_uobj(&ib_uverbs_srq_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->list); 
mutex_unlock(&file->mutex); - ib_uverbs_release_uevent(file, uobj); + ib_uverbs_release_uevent(file, obj); - resp.events_reported = uobj->events_reported; + memset(&resp, 0, sizeof resp); + resp.events_reported = obj->events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) ret = -EFAULT; -out: - mutex_unlock(&ib_uverbs_idr_mutex); - return ret ? ret : in_len; } diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index ff092a0..5ec2d49 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -66,7 +66,7 @@ #define IB_UVERBS_BASE_DEV MKDEV(IB_UVER static struct class *uverbs_class; -DEFINE_MUTEX(ib_uverbs_idr_mutex); +DEFINE_SPINLOCK(ib_uverbs_idr_lock); DEFINE_IDR(ib_uverbs_pd_idr); DEFINE_IDR(ib_uverbs_mr_idr); DEFINE_IDR(ib_uverbs_mw_idr); @@ -183,21 +183,21 @@ static int ib_uverbs_cleanup_ucontext(st if (!context) return 0; - mutex_lock(&ib_uverbs_idr_mutex); - list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) { - struct ib_ah *ah = idr_find(&ib_uverbs_ah_idr, uobj->id); - idr_remove(&ib_uverbs_ah_idr, uobj->id); + struct ib_ah *ah = uobj->object; + + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); ib_destroy_ah(ah); list_del(&uobj->list); kfree(uobj); } list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { - struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); + struct ib_qp *qp = uobj->object; struct ib_uqp_object *uqp = container_of(uobj, struct ib_uqp_object, uevent.uobject); - idr_remove(&ib_uverbs_qp_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_qp_idr, uobj); ib_uverbs_detach_umcast(qp, uqp); ib_destroy_qp(qp); list_del(&uobj->list); @@ -206,11 +206,12 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { - struct ib_cq *cq = idr_find(&ib_uverbs_cq_idr, uobj->id); + struct ib_cq *cq = uobj->object; struct ib_uverbs_event_file 
*ev_file = cq->cq_context; struct ib_ucq_object *ucq = container_of(uobj, struct ib_ucq_object, uobject); - idr_remove(&ib_uverbs_cq_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); ib_destroy_cq(cq); list_del(&uobj->list); ib_uverbs_release_ucq(file, ev_file, ucq); @@ -218,10 +219,11 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->srq_list, list) { - struct ib_srq *srq = idr_find(&ib_uverbs_srq_idr, uobj->id); + struct ib_srq *srq = uobj->object; struct ib_uevent_object *uevent = container_of(uobj, struct ib_uevent_object, uobject); - idr_remove(&ib_uverbs_srq_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_srq_idr, uobj); ib_destroy_srq(srq); list_del(&uobj->list); ib_uverbs_release_uevent(file, uevent); @@ -231,11 +233,11 @@ static int ib_uverbs_cleanup_ucontext(st /* XXX Free MWs */ list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { - struct ib_mr *mr = idr_find(&ib_uverbs_mr_idr, uobj->id); + struct ib_mr *mr = uobj->object; struct ib_device *mrdev = mr->device; struct ib_umem_object *memobj; - idr_remove(&ib_uverbs_mr_idr, uobj->id); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); ib_dereg_mr(mr); memobj = container_of(uobj, struct ib_umem_object, uobject); @@ -246,15 +248,14 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { - struct ib_pd *pd = idr_find(&ib_uverbs_pd_idr, uobj->id); - idr_remove(&ib_uverbs_pd_idr, uobj->id); + struct ib_pd *pd = uobj->object; + + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); ib_dealloc_pd(pd); list_del(&uobj->list); kfree(uobj); } - mutex_unlock(&ib_uverbs_idr_mutex); - return context->device->dealloc_ucontext(context); } diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 7ced208..ee1f3a3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -697,8 +697,12 @@ struct ib_ucontext { struct ib_uobject { u64 user_handle; /* handle given to us by userspace */ struct ib_ucontext 
*context; /* associated user context */ + void *object; /* containing object */ struct list_head list; /* link to context's list */ u32 id; /* index into kernel idr */ + struct kref ref; + struct rw_semaphore mutex; /* protects .live */ + int live; }; struct ib_umem { From ralphc at pathscale.com Fri Jun 16 15:48:53 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 15:48:53 -0700 Subject: [openib-general] [PATCH] update libipathverbs library to the new initialization method Message-ID: <1150498133.32252.111.camel@brick.pathscale.com> The current libipathverbs driver in the trunk doesn't conform to the new module initialization convention for libibverbs.so. This patch corrects that. Also, with this patch, we can now try testing the performance of Roland's changes to eliminate the single ib_uverbs_idr_mutex. Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8089) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -145,30 +145,24 @@ .free_context = ipath_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { - struct sysfs_device *pcidev; - struct sysfs_attribute *attr; + char value[8]; struct ipath_device *dev; - unsigned vendor, device; - int i; + unsigned vendor, device; + int i; - pcidev = sysfs_get_classdev_device(sysdev); - if (!pcidev) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; + sscanf(value, "%i", &vendor); - attr = sysfs_get_device_attr(pcidev, "vendor"); - if (!attr) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) return NULL; - sscanf(attr->value, "%i", &vendor); - sysfs_close_attribute(attr); + sscanf(value, "%i", &device); - attr = 
sysfs_get_device_attr(pcidev, "device"); - if (!attr) - return NULL; - sscanf(attr->value, "%i", &device); - sysfs_close_attribute(attr); - for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) if (vendor == hca_table[i].vendor && device == hca_table[i].device) @@ -180,13 +174,12 @@ dev = malloc(sizeof *dev); if (!dev) { fprintf(stderr, PFX "Fatal: couldn't allocate device for %s\n", - sysdev->name); - abort(); + uverbs_sys_path); + return NULL; } dev->ibv_dev.ops = ipath_dev_ops; dev->hca_type = hca_table[i].type; - dev->page_size = sysconf(_SC_PAGESIZE); return &dev->ibv_dev; } Index: libipathverbs/src/ipathverbs.h =================================================================== --- libipathverbs/src/ipathverbs.h (revision 8089) +++ libipathverbs/src/ipathverbs.h (working copy) @@ -57,7 +57,6 @@ struct ipath_device { struct ibv_device ibv_dev; enum ipath_hca_type hca_type; - int page_size; }; struct ipath_context { -- Ralph Campbell From rjwalsh at pathscale.com Fri Jun 16 15:51:22 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Fri, 16 Jun 2006 15:51:22 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: <20060613051149.GE4621@mellanox.co.il> <1150223140.11881.2.camel@hematite.internal.keyresearch.com> Message-ID: <1150498282.13304.0.camel@hematite.internal.keyresearch.com> On Fri, 2006-06-16 at 15:07 -0700, Roland Dreier wrote: > Robert, can you confirm that the new uverbs locking scheme helps the > performance problems you're having? Sure - I'll take a look on Monday. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Fri Jun 16 16:12:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 16:12:59 -0700 Subject: [openib-general] [PATCH] update libipathverbs library to the new initialization method In-Reply-To: <1150498133.32252.111.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 16 Jun 2006 15:48:53 -0700") References: <1150498133.32252.111.camel@brick.pathscale.com> Message-ID: > The current libipathverbs driver in the trunk doesn't > conform to the new module initialization convention for > libibverbs.so. This patch corrects that. Looks OK but you're now only compatible with unreleased development versions of libibverbs -- this won't work against the stable libibverbs 1.0 code shipped with Fedora and Debian for example. You might want to follow the approach libmthca uses to build against both libibverbs 1.0 and also pre-1.1 development code. > Also, with this patch, we can now try testing the performance > of Roland's changes to eliminate the single ib_uverbs_idr_mutex. Glad you're going to test, but why do you need this patch? Couldn't you just have put a new kernel onto a system with libibverbs 1.0 and the old libipathverbs? - R. From ralphc at pathscale.com Fri Jun 16 16:15:41 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 16:15:41 -0700 Subject: [openib-general] [PATCH] resend: update libipathverbs library to the new initialization method Message-ID: <1150499741.32252.119.camel@brick.pathscale.com> The patch I just sent left out a minor change so please ignore the previous patch and apply this one instead. 
(I forgot to include the change to the map file) Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8089) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -145,30 +145,24 @@ .free_context = ipath_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { - struct sysfs_device *pcidev; - struct sysfs_attribute *attr; + char value[8]; struct ipath_device *dev; - unsigned vendor, device; - int i; + unsigned vendor, device; + int i; - pcidev = sysfs_get_classdev_device(sysdev); - if (!pcidev) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; + sscanf(value, "%i", &vendor); - attr = sysfs_get_device_attr(pcidev, "vendor"); - if (!attr) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) return NULL; - sscanf(attr->value, "%i", &vendor); - sysfs_close_attribute(attr); + sscanf(value, "%i", &device); - attr = sysfs_get_device_attr(pcidev, "device"); - if (!attr) - return NULL; - sscanf(attr->value, "%i", &device); - sysfs_close_attribute(attr); - for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) if (vendor == hca_table[i].vendor && device == hca_table[i].device) @@ -180,13 +174,12 @@ dev = malloc(sizeof *dev); if (!dev) { fprintf(stderr, PFX "Fatal: couldn't allocate device for %s\n", - sysdev->name); - abort(); + uverbs_sys_path); + return NULL; } dev->ibv_dev.ops = ipath_dev_ops; dev->hca_type = hca_table[i].type; - dev->page_size = sysconf(_SC_PAGESIZE); return &dev->ibv_dev; } Index: src/usrspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/usrspace/libipathverbs/src/ipathverbs.h (revision 8089) +++ 
src/usrspace/libipathverbs/src/ipathverbs.h (working copy) @@ -57,7 +57,6 @@ struct ipath_device { struct ibv_device ibv_dev; enum ipath_hca_type hca_type; - int page_size; }; struct ipath_context { Index: src/userspace/libipathverbs/src/ipathverbs.map =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.map (revision 8089) +++ src/userspace/libipathverbs/src/ipathverbs.map (working copy) @@ -1,4 +1,4 @@ { - global: openib_driver_init; + global: ibv_driver_init; local: *; }; -- Ralph Campbell From rdreier at cisco.com Fri Jun 16 16:26:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 16:26:20 -0700 Subject: [openib-general] [PATCH] resend: update libipathverbs library to the new initialization method In-Reply-To: <1150499741.32252.119.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 16 Jun 2006 16:15:41 -0700") References: <1150499741.32252.119.camel@brick.pathscale.com> Message-ID: > The patch I just sent left out a minor change so please > ignore the previous patch and apply this one instead. > (I forgot to include the change to the map file) You can just go ahead and check in libipathverbs changes yourself -- qlogic is definitely going to be the maintainer of that code. - R. From ralphc at pathscale.com Fri Jun 16 16:30:32 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 16:30:32 -0700 Subject: [openib-general] [PATCH] update libipathverbs library to the new initialization method In-Reply-To: References: <1150498133.32252.111.camel@brick.pathscale.com> Message-ID: <1150500632.32252.128.camel@brick.pathscale.com> On Fri, 2006-06-16 at 16:12 -0700, Roland Dreier wrote: > > The current libipathverbs driver in the trunk doesn't > > conform to the new module initialization convention for > > libibverbs.so. This patch corrects that. 
>
> Looks OK but you're now only compatible with unreleased development
> versions of libibverbs -- this won't work against the stable
> libibverbs 1.0 code shipped with Fedora and Debian for example.
>
> You might want to follow the approach libmthca uses to build against
> both libibverbs 1.0 and also pre-1.1 development code.

It's not hard to allow 1.1 libipathverbs to build against 1.0 libibverbs
for just this change, but I suspect the mmap stuff and other 1.1 changes
might not be so easy. I don't really think it makes sense to support
every combination of up- and down-rev compile-time and run-time
compatibility.

> > Also, with this patch, we can now try testing the performance
> > of Roland's changes to eliminate the single ib_uverbs_idr_mutex.
>
> Glad you're going to test, but why do you need this patch? Couldn't
> you just have put a new kernel onto a system with libibverbs 1.0 and
> the old libipathverbs?
>
> - R.

Sure. I was just in the middle of getting the trunk to run again when
you sent your request.
-- Ralph Campbell

From nickpiggin at yahoo.com.au Fri Jun 16 20:59:12 2006
From: nickpiggin at yahoo.com.au (Nick Piggin)
Date: Sat, 17 Jun 2006 13:59:12 +1000
Subject: [openib-general] [PATCH v2 4/7] AMSO1100 Memory Management.
In-Reply-To: <1150128349.22704.20.camel@trinity.ogc.int>
References: <20060607200646.9259.24588.stgit@stevo-desktop> <20060607200655.9259.90768.stgit@stevo-desktop> <20060608011744.1a66e85a.akpm@osdl.org> <1150128349.22704.20.camel@trinity.ogc.int>
Message-ID: <44937E10.3000006@yahoo.com.au>

Tom Tucker wrote:
> On Thu, 2006-06-08 at 01:17 -0700, Andrew Morton wrote:
>
>> On Wed, 07 Jun 2006 15:06:55 -0500 Steve Wise wrote:
>>
>>> +void c2_free(struct c2_alloc *alloc, u32 obj)
>>> +{
>>> +	spin_lock(&alloc->lock);
>>> +	clear_bit(obj, alloc->table);
>>> +	spin_unlock(&alloc->lock);
>>> +}
>>
>> The spinlock is unneeded here.
>
> Good point.

Really? clear_bit does not give you any memory ordering, so you can have
the situation where another CPU sees the bit cleared, but this CPU still
has stores pending to whatever is being freed. Or any number of other
nasty memory ordering badness.

I'd just use the spinlocks, and prepend the clear_bit with a double
underscore (so you get the non-atomic version), if that is appropriate.
The spinlocks nicely handle all the memory ordering issues, and serve to
document the concurrency. If you need every last bit of performance and
scalability, that's OK, but you need comments and I suspect you'd need
more memory barriers.

--
SUSE Labs, Novell Inc.
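Nick's point -- that an atomic clear_bit by itself carries no ordering guarantee, while a lock/unlock pair both serializes writers and orders the freeing CPU's earlier stores -- can be illustrated outside the kernel. The sketch below is a userspace model of the c2_free pattern, with a pthread mutex standing in for the kernel spinlock; all names (toy_alloc, toy_get, toy_free) are invented for illustration, not taken from the AMSO1100 driver.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define TABLE_WORDS 4

/*
 * Toy userspace model of a bitmap object allocator like c2_alloc.
 * The mutex plays the role of the kernel spinlock.
 */
struct toy_alloc {
	pthread_mutex_t lock;        /* stands in for spinlock_t */
	uint32_t table[TABLE_WORDS]; /* one bit per object; set = in use */
};

static void toy_init(struct toy_alloc *a)
{
	pthread_mutex_init(&a->lock, NULL);
	for (int i = 0; i < TABLE_WORDS; ++i)
		a->table[i] = 0;
}

/* Claim object 'obj'; returns 1 if it was free, 0 if already taken. */
static int toy_get(struct toy_alloc *a, uint32_t obj)
{
	uint32_t mask = 1u << (obj % 32);
	int was_free;

	pthread_mutex_lock(&a->lock);
	was_free = !(a->table[obj / 32] & mask);
	a->table[obj / 32] |= mask;
	pthread_mutex_unlock(&a->lock);
	return was_free;
}

/*
 * Free object 'obj'.  The clear is a plain, non-atomic read-modify-write
 * (the kernel analogue would be __clear_bit) because the lock already
 * serializes writers AND orders the freeing thread's earlier stores
 * before any thread that later observes the bit as clear.
 */
static void toy_free(struct toy_alloc *a, uint32_t obj)
{
	pthread_mutex_lock(&a->lock);
	a->table[obj / 32] &= ~(1u << (obj % 32));
	pthread_mutex_unlock(&a->lock);
}
```

Dropping the lock and keeping only an atomic clear would remove the ordering: another CPU could observe the bit clear, reallocate the object, and race with stores the freeing CPU had not yet made visible -- exactly the hazard Nick describes.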
From panda at cse.ohio-state.edu Fri Jun 16 20:57:04 2006
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri, 16 Jun 2006 23:57:04 -0400 (EDT)
Subject: [openib-general] Announcing the availability of MVAPICH 0.9.8-rc0 with on-demand connection management, fault-tolerance and advanced multi-rail scheduling support
Message-ID: <200606170357.k5H3v4w4025857@xi.cse.ohio-state.edu>

The MVAPICH team is pleased to announce the availability of MVAPICH 0.9.8-rc0 with the following new features:

- On-demand connection management using native InfiniBand Unreliable
  Datagram (UD) support. This feature enables InfiniBand connections to
  be set up dynamically, enhancing the scalability of MVAPICH on
  multi-thousand node clusters.

- Support for fault tolerance: mem-to-mem reliable data transfer
  (detection of I/O bus errors with 32-bit CRC and retransmission in
  case of error). This mode enables MVAPICH to deliver messages reliably
  in the presence of I/O bus errors.

- Multi-rail communication support with flexible scheduling policies:
  - Separate control of small and large message scheduling
  - Three different scheduling policies for small messages:
    Using First Subchannel, Round Robin and Process Binding
  - Six different scheduling policies for large messages:
    Round Robin, Weighted Striping, Even Striping, Stripe Blocking,
    Adaptive Striping and Process Binding

- Shared library support for Solaris

- Integrated and easy-to-use build script which automatically detects
  system architecture and InfiniBand adapter types and optimizes MVAPICH
  for any particular installation

More details on all features and supported platforms can be obtained by visiting the project's web page -> Overview -> features.
For downloading MVAPICH 0.9.8-rc0 package and accessing the anonymous SVN, please visit the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ A stripped down version of this release is also available at the OpenIB SVN. Under the download page of the above URL, the latest testing results of this rc0 version for different platforms and test suites are shown. It also shows the rigorous testing procedures being used by the team for MVAPICH and MVAPICH2 releases. As soon as the remaining tests are done, we will make a formal release for MVAPICH 0.9.8. All feedbacks, including bug reports, hints for performance tuning, patches and enhancements are welcome. Please post it to mvapich-discuss mailing list. Thanks, MVAPICH Team at OSU/NBCL From eitan at mellanox.co.il Sat Jun 17 12:36:40 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 17 Jun 2006 22:36:40 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <20060615181524.GB24808@sashak.voltaire.com> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <20060615110617.GA21560@sashak.voltaire.com> <44915060.6090103@mellanox.co.il> <20060615181524.GB24808@sashak.voltaire.com> Message-ID: <449459C8.9050300@mellanox.co.il> Sasha Khapyorsky wrote: I'm working on the changes below. I will send them all as one patch EZ > Hi Eitan, > > On 15:19 Thu 15 Jun , Eitan Zahavi wrote: > >>>>+/* >>>>+* PARAMETERS >>>>+* p_physp >>>>+* [in] Pointer to an osm_physp_t object. >>>>+* >>>>+* RETURN VALUES >>>>+* The pointer to the P_Key table object. >>>>+* >>>>+* NOTES >>>>+* >>>>+* SEE ALSO >>>>+* Port, Physical Port >>>>+*********/ >>>>+ >>> >>> >>>Is not this simpler to remove 'const' from existing >>>osm_physp_get_pkey_tbl() function instead of using new one? >> >>There are plenty of const functions using this function internally >>so I would have need to fix them too. > > > You are right. Maybe separate patch for this? > I think it is preferable to keep the const function. 
> >>>>@@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( >>>> p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); >>>> if ( b < new_blocks ) >>>> p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); >>>>- else { >>>>+ else >>>>+ { >>>> p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); >>>> if (!p_new_block) >>>> break; >>>>+ cl_ptr_vector_set(&((osm_pkey_tbl_t >>>>*)p_pkey_tbl)->new_blocks, + b, >>>>p_new_block); >>>>+ } >>>>+ >>>> memset(p_new_block, 0, sizeof(*p_new_block)); >>>>- cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, >>>>p_new_block); >>>> } >>>>- memcpy(p_new_block, p_block, sizeof(*p_new_block)); >>>>+} >>> >>> >>>You changed this function so it does not do any sync anymore. Should >>>function name be changed too? >> >>Yes correct I will change it. Is a better name: >>osm_pkey_tbl_init_new_blocks ? > > > Great name. > > >>>>+ to show that on the "old" blocks >>>>+*/ >>>>+int >>>>+osm_pkey_tbl_set_new_entry( >>>>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>>>+ IN uint16_t block_idx, >>>>+ IN uint8_t pkey_idx, >>>>+ IN uint16_t pkey) >>>>+{ >>>>+ ib_pkey_table_t *p_old_block; >>>>+ ib_pkey_table_t *p_new_block; >>>>+ >>>>+ if (osm_pkey_tbl_make_block_pair( >>>>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) >>>>+ return 1; >>>>+ >>>>+ cl_map_insert( &p_pkey_tbl->keys, >>>>+ ib_pkey_get_base(pkey), >>>>+ >>>>&(p_old_block->pkey_entry[pkey_idx])); >>> >>> >>>Here you map potentially empty pkey entry. Why? "old block" will be >>>remapped anyway on pkey receiving. >> >>The reason I did this was that if the GetResp will fail I still want to >>represent >>the settings in the map.But actually it might be better not to do that so >>next >>time we run we will not find it without a GetResp. > > > Agree. 
> > >>>>+ IN uint16_t *p_pkey, >>>>+ OUT uint32_t *p_block_idx, >>>>+ OUT uint8_t *p_pkey_index) >>>>+{ >>>>+ uint32_t num_of_blocks; >>>>+ uint32_t block_index; >>>>+ ib_pkey_table_t *block; >>>>+ >>>>+ CL_ASSERT( p_pkey_tbl ); >>>>+ CL_ASSERT( p_block_idx != NULL ); >>>>+ CL_ASSERT( p_pkey_idx != NULL ); >>> >>> >>>Why last two CL_ASSERTs? What should be problem with uninitialized >>>pointers here? >>> >> >>These are the outputs of the function. It does not make sense to call the >>functions with >>null output pointers (calling by ref) . Anyway instead of putting the check >>in the free build >>I used an assert > > > I see. Actually I've overlooked that addresses and not values are > checked. Please ignore this comment. > > >>>>+ >>>>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); >>>>+ if (! p_pkey_tbl) >>> >>> ^^^^^^^^^^^^^ >>>Is it possible? >> >>Yes it is ! I run into it during testing. The port did not have any pkey >>table. > > > static inline osm_pkey_tbl_t * > osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) > { > ... > return( &p_physp->pkeys ); > }; > > This returns the address of physp's pkeys field. Right? > Then if ( &p_physp->pkeys == NULL ) p_physp pointer should be equal to > unsigned equivalent of -(offset of pkey field in physp struct). Correct. I will remove the check. > > >>>>+ "Fail to allocate new pending pkey >>>>entry for node " >>>>+ "0x%016" PRIx64 " port %u\n", >>>>+ cl_ntoh64( osm_node_get_node_guid( >>>>p_node ) ), >>>>+ osm_physp_get_port_num( p_physp ) ); >>>>+ return; >>>>+ } >>>>+ p_pending->pkey = pkey; >>>>+ p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey >>>>) ); >>>>+ if ( !p_orig_pkey || >>>>+ (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) >>>>)) >>> >>> >>>There the cases of new pkey and updated pkey membership is mixed. Why? >> >>I am not following your question. 
>>The specific case I am trying to catch is the one that for some reason the >>map points to >>a pkey entry that was modified somehow and is different then the one you >>would expect by >>the map. > > > Didn't understand it at first pass, now it is clearer. > > If pkey entry was modified somehow (how? bugs?), the assumption is that > mapping still be valid? Then it is not new entry (or we will change > pkey's index in the real table). > PKey table mismatch between the block and map should never happen. I will remove the check and replace that with an ASSERT so I catch the bug if we hit it. > >>>>+ { >>>>+ p_pending->is_new = TRUE; >>>>+ cl_qlist_insert_tail(&p_pkey_tbl->pending, >>>>(cl_list_item_t*)p_pending); >>>>+ stat = "inserted"; >>>>+ } >>>>+ else >>>>+ { >>>>+ p_pending->is_new = FALSE; >>>>+ if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, >>>>+ >>>>&p_pending->block, &p_pending->index)) >>> >>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >>>AFAIK in this function there were CL_ASSERTs which check for uinitialized >>>pointers. >> >>True. So the asserts are not required in this case. > > > Up to you. Actually this my comment may be ignored, as stated above I > didn't read this correctly. > > >>> >>>>+ { >>>>+ osm_log( p_log, OSM_LOG_ERROR, >>>>+ "pkey_mgr_process_physical_port: >>>>ERR 0503: " >>>>+ "Fail to obtain P_Key 0x%04x >>>>block and index for node " >>>>+ "0x%016" PRIx64 " port %u\n", >>>>+ cl_ntoh64( >>>>osm_node_get_node_guid( p_node ) ), >>>>+ osm_physp_get_port_num( >>>>p_physp ) ); >>>>+ return; >>>>+ } >>>>+ cl_qlist_insert_head(&p_pkey_tbl->pending, >>>>(cl_list_item_t*)p_pending); >>>>+ stat = "updated"; >>> >>> >>>Is it will be updated? It is likely "already there" case. No? >>> >>>Also in this case you can already put the pkey in new_block instead of >>>holding it in pending list. Then later you will only need to add new >>>pkeys. This may simplify the flow and even save some mem. 
>> >>True but in my mind it does not simplify - on the contrary it makes the >>partition between >>populating each port pending list and actually setting the pkey tables >>mixed. > > > I meant new_block filling, not actual setting. You will be able to > remove whole if { } else { } flow, as well as is_new, block and index > fields from 'pending' structure (actually only pkey value itself will > matter) - is it not nice simplification? I still prefer the clear staging: append to list when scanning the partitions and filling in the tables when looping on all ports. > > >>I do not think the memory impact deserves this mix of staging >> >> >>> > >>>>+ max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, >>>>p_physp ); >>>>+ if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) >>>> { >>>>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>>>- for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) >>>>+ osm_log( p_log, OSM_LOG_INFO, >>>>+ "pkey_mgr_update_port: " >>>>+ "Max number of blocks reduced from >>>>%u to %u " + "for node 0x%016" PRIx64 " >>>>port %u\n", >>>>+ p_pkey_tbl->max_blocks, >>>>max_num_of_blocks, >>>>+ cl_ntoh64( osm_node_get_node_guid( >>>>p_node ) ), >>>>+ osm_physp_get_port_num( p_physp ) ); >>>>+ } >>>>+ p_pkey_tbl->max_blocks = max_num_of_blocks; >>>>+ >>>>+ osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); >>>>+ cl_map_remove_all( &p_pkey_tbl->keys ); >>> >>> >>>What is the reason to drop map here? AFAIK it will be reinitialized later >>>anyway when pkey blocks will be received. >> >>What if it is not received? > > > Then we will have unreliable data there. > > Maybe I know why you wanted this - this is part of "use pkey tables > before sending/receiving to/from ports" idea? 
> > >>>>@@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( >>>> if (enforce == FALSE) >>>> return FALSE; >>>> >>>>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); >>>>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); >>>>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>>>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); >>>> num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>>>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) >>>>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); >>>>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); >>>>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) >>>>+ { >>>>+ osm_log( p_log, OSM_LOG_ERROR, >>>>+ "pkey_mgr_update_peer_port: ERR >>>>0508: " >>>>+ "not enough entries (%u < %u) on >>>>switch 0x%016" PRIx64 >>>>+ " port %u\n", >>>>+ peer_max_blocks, num_of_blocks, >>>>+ cl_ntoh64( osm_node_get_node_guid( >>>>p_node ) ), >>>>+ osm_physp_get_port_num( peer ) ); >>>>+ return FALSE; >>> >>> >>>Do you think it is the best way, just to skip update - partitions are >>>enforced already on the switch. May be better to truncate pkey tables >>>in order to meet peer's capabilities? >> >>You are right about that - Its a bug! >>I think the best approach here is to turn off the enforcement on the switch. >>If we truncate the table we actually impact connectivity of the fabric. >>I prefer a softer approach - an error in the log. > > > Yes this should be good way to handle this. 
> >
> >
>>>
>>>>+ }
>>>>
>>>>- for ( block_index = 0; block_index < num_of_blocks; block_index++ )
>>>>+ p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
>>>>+ for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++)
>>>> {
>>>> block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index );
>>>> peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index );
>>>> if ( memcmp( peer_block, block, sizeof( *peer_block ) ) )
>>>> {
>>>>+ osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block);
>>>
>>>
>>>Why this (osm_pkey_tbl_set())? This will be called by the receiver.
>>
>>Same as the above note about updating the map:
>>I wanted to avoid waiting for the GetResp.
>>I think it is a mistake and we can actually remove it.
>
>
> Agree.
>
> Sasha.

From or.gerlitz at gmail.com Sun Jun 18 04:35:27 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Sun, 18 Jun 2006 14:35:27 +0300
Subject: [openib-general] ucma into kernel.org
In-Reply-To: <4492D706.4060106@ichips.intel.com>
References: <1150465355.29508.4.camel@stevo-desktop> <4492D706.4060106@ichips.intel.com>
Message-ID: <15ddcffd0606180435g366a6effs4d4826c8b3fbbd4f@mail.gmail.com>

On 6/16/06, Sean Hefty wrote:
> Steve Wise wrote:
> > Will the ucma make it into 2.6.18? I notice it's not in Roland's
> > for-2.6.18 tree right now.
>
> The plan is to allow the userspace interface to mature some before trying to
> merge them upstream. This is why it is not included in 2.6.18.

Hi Sean,

Can you remind (me...) which areas of the cma u/k interface seem to be not mature enough?

An upstream CMA would be a significant step toward making distro kernels (e.g. SLES10 SP1 and RH5) IB-functional enough for production. As the primary interface for RDMA communication management, the uCMA is and will be vastly used, so there should be a good reason not to push it for 2.6.18.

Or.
From eitan at mellanox.co.il Sun Jun 18 04:46:17 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 18 Jun 2006 14:46:17 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy Message-ID: <86fyi2hek6.fsf@mtl066.yok.mtl.com> Hi Hal This is a third take after incorporating Sasha's comments to the partition manager patch I have previously provided. The main difference is that the manager does not touch the current set of pkey tables but only sends Set(PKeyTable). Another one is the handling of switch limited partition cap by clearing the switch enforcement bit (on the specific port). Also modified interface of SMDB access functions from 0/1 to IB_SUCCESS/IB_ERROR/IB_NOT_FOUND appropriately. ~100 Tests passed both dedicated pkey enforcement (pkey.*) and stress test (osmStress.*). The pkey.* test was enhanced to verify correct pkey index is used by the manager (it should keep the original). BTW: the patch intentionally uses tabs and not spaces as I did not know what we have decided to use. To modify back simply replace every tab with 3 spaces. Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 8100) +++ include/opensm/osm_port.h (working copy) @@ -591,6 +591,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. +* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. 
+ */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. +* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 8100) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; + uint16_t max_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used during pkey mgr algorithm only +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* +* max_blocks +* The maximal number of blocks this partition table might hold +* this value is based on node_info (for port 0 or CA) or switch_info +* updated on receiving the node_info or switch_info GetResp +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This objects stores temporary information on pkeys their target block and index +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct _osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous table extracted from the device +* +* index +* The index of the pky within the block +* 
+* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * osm_pkey_tbl_construct @@ -142,7 +190,8 @@ void osm_pkey_tbl_construct( * * SYNOPSIS */ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -209,8 +258,8 @@ osm_pkey_tbl_get_num_blocks( static inline ib_pkey_table_t *osm_pkey_tbl_block_get( const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) { - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); }; /* * p_pkey_tbl @@ -244,16 +293,117 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ -/****f* OpenSM: osm_pkey_tbl_sync_new_blocks + +/****f* OpenSM: osm_pkey_tbl_make_block_pair +* NAME +* osm_pkey_tbl_make_block_pair +* +* DESCRIPTION +* Find or create a pair of "old" and "new" blocks for the +* given block index +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pp_old_block +* [out] Pointer to the old block pointer arg +* +* pp_new_block +* [out] Pointer to the new block pointer arg +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_set_new_entry * NAME -* osm_pkey_tbl_sync_new_blocks +* osm_pkey_tbl_set_new_entry * * DESCRIPTION -* Syncs new_blocks vector content with current pkey table blocks +* stores the given pkey in the "new" blocks array and update +* the "map" to show that on the "old" blocks * * SYNOPSIS */ -void osm_pkey_tbl_sync_new_blocks( +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t 
*p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pkey_idx +* [in] The index within the block +* +* pkey +* [in] PKey to store +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_find_next_free_entry +* NAME +* osm_pkey_find_next_free_entry +* +* DESCRIPTION +* Find the next free entry in the PKey table. Starting at the given +* index and block number. The user should increment pkey_idx before +* next call +* Inspect the "new" blocks array for empty space. +* +* SYNOPSIS +*/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* p_block_idx +* [out] The block index to use +* +* p_pkey_idx +* [out] The index within the block to use +* +* RETURN VALUES +* TRUE if found FALSE if did not find +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_init_new_blocks +* NAME +* osm_pkey_tbl_init_new_blocks +* +* DESCRIPTION +* Initializes new_blocks vector content (clear and allocate) +* +* SYNOPSIS +*/ +void osm_pkey_tbl_init_new_blocks( const osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -263,6 +413,41 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* set the block index and pkey index the given +* pkey is found in. return IB_NOT_FOUND if cound not find +* it, IB_SUCCESS if OK +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. 
+* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set @@ -272,7 +457,8 @@ void osm_pkey_tbl_sync_new_blocks( * * SYNOPSIS */ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl); Index: opensm/osm_prtn.c =================================================================== --- opensm/osm_prtn.c (revision 8100) +++ opensm/osm_prtn.c (working copy) @@ -140,6 +140,12 @@ ib_api_status_t osm_prtn_add_port(osm_lo p_tbl = (full == TRUE) ? &p->full_guid_tbl : &p->part_guid_tbl ; + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "Added port 0x%" PRIx64 " to " + "partition \'%s\' (0x%04x) As %s member\n", + cl_ntoh64(guid), p->name, cl_ntoh16(p->pkey), + full ? "full" : "partial" ); + if (cl_map_insert(p_tbl, guid, p_physp) == NULL) return IB_INSUFFICIENT_MEMORY; Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 8100) +++ opensm/osm_pkey.c (working copy) @@ -94,18 +94,22 @@ void osm_pkey_tbl_destroy( /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl) { cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + p_pkey_tbl->used_blocks = 0; + p_pkey_tbl->max_blocks = 0; return(IB_SUCCESS); } /********************************************************************** **********************************************************************/ -void osm_pkey_tbl_sync_new_blocks( +void 
osm_pkey_tbl_init_new_blocks( IN const osm_pkey_tbl_t *p_pkey_tbl) { ib_pkey_table_t *p_block, *p_new_block; @@ -123,16 +127,31 @@ void osm_pkey_tbl_sync_new_blocks( p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, + b, p_new_block); + } + memset(p_new_block, 0, sizeof(*p_new_block)); - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); +} + +/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl) @@ -203,7 +222,138 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ -static boolean_t __osm_match_pkey ( +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block) +{ + if (block_idx >= p_pkey_tbl->max_blocks) return(IB_ERROR); + + if (pp_old_block) + { + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); + if (! 
*pp_old_block) + { + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_old_block) return(IB_ERROR); + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); + } + } + + if (pp_new_block) + { + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); + if (! *pp_new_block) + { + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_new_block) return(IB_ERROR); + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); + } + } + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +/* + store the given pkey in the "new" blocks array + also makes sure the regular block exists. +*/ +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey) +{ + ib_pkey_table_t *p_old_block; + ib_pkey_table_t *p_new_block; + + if (osm_pkey_tbl_make_block_pair( + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) + return( IB_ERROR ); + + p_new_block->pkey_entry[pkey_idx] = pkey; + if (p_pkey_tbl->used_blocks < block_idx) + p_pkey_tbl->used_blocks = block_idx; + + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx) +{ + ib_pkey_table_t *p_new_block; + + CL_ASSERT(p_block_idx); + CL_ASSERT(p_pkey_idx); + + while ( *p_block_idx < p_pkey_tbl->max_blocks) + { + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) + { + *p_pkey_idx = 0; + (*p_block_idx)++; + if (*p_block_idx >= p_pkey_tbl->max_blocks) + return FALSE; + } + + p_new_block = 
osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); + + if ( !p_new_block || + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) + return TRUE; + else + (*p_pkey_idx)++; + } + return FALSE; +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_idx != NULL ); + + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return( IB_SUCCESS ); + } + } + return( IB_NOT_FOUND ); +} + +/********************************************************************** + **********************************************************************/ +static boolean_t +__osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -306,7 +456,8 @@ osm_physp_share_pkey( if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) return TRUE; - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); + return + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); } /********************************************************************** @@ -322,7 +473,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -330,7 +482,8 @@ osm_port_share_pkey( p_physp1 = 
osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8100) +++ opensm/osm_pkey_mgr.c (working copy) @@ -62,6 +62,131 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int +pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. 
+ */ +static void +pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey ) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + CL_ASSERT( ib_pkey_get_base(*p_orig_pkey) == ib_pkey_get_base(pkey) ); + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx( + p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index) != IB_SUCCESS) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + 
**********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = + full ? &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -114,7 +239,8 @@ pkey_mgr_enforce_partition( p_pi->state_info2 = 0; ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.node_guid = + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pi_context.set_method = TRUE; context.pi_context.update_master_sm_base_lid = FALSE; @@ -131,80 +257,131 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block; + osm_pkey_tbl_t *p_pkey_tbl; uint16_t block_index; + uint8_t pkey_index; + uint16_t last_free_block_index = 0; + uint8_t last_free_pkey_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + osm_log( p_log, OSM_LOG_INFO, + "pkey_mgr_update_port: " + "Max number of blocks reduced from %u to %u " + "for node 0x%016" PRIx64 " port %u\n", + p_pkey_tbl->max_blocks, max_num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } + p_pkey_tbl->max_blocks = max_num_of_blocks; + + osm_pkey_tbl_init_new_blocks( p_pkey_tbl ); + 
p_pkey_tbl->used_blocks = 0; + + /* + process every pending pkey in order - + first must be "updated" last are "new" + */ + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) + { + if (p_pending->is_new == FALSE) + { + block_index = p_pending->block; + pkey_index = p_pending->index; + found = TRUE; + } + else { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + found = osm_pkey_find_next_free_entry(p_pkey_tbl, + &last_free_block_index, + &last_free_pkey_index); + if ( !found ) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to find empty space for new pkey 0x%04x " + "of node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + block_index = last_free_block_index; + pkey_index = last_free_pkey_index++; } } + + if (found) + { + if ( IB_SUCCESS != osm_pkey_tbl_set_new_entry( + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0505: " + "failed to set PKey 0x%04x in block %u idx %u " + "of node 0x%016" PRIx64 " port %u\n", + p_pending->pkey, block_index, pkey_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } - else if ( *p_orig_pkey != pkey ) - { + } + + free( p_pending ); + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } + + /* now look for changes and store */ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - /* we need real block (not just new_block) in order - * to resolve block/pkey indices */ block = 
osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; - } - } - } + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + if (block && + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) + continue; + + status = pkey_mgr_update_pkey_entry( + p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +394,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( peer ) ) return FALSE; p_node = 
osm_physp_get_node_ptr( peer ); @@ -242,10 +421,26 @@ pkey_mgr_update_peer_port( if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) return FALSE; + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u. Clearing Enforcement bit.\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + enforce = FALSE; + } + if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,13 +450,8 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); @@ -272,7 +462,7 @@ pkey_mgr_update_peer_port( ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " 
"pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,10 +472,10 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) + if ( (ret_val == TRUE) && + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_log, OSM_LOG_VERBOSE, + osm_log( p_log, OSM_LOG_DEBUG, "pkey_mgr_update_peer_port: " "pkey table was updated for node 0x%016" PRIx64 " port %u\n", @@ -298,82 +488,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - 
-/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,8 +497,7 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; - + osm_node_t *p_node; CL_ASSERT( p_osm ); OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); @@ -394,32 +507,25 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = 
&p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,8 +534,10 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + p_node = osm_port_get_parent_node( p_port ); + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + pkey_mgr_update_peer_port( + &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) signal = OSM_SIGNAL_DONE_PENDING; From rdreier at cisco.com Sun Jun 18 04:49:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 18 Jun 2006 04:49:25 -0700 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus Opening the floodgates for 2.6.18: Ganapathi CH: IB/uverbs: Release lock on error path Ishai Rabinovitz: IB/srp: Clean up loop in srp_remove_one() IB/srp: Handle DREQ events from CM IB/srp: Factor out common request reset code Jack Morgenstein: IB: Add caching of ports' LMC IB/mad: Check GID/LID when 
matching requests IPoIB: Fix kernel unaligned access on ia64 Leonid Arsh: IB: Add client reregister event type IPoIB: Handle client reregister events IB: Move struct port_info from ipath to IB/mthca: Add client reregister event generation Matthew Wilcox: IB/srp: Use SCAN_WILD_CARD from SCSI headers IB/srp: Get rid of unneeded use of list_for_each_entry_safe() IB/srp: Change target_mutex to a spinlock Michael S. Tsirkin: IB/mthca: restore missing PCI registers after reset IB/mthca: memfree completion with error FW bug workaround IB/mthca: Remove dead code IB/cm: remove unneeded flush_workqueue Or Gerlitz: IB/mthca: Fill in max_map_per_fmr device attribute IB/fmr: Use device's max_map_map_per_fmr attribute in FMR pool. Ramachandra K: [SCSI] srp.h: Add I/O Class values IB/srp: Support SRP rev. 10 targets Roland Dreier: IB/srp: Use FMRs to map gather/scatter lists IB/mthca: Convert FW commands to use wait_for_completion_timeout() IB: Make needlessly global ib_mad_cache static IPoIB: Mention RFC numbers in documentation IB/srp: Get rid of "Target has req_lim 0" messages IPoIB: Avoid using stale last_send counter when reaping AHs IB/ipath: Add client reregister event generation IB/uverbs: Don't decrement usecnt on error paths IB/uverbs: Factor out common idr code IB/mthca: Fix memory leak on modify_qp error paths IB/mthca: Make all device methods truly reentrant IB/uverbs: Don't serialize with ib_uverbs_idr_mutex Sean Hefty: IB: common handling for marshalling parameters to/from userspace IB/cm: Match connection requests based on private data [NET]: Export ip_dev_find() IB: address translation to map IP toIB addresses (GIDs) IB: IP address based RDMA connection manager IB/ucm: convert semaphore to mutex IB/ucm: Get rid of duplicate P_Key parameter IB: Add ib_init_ah_from_wc() IB/sa: Add ib_init_ah_from_path() IB/cm: Use address handle helpers Vu Pham: IB/srp: Allow cmd_per_lun to be set per target port IB/srp: Allow sg_tablesize to be adjusted 
Documentation/infiniband/ipoib.txt | 12 drivers/infiniband/Kconfig | 5 drivers/infiniband/core/Makefile | 11 drivers/infiniband/core/addr.c | 367 +++++ drivers/infiniband/core/cache.c | 30 drivers/infiniband/core/cm.c | 119 + drivers/infiniband/core/cma.c | 1927 ++++++++++++++++++++++++ drivers/infiniband/core/fmr_pool.c | 30 drivers/infiniband/core/mad.c | 97 + drivers/infiniband/core/mad_priv.h | 2 drivers/infiniband/core/sa_query.c | 31 drivers/infiniband/core/ucm.c | 183 +- drivers/infiniband/core/uverbs.h | 4 drivers/infiniband/core/uverbs_cmd.c | 971 +++++++----- drivers/infiniband/core/uverbs_main.c | 35 drivers/infiniband/core/uverbs_marshall.c | 138 ++ drivers/infiniband/core/verbs.c | 44 - drivers/infiniband/hw/ipath/ipath_mad.c | 42 - drivers/infiniband/hw/mthca/mthca_cmd.c | 23 drivers/infiniband/hw/mthca/mthca_cq.c | 12 drivers/infiniband/hw/mthca/mthca_eq.c | 4 drivers/infiniband/hw/mthca/mthca_mad.c | 14 drivers/infiniband/hw/mthca/mthca_provider.c | 33 drivers/infiniband/hw/mthca/mthca_provider.h | 3 drivers/infiniband/hw/mthca/mthca_qp.c | 40 drivers/infiniband/hw/mthca/mthca_reset.c | 59 + drivers/infiniband/hw/mthca/mthca_srq.c | 5 drivers/infiniband/ulp/ipoib/ipoib.h | 34 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 27 drivers/infiniband/ulp/ipoib/ipoib_main.c | 28 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 3 drivers/infiniband/ulp/srp/ib_srp.c | 482 ++++-- drivers/infiniband/ulp/srp/ib_srp.h | 33 include/rdma/ib_addr.h | 114 + include/rdma/ib_cache.h | 13 include/rdma/ib_cm.h | 26 include/rdma/ib_marshall.h | 50 + include/rdma/ib_sa.h | 7 include/rdma/ib_smi.h | 36 include/rdma/ib_user_cm.h | 86 - include/rdma/ib_user_sa.h | 60 + include/rdma/ib_user_verbs.h | 80 + include/rdma/ib_verbs.h | 22 include/rdma/rdma_cm.h | 256 +++ include/rdma/rdma_cm_ib.h | 47 + include/scsi/srp.h | 5 net/ipv4/fib_frontend.c | 1 48 files changed, 4590 insertions(+), 1072 deletions(-) create mode 100644 
drivers/infiniband/core/addr.c create mode 100644 drivers/infiniband/core/cma.c create mode 100644 drivers/infiniband/core/uverbs_marshall.c create mode 100644 include/rdma/ib_addr.h create mode 100644 include/rdma/ib_marshall.h create mode 100644 include/rdma/ib_user_sa.h create mode 100644 include/rdma/rdma_cm.h create mode 100644 include/rdma/rdma_cm_ib.h From ogerlitz at voltaire.com Sun Jun 18 05:13:06 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 18 Jun 2006 15:13:06 +0300 Subject: [openib-general] OFED 1.0 - error while running ib_rdma_bw Message-ID: Running ib_rdma_bw (e.g. from the trunk but also with OFED) from time to time outputs the following message: server read: Success 0/45: Couldn't read remote address Looking in the code, line 148 (and actually 142 as well) seems to be buggy: 133 struct pingpong_dest * pp_client_exch_dest(int sockfd, 134 const struct pingpong_dest *my_dest) 135 { 136 struct pingpong_dest *rem_dest = NULL; 137 char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; 138 int parsed; 139 140 sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, 141 my_dest->psn,my_dest->rkey,my_dest->vaddr); 142 if (write(sockfd, msg, sizeof msg) != sizeof msg) { 143 perror("client write"); 144 fprintf(stderr, "Couldn't send local address\n"); 145 goto out; 146 } 147 148 if (read(sockfd, msg, sizeof msg) != sizeof msg) { 149 perror("client read"); 150 fprintf(stderr, "Couldn't read remote address\n"); 151 goto out; 152 } as read(2) can return fewer than the maximum (expected) byte count, and indeed errno is 0 (no error) when the message is printed. The script below would allow you to easily reproduce it. At some point, there's also an IB completion with error printed, but it might be related to the socket handling bug. Or. 
SERVER=dill echo "" for i in 16384 32768 65536 131072 262144 524288 1048576 2097152 do for k in 4 do ssh $SERVER "/usr/local/ofed/bin/ib_rdma_bw" & sleep 5 echo $(date) -s = $i -n = $((512*1024*1024/$i)) -t = $k start /usr/local/ofed/bin/ib_rdma_bw $SERVER -s $i -n $((512*1024*1024/$i)) -t $k echo $(date) -s = $i -n = $((512*1024*1024/$i)) sleeping 3 seconds..... sleep 3 echo $(date) -s = $i -n = $((512*1024*1024/$i)) end echo "" wait done done -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Sun Jun 18 07:04:39 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 18 Jun 2006 17:04:39 +0300 Subject: [openib-general] is there is any SA client in user level? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30257933F@mtlexch01.mtl.com> Hi. I want to send a join message to the SA from user space. I know that I can use the umad or the osm_vendor in order to do it.. what is the best way to do it? is there is any SA client implementation in the user level (or is it a transparent layer?) thanks Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Sun Jun 18 10:57:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 18 Jun 2006 12:57:53 -0500 Subject: [openib-general] ucma into kernel.org References: <1150465355.29508.4.camel@stevo-desktop> <4492D706.4060106@ichips.intel.com> <15ddcffd0606180435g366a6effs4d4826c8b3fbbd4f@mail.gmail.com> Message-ID: <001e01c69300$b9020c00$020010ac@haggard> | On 6/16/06, Sean Hefty wrote: | > Steve Wise wrote: | > > Will the ucma make it into 2.6.18? I notice its not in Roland's | > > for-2.6.18 tree right now. 
| > | > The plan is to allow the userspace interface to mature some before trying to | > merge them upstream. This is why it is not included in 2.6.18. | | Hi Sean, | | Can you remind (me...) what areas of the cma u/k interface seem to be | not mature enough? | | upstream CMA can be a significant step in the sense of distros (e.g. | SLES10 SP1 and RH5) kernel IB functional enough for production; as the | primary interface for "RDMA communication management" the uCMA is and | would be vastly used, so there should be a good reason why not to push | it for 2.6.18. | I agree that it would be nice to get this into 2.6.18. It seems stable enough IMO. Steve. From sean.hefty at intel.com Sun Jun 18 16:41:16 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 18 Jun 2006 16:41:16 -0700 Subject: [openib-general] is there is any SA client in user level? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30257933F@mtlexch01.mtl.com> Message-ID: <000001c69330$b1a67fb0$0e278686@amr.corp.intel.com> I want to send a join message to the SA from user space. I know that I can use the umad or the osm_vendor in order to do it.. what is the best way to do it? is there is any SA client implementation in the user level (or is it a transparent layer?) There is no SA client in userspace. (I'm not sure that one would be that much simpler than calling umad directly.) Ideally, join requests should go through the kernel through the ib_multicast module to allow for proper reference counting. Currently, the only interface to that from userspace is through the rdma_cm. - Sean -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Mon Jun 19 03:23:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jun 2006 06:23:36 -0400 Subject: [openib-general] [PATCH] OpenSM/SA: In some SA records, send ERR_REQ_INVALID response on LID out of range error Message-ID: <1150712615.4391.55627.camel@hal.voltaire.com> OpenSM/SA: In some SA records, send ERR_REQ_INVALID response on LID out of range error Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 8105) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -413,10 +413,14 @@ osm_vlarb_rec_rcv_process( } else { /* port out of range */ + cl_plock_release( p_rcv->p_lock ); + osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_vlarb_rec_rcv_process: ERR 2A01: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); + goto Exit; } } Index: opensm/osm_sa_pkey_record.c =================================================================== --- opensm/osm_sa_pkey_record.c (revision 8105) +++ opensm/osm_sa_pkey_record.c (working copy) @@ -425,10 +425,14 @@ osm_pkey_rec_rcv_process( } else { /* port out of range */ + cl_plock_release( p_rcv->p_lock ); + osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_pkey_rec_rcv_process: ERR 4609: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); + goto Exit; } } Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 8105) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -393,10 +393,14 @@ osm_slvl_rec_rcv_process( } else { /* port out of range */ + cl_plock_release( p_rcv->p_lock ); + osm_log( p_rcv->p_log, OSM_LOG_ERROR, 
"osm_slvl_rec_rcv_process: ERR 2601: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); + goto Exit; } } From eitan at mellanox.co.il Mon Jun 19 03:45:49 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 19 Jun 2006 13:45:49 +0300 Subject: [openib-general] [PATCH] OpenSM/SA: In some SA records, send ERR_REQ_INVALIDresponse on LID out of range error Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236888C@mtlexch01.mtl.com> Hi Hal, Thanks for finding and fixing. Looks good to me. > Subject: [PATCH] OpenSM/SA: In some SA records, send > ERR_REQ_INVALIDresponse on LID out of range error > > OpenSM/SA: In some SA records, send ERR_REQ_INVALID response on LID out > of range error > From tziporet at mellanox.co.il Mon Jun 19 04:25:11 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 19 Jun 2006 14:25:11 +0300 Subject: [openib-general] OFED 1.0 - Official Release Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA726A@mtlexch01.mtl.com> Yes indeed we inserted one more critical bug fix in SDP. This bug is cause kernel oops in case server and client do not open the same number of sockets. Thus it can easily happened by any user level application using socket. The reason we added it as a patch was to decrease the risk, so if it cause someone a problem it can be reverted easily. Note that we did code review for the fix + tested it on all OS matrix we have to make sure this patch is safe. Tziporet -----Original Message----- From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] Sent: Friday, June 16, 2006 6:58 PM To: Tziporet Koren; OpenFabricsEWG; openib Subject: RE: [openib-general] OFED 1.0 - Official Release Tziporet, I see a few C code changes from pre1 in the form of patches. What are these and why were they added after pre1? 
$ diff -r OFED-1.0-pre1/SOURCES/openib-1.0/patches/OFED-1.0/SOURCES/openib-1.0/pat ches/ 2>&1 | less ... Only in OFED-1.0-pre1/SOURCES/openib-1.0/patches/fixes: handle_reconnect_of_offline_host.patch Only in OFED-1.0/SOURCES/openib-1.0/patches/fixes: sdp_fix.patch Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Tziporet Koren Sent: Friday, June 16, 2006 1:55 AM To: OpenFabricsEWG; openib Subject: [openib-general] OFED 1.0 - Official Release I am happy to announce that OFED 1.0 Official Release is now available. The release can be found under: https://openib.org/svn/gen2/branches/1.0/ofed/releases/ And later today it will be on the OpenFabrics download page: http://www.openfabrics.org/downloads.html. This is the first release that was done in a joint effort of the following companies: * Cisco * SilverStorm * Voltaire * QLogic * Intel * Mellanox Technologies I wish to thank all who contributed to the success of this release. Tziporet ======================================================================== ======= Release summary: The OFED software package is composed of several software modules intended for use on a computer cluster constructed as an InfiniBand network. OFED package contains the following components: o OpenFabrics core and ULPs: - HCA drivers (mthca, ipath) - core - Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Host, RDS and uDAPL o OpenFabrics utilities: - OpenSM: InfiniBand Subnet Manager - Diagnostic tools - Performance tests o MPI: - OSU MPI stack supporting the InfiniBand interface - Open MPI stack supporting the InfiniBand interface - MPI benchmark tests (OSU BW/LAT, Pallas, Presta) o Sources of all software modules (under conditions mentioned in the modules' LICENSE files) o Documentation Notes: 1. SDP and RDS are in technology preview state. 2. 
The SRP Initiator and Open MPI are in beta state. 3. All other OFED components are in production state. Supported Platforms and Operating Systems CPU architectures: * x86_64 * x86 * ia64 * ppc64 Linux Operating Systems: * RedHat EL4 up2: 2.6.9-22.ELsmp * RedHat EL4 up3: 2.6.9-34.ELsmp * Fedora C4: 2.6.11-1.1369_FC4 * SLES10 RC2: 2.6.16.16-1.6-smp (or RC 2.5 2.6.16.14-6-smp) * SLES10 RC1: 2.6.16.14-6-smp * SUSE 10 Pro: 2.6.13-15-smp * kernel.org: 2.6.16.x HCAs Supported Mellanox HCAs: - InfiniHost - InfiniHost III Ex (both modes: with memory and MemFree) - InfiniHost III Lx Both SDR and DDR mode of the InfiniHost III family are supported. For official FW versions please see: http://www.mellanox.com/support/firmware_table.php Qlogic HCAs: - QHT6040 (PathScale InfiniPath HT-460) - QHT6140 (PathScale InfiniPath HT-465) - QLE6140 (PathScale InfiniPath PE-880) Switches Supported This release was tested with switches and gateways provided by the following companies: - Cisco - Voltaire - SilverStorm - Flextronics Attached are the release notes Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 -------------- next part -------------- An HTML attachment was scrubbed... URL: From svenar at simula.no Mon Jun 19 04:35:11 2006 From: svenar at simula.no (Sven-Arne Reinemo) Date: Mon, 19 Jun 2006 13:35:11 +0200 Subject: [openib-general] A few questions about IBMgtSim Message-ID: <44968BEF.9030401@simula.no> Hi, After some testing of IBMgtSim I have a few questions: 1) If I try to build topologies using the MTS14400.ibnl as a building block my simulation fails with a "child process exited abnormally" message. I guess this is related to ibdmchk since the ibdmchk log contains lots of errors like the following: -I- Tracing all CA to CA paths for Credit Loops potential ... 
-E- Potential Credit Loop on Path from:H-1/U1/1 to:H-11/U1/1 Going:Down from:node:0002c9000000007d to:node:0002c9000000006a Going:Up from:node:0002c9000000006a to:node:0002c90000000076 -I- Generating non blocking full link coverage plan into:/tmp/ibdmchk.non_block_ all_links -E- After 32 stages some switch ports are still not covered: -E- Fail to cover port:system:0002c90000000054/node:0002c90000000054/P15 I have included two topology files. One that works and one that fails; the only difference is that the number of hosts is increased from 18 to 20. Also, if I create my own simple ibnl file for a switch with 144 (or other sizes) ports I am able to run simulations. Any suggestions as to what the problem might be? 2) The included example ibmgtsim/tests/RhinoBased10K.topo never finishes (at least not in 24 hours). Does this work for anyone else? All other examples work fine. 3) If I would like to use IBMgtSim with my own (simplified) SM, would it be straightforward? It looks to me like RunSimTest talks to any SM given the correct path, node and port number for the location of the SM. Best regards, -- Sven-Arne Reinemo [simula.research laboratory] http://www.simula.no/ ++++ GnuPG public key - http://home.simula.no/~svenar/gpg.asc ++++ -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: mts14400_n20_not_working.topo URL: From sashak at voltaire.com Mon Jun 19 06:56:53 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 16:56:53 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <86fyi2hek6.fsf@mtl066.yok.mtl.com> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> Message-ID: <20060619135653.GB5521@sashak.voltaire.com> Hi Eitan, On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > > This is a third take after incorporating Sasha's comments to the > partition manager patch I have previously provided. Two small comments below. > /********************************************************************** > **********************************************************************/ > -/* > - * Prepare a new entry for the pkey table for this port when this pkey > - * does not exist. Update existed entry when membership was changed. > - */ > -static void pkey_mgr_process_physical_port( > - IN osm_log_t *p_log, > - IN const osm_req_t *p_req, > - IN const ib_net16_t pkey, > - IN osm_physp_t *p_physp ) > +static boolean_t pkey_mgr_update_port( > + osm_log_t *p_log, > + osm_req_t *p_req, > + const osm_port_t * const p_port ) > { > - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); > - ib_pkey_table_t *block; > + osm_physp_t *p_physp; > + osm_node_t *p_node; p_node is uninitialized and used in osm_log() later, > + ib_pkey_table_t *block, *new_block; > + osm_pkey_tbl_t *p_pkey_tbl; > uint16_t block_index; > + uint8_t pkey_index; > + uint16_t last_free_block_index = 0; > + uint8_t last_free_pkey_index = 0; > uint16_t num_of_blocks; > - const osm_pkey_tbl_t *p_pkey_tbl; > - ib_net16_t *p_orig_pkey; > - char *stat = NULL; > - uint32_t i; > + uint16_t max_num_of_blocks; > > - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + ib_api_status_t status; > + boolean_t ret_val = FALSE; > + osm_pending_pkey_t *p_pending; > + boolean_t found; > > - 
p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); > + p_physp = osm_port_get_default_phys_ptr( p_port ); > + if ( !osm_physp_is_valid( p_physp ) ) > + return FALSE; > > - if ( !p_orig_pkey ) > - { > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); > + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) > { > - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) > + osm_log( p_log, OSM_LOG_INFO, > + "pkey_mgr_update_port: " > + "Max number of blocks reduced from %u to %u " > + "for node 0x%016" PRIx64 " port %u\n", > + p_pkey_tbl->max_blocks, max_num_of_blocks, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + } > @@ -255,13 +450,8 @@ pkey_mgr_update_peer_port( > if (enforce == FALSE) > return FALSE; > > - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); > - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); > - > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; Peer's pkey table blocks may be not initialized yet, and then > + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) > { > block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); peer_block can be NULL. Later in the code (not in this patch) there is 'if (memcmp(peer_block, ...))', should be changed to 'if (!peer_block || memcmp(peer_block, ...))'. 
Sasha From ogerlitz at voltaire.com Mon Jun 19 07:30:16 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 Jun 2006 17:30:16 +0300 (IDT) Subject: [openib-general] trunk's udapl does not compile Message-ID: I've just noted an inconsistency with librdmacm of udapl calling rdma_create_id without providing the PS param. This is the trivial patch i was using to fix the compilation. Or. Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 8106) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -235,7 +235,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_N return DAT_INVALID_ADDRESS; /* cm_id will bind local device/GID based on IP address */ - if (rdma_create_id(g_cm_events, &cm_id, (void*)hca_ptr)) + if (rdma_create_id(g_cm_events, &cm_id, (void*)hca_ptr, RDMA_PS_TCP)) return DAT_INTERNAL_ERROR; ret = rdma_bind_addr(cm_id, Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 8106) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -694,7 +694,7 @@ dapls_ib_setup_conn_listener(IN DAPL_IA dapl_os_lock_init(&conn->lock); /* create CM_ID, bind to local device, create QP */ - if (rdma_create_id(g_cm_events, &conn->cm_id, (void*)conn)) { + if (rdma_create_id(g_cm_events, &conn->cm_id, (void*)conn, RDMA_PS_TCP)) { dapl_os_free(conn, sizeof(*conn)); return(dapl_convert_errno(errno,"setup_listener")); } Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 8106) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -130,7 +130,7 @@ DAT_RETURN dapls_ib_qp_alloc(IN DAPL_IA dapl_os_lock_init(&conn->lock); /* create CM_ID, bind to local device, create QP */ - if (rdma_create_id(g_cm_events, &cm_id, (void*)conn)) { + if (rdma_create_id(g_cm_events, &cm_id, (void*)conn, RDMA_PS_TCP)) { 
dapl_os_free(conn, sizeof(*conn)); return(dapl_convert_errno(errno, "create_qp")); } From ogerlitz at voltaire.com Mon Jun 19 07:43:14 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 Jun 2006 17:43:14 +0300 (IDT) Subject: [openib-general] dapltest gets segfaulted in librdmacm init Message-ID: After fixing the ucma/port space issue with the calls to rdma_create_id i am now trying to run $ ./Target/dapltest -T S -D OpenIB-cma and getting an immediate segfault with the below trace, any idea? Or. #0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 128 context = device->ops.alloc_context(device, cmd_fd); (gdb) where #0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 #1 0x00002af6d3cc4076 in ucma_init () at cma.c:220 #2 0x00002af6d3cc4182 in rdma_create_event_channel () at cma.c:257 #3 0x00002af6d3bb20e3 in dapls_ib_open_hca (hca_name=0x534430 "ib0", hca_ptr=0x532870) at dapl_ib_util.c:222 #4 0x00002af6d3bab454 in dapl_ia_open (name=0x530028 "OpenIB-cma", async_evd_qlen=8, async_evd_handle_ptr=0x52e690, ia_handle_ptr=0x52e660) at dapl_ia_open.c:145 #5 0x00002af6d352e422 in dat_ia_openv (name=0x530028 "OpenIB-cma", async_event_qlen=8, async_event_handle=0x52e690, ia_handle=0x52e660, dapl_major=1, dapl_minor=2, thread_safety=DAT_FALSE) at udat.c:229 #6 0x000000000041461f in DT_cs_Server (params_ptr=0x530020) at dapl_server.c:105 #7 0x0000000000407aa2 in DT_Execute_Test (params_ptr=0x530020) at dapl_execute.c:55 #8 0x000000000041e9d9 in DT_Tdep_Execute_Test (params_ptr=0x530020) at udapl_tdep.c:48 #9 0x0000000000403669 in dapltest (argc=5, argv=0x7fffd7693748) at dapl_main.c:95 #10 0x00000000004035bb in main (argc=5, argv=0x7fffd7693748) at dapl_main.c:37 (gdb) info sharedlibrary >From To Syms Read Shared Object Library 0x00002af6d352e0e0 0x00002af6d3533e38 Yes /usr/local/ib/lib/libdat.so.1 0x00002af6d365d470 0x00002af6d3664d48 Yes /lib64/tls/libpthread.so.0 0x00002af6d37888b0 0x00002af6d3852ce0 Yes 
/lib64/tls/libc.so.6 0x00002af6d398f450 0x00002af6d3990128 Yes /lib64/libdl.so.2 0x00002af6d3a94690 0x00002af6d3a99aa8 Yes /usr/local/ib/lib/libibverbs.so.2 0x00002af6d3415cf0 0x00002af6d3426ab7 Yes /lib64/ld-linux-x86-64.so.2 0x00002af6d3b9ffc0 0x00002af6d3bb7028 Yes /usr/local/ib/lib/libdaplcma.so 0x00002af6d3cc3ca0 0x00002af6d3cc6d18 Yes /usr/local/ib/lib/librdmacm.so 0x00002af6d3deb200 0x00002af6d3df2348 Yes /usr/local/lib/libsysfs.so.1 0x00002af6d3ef5b50 0x00002af6d3efc138 Yes /usr/local/ib/lib/infiniband/mthca.so 0x00002af6d40006c0 0x00002af6d4005838 Yes /usr/local/ib/lib/libibverbs.so.1 From eitan at mellanox.co.il Mon Jun 19 07:43:23 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 19 Jun 2006 17:43:23 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <20060619135653.GB5521@sashak.voltaire.com> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> <20060619135653.GB5521@sashak.voltaire.com> Message-ID: <4496B80B.8090504@mellanox.co.il> Hi Sasha, Thanks! These two are real bugs. I am sending PATCHv4... Sasha Khapyorsky wrote: > Hi Eitan, > > On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > >>This is a third take after incorporating Sasha's comments to the >>partition manager patch I have previously provided. > > > Two small comments below. > > >> /********************************************************************** >> **********************************************************************/ >>-/* >>- * Prepare a new entry for the pkey table for this port when this pkey >>- * does not exist. Update existed entry when membership was changed. 
>>- */ >>-static void pkey_mgr_process_physical_port( >>- IN osm_log_t *p_log, >>- IN const osm_req_t *p_req, >>- IN const ib_net16_t pkey, >>- IN osm_physp_t *p_physp ) >>+static boolean_t pkey_mgr_update_port( >>+ osm_log_t *p_log, >>+ osm_req_t *p_req, >>+ const osm_port_t * const p_port ) >> { >>- osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); >>- ib_pkey_table_t *block; >>+ osm_physp_t *p_physp; >>+ osm_node_t *p_node; > > > p_node is uninitialized and used in osm_log() later, Thanks. I wonder how I missed this one. > > >>+ ib_pkey_table_t *block, *new_block; >>+ osm_pkey_tbl_t *p_pkey_tbl; >> uint16_t block_index; >>+ uint8_t pkey_index; >>+ uint16_t last_free_block_index = 0; >>+ uint8_t last_free_pkey_index = 0; >> uint16_t num_of_blocks; >>- const osm_pkey_tbl_t *p_pkey_tbl; >>- ib_net16_t *p_orig_pkey; >>- char *stat = NULL; >>- uint32_t i; >>+ uint16_t max_num_of_blocks; >> >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ ib_api_status_t status; >>+ boolean_t ret_val = FALSE; >>+ osm_pending_pkey_t *p_pending; >>+ boolean_t found; >> >>- p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); >>+ p_physp = osm_port_get_default_phys_ptr( p_port ); >>+ if ( !osm_physp_is_valid( p_physp ) ) >>+ return FALSE; >> >>- if ( !p_orig_pkey ) >>- { >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); >>+ num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); >>+ if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) >> { >>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>- for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) >>+ osm_log( p_log, OSM_LOG_INFO, >>+ "pkey_mgr_update_port: " >>+ "Max number of blocks reduced from %u to %u " >>+ "for node 0x%016" PRIx64 " port %u\n", >>+ 
p_pkey_tbl->max_blocks, max_num_of_blocks, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+ } > > > >>@@ -255,13 +450,8 @@ pkey_mgr_update_peer_port( >> if (enforce == FALSE) >> return FALSE; >> >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); >>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); >>- >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; > > > The peer's pkey table blocks may not be initialized yet, and then > > >>+ for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) >> { >> block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >> peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); > > > peer_block can be NULL. > > Later in the code (not in this patch) there is > 'if (memcmp(peer_block, ...))', which should be changed to > 'if (!peer_block || memcmp(peer_block, ...))'. > > > Sasha > From sashak at voltaire.com Mon Jun 19 07:50:30 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 17:50:30 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <86fyi2hek6.fsf@mtl066.yok.mtl.com> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> Message-ID: <20060619145030.GC5521@sashak.voltaire.com> On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > > Another one is the handling of switch limited partition cap by > clearing the switch enforcement bit (on the specific port). A comment about this too. See below. 
> +ib_api_status_t > +osm_pkey_tbl_set_new_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t block_idx, > + IN uint8_t pkey_idx, > + IN uint16_t pkey) > +{ > + ib_pkey_table_t *p_old_block; > + ib_pkey_table_t *p_new_block; > + > + if (osm_pkey_tbl_make_block_pair( > + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > + return( IB_ERROR ); > + > + p_new_block->pkey_entry[pkey_idx] = pkey; > + if (p_pkey_tbl->used_blocks < block_idx) > + p_pkey_tbl->used_blocks = block_idx; > + > + return( IB_SUCCESS ); > +} p_pkey_tbl->used_blocks is updated as block index in range 0,1,2.... > @@ -242,10 +421,26 @@ pkey_mgr_update_peer_port( > if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) > return FALSE; > > + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); > + if (peer_max_blocks < p_pkey_tbl->used_blocks) > + { But compared with total number of blocks (ranged 1,2,3,...). In case where switch supports N pkey blocks and CA - N+1, switch's ports will be updated and partitioning enforced. Sasha > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_update_peer_port: ERR 0508: " > + "not enough entries (%u < %u) on switch 0x%016" PRIx64 > + " port %u. Clearing Enforcement bit.\n", > + peer_max_blocks, num_of_blocks, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( peer ) ); > + enforce = FALSE; > + } > + From eitan at mellanox.co.il Mon Jun 19 07:50:36 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 19 Jun 2006 17:50:36 +0300 Subject: [openib-general] [PATCHv4] osm: partition manager force policy Message-ID: <86ejxlgpxf.fsf@mtl066.yok.mtl.com> Hi Hal This is a 4th take after incorporating Sasha's new 2 bug reports for the PATCHv3 for partition manager. The difference from previous patch is very minor: 1. 
p_node is initialized in pkey_mgr_update_port 2. checking for a change in peer port pkey block first check for that block is not null Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 8100) +++ include/opensm/osm_port.h (working copy) @@ -591,6 +591,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. +* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. + */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. 
+* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 8100) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; + uint16_t max_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list of osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used only during the pkey mgr algorithm +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* +* max_blocks +* The maximal number of blocks this partition table might hold; +* this value is based on node_info (for port 0 or CA) or switch_info, +* updated on receiving the node_info or switch_info GetResp +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This object stores temporary information on pkeys, their target block and index, +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct _osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous table extracted from the device +* +* index +* The index of the pkey within the block +* +* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * 
osm_pkey_tbl_construct @@ -142,7 +190,8 @@ void osm_pkey_tbl_construct( * * SYNOPSIS */ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -209,8 +258,8 @@ osm_pkey_tbl_get_num_blocks( static inline ib_pkey_table_t *osm_pkey_tbl_block_get( const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) { - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); }; /* * p_pkey_tbl @@ -244,16 +293,117 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ -/****f* OpenSM: osm_pkey_tbl_sync_new_blocks + +/****f* OpenSM: osm_pkey_tbl_make_block_pair +* NAME +* osm_pkey_tbl_make_block_pair +* +* DESCRIPTION +* Find or create a pair of "old" and "new" blocks for the +* given block index +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pp_old_block +* [out] Pointer to the old block pointer arg +* +* pp_new_block +* [out] Pointer to the new block pointer arg +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_set_new_entry * NAME -* osm_pkey_tbl_sync_new_blocks +* osm_pkey_tbl_set_new_entry * * DESCRIPTION -* Syncs new_blocks vector content with current pkey table blocks +* stores the given pkey in the "new" blocks array and update +* the "map" to show that on the "old" blocks * * SYNOPSIS */ -void osm_pkey_tbl_sync_new_blocks( +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] 
The block index to use +* +* pkey_idx +* [in] The index within the block +* +* pkey +* [in] PKey to store +* +* RETURN VALUES +* IB_SUCCESS if OK, IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_find_next_free_entry +* NAME +* osm_pkey_find_next_free_entry +* +* DESCRIPTION +* Find the next free entry in the PKey table, starting at the given +* index and block number. The user should increment pkey_idx before +* the next call. +* Inspects the "new" blocks array for empty space. +* +* SYNOPSIS +*/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* p_block_idx +* [out] The block index to use +* +* p_pkey_idx +* [out] The index within the block to use +* +* RETURN VALUES +* TRUE if found, FALSE if not found +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_init_new_blocks +* NAME +* osm_pkey_tbl_init_new_blocks +* +* DESCRIPTION +* Initializes new_blocks vector content (clear and allocate) +* +* SYNOPSIS +*/ +void osm_pkey_tbl_init_new_blocks( const osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -263,6 +413,41 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* Sets the block index and pkey index the given +* pkey is found in. Returns IB_NOT_FOUND if it could not find +* it, IB_SUCCESS if OK +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. 
+* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set @@ -272,7 +457,8 @@ void osm_pkey_tbl_sync_new_blocks( * * SYNOPSIS */ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl); Index: opensm/osm_prtn.c =================================================================== --- opensm/osm_prtn.c (revision 8100) +++ opensm/osm_prtn.c (working copy) @@ -140,6 +140,12 @@ ib_api_status_t osm_prtn_add_port(osm_lo p_tbl = (full == TRUE) ? &p->full_guid_tbl : &p->part_guid_tbl ; + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "Added port 0x%" PRIx64 " to " + "partition \'%s\' (0x%04x) As %s member\n", + cl_ntoh64(guid), p->name, cl_ntoh16(p->pkey), + full ? "full" : "partial" ); + if (cl_map_insert(p_tbl, guid, p_physp) == NULL) return IB_INSUFFICIENT_MEMORY; Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 8100) +++ opensm/osm_pkey.c (working copy) @@ -94,18 +94,22 @@ void osm_pkey_tbl_destroy( /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl) { cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + p_pkey_tbl->used_blocks = 0; + p_pkey_tbl->max_blocks = 0; return(IB_SUCCESS); } /********************************************************************** **********************************************************************/ -void osm_pkey_tbl_sync_new_blocks( +void 
osm_pkey_tbl_init_new_blocks( IN const osm_pkey_tbl_t *p_pkey_tbl) { ib_pkey_table_t *p_block, *p_new_block; @@ -123,16 +127,31 @@ void osm_pkey_tbl_sync_new_blocks( p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, + b, p_new_block); + } + memset(p_new_block, 0, sizeof(*p_new_block)); - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); +} + +/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl) @@ -203,7 +222,138 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ -static boolean_t __osm_match_pkey ( +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block) +{ + if (block_idx >= p_pkey_tbl->max_blocks) return(IB_ERROR); + + if (pp_old_block) + { + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); + if (! 
*pp_old_block) + { + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_old_block) return(IB_ERROR); + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); + } + } + + if (pp_new_block) + { + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); + if (! *pp_new_block) + { + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_new_block) return(IB_ERROR); + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); + } + } + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +/* + store the given pkey in the "new" blocks array + also makes sure the regular block exists. +*/ +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey) +{ + ib_pkey_table_t *p_old_block; + ib_pkey_table_t *p_new_block; + + if (osm_pkey_tbl_make_block_pair( + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) + return( IB_ERROR ); + + p_new_block->pkey_entry[pkey_idx] = pkey; + if (p_pkey_tbl->used_blocks < block_idx) + p_pkey_tbl->used_blocks = block_idx; + + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx) +{ + ib_pkey_table_t *p_new_block; + + CL_ASSERT(p_block_idx); + CL_ASSERT(p_pkey_idx); + + while ( *p_block_idx < p_pkey_tbl->max_blocks) + { + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) + { + *p_pkey_idx = 0; + (*p_block_idx)++; + if (*p_block_idx >= p_pkey_tbl->max_blocks) + return FALSE; + } + + p_new_block = 
osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); + + if ( !p_new_block || + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) + return TRUE; + else + (*p_pkey_idx)++; + } + return FALSE; +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_idx != NULL ); + + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return( IB_SUCCESS ); + } + } + return( IB_NOT_FOUND ); +} + +/********************************************************************** + **********************************************************************/ +static boolean_t +__osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -306,7 +456,8 @@ osm_physp_share_pkey( if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) return TRUE; - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); + return + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); } /********************************************************************** @@ -322,7 +473,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -330,7 +482,8 @@ osm_port_share_pkey( p_physp1 = 
osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8100) +++ opensm/osm_pkey_mgr.c (working copy) @@ -62,6 +62,131 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int +pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. 
+ */ +static void +pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey ) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + CL_ASSERT( ib_pkey_get_base(*p_orig_pkey) == ib_pkey_get_base(pkey) ); + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx( + p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index) != IB_SUCCESS) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + 
**********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = + full ? &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -114,7 +239,8 @@ pkey_mgr_enforce_partition( p_pi->state_info2 = 0; ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.node_guid = + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pi_context.set_method = TRUE; context.pi_context.update_master_sm_base_lid = FALSE; @@ -131,80 +257,132 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block; + osm_pkey_tbl_t *p_pkey_tbl; uint16_t block_index; + uint8_t pkey_index; + uint16_t last_free_block_index = 0; + uint8_t last_free_pkey_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_node = osm_physp_get_node_ptr( p_physp ); + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + osm_log( p_log, OSM_LOG_INFO, + "pkey_mgr_update_port: " + "Max number of blocks reduced from %u to %u " + "for node 0x%016" PRIx64 " port %u\n", + p_pkey_tbl->max_blocks, max_num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } + p_pkey_tbl->max_blocks = max_num_of_blocks; + + 
osm_pkey_tbl_init_new_blocks( p_pkey_tbl ); + p_pkey_tbl->used_blocks = 0; + + /* + process every pending pkey in order - + first must be "updated" last are "new" + */ + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) + { + if (p_pending->is_new == FALSE) + { + block_index = p_pending->block; + pkey_index = p_pending->index; + found = TRUE; + } + else { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + found = osm_pkey_find_next_free_entry(p_pkey_tbl, + &last_free_block_index, + &last_free_pkey_index); + if ( !found ) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to find empty space for new pkey 0x%04x " + "of node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + block_index = last_free_block_index; + pkey_index = last_free_pkey_index++; } } + + if (found) + { + if ( IB_SUCCESS != osm_pkey_tbl_set_new_entry( + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0505: " + "failed to set PKey 0x%04x in block %u idx %u " + "of node 0x%016" PRIx64 " port %u\n", + p_pending->pkey, block_index, pkey_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } - else if ( *p_orig_pkey != pkey ) - { + } + + free( p_pending ); + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } + + /* now look for changes and store */ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - /* we need real block (not just new_block) in order - * to 
resolve block/pkey indices */ block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; - } - } - } + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + if (block && + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) + continue; + + status = pkey_mgr_update_pkey_entry( + p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +395,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( 
peer ) ) return FALSE; p_node = osm_physp_get_node_ptr( peer ); @@ -242,10 +422,26 @@ pkey_mgr_update_peer_port( if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) return FALSE; + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u. Clearing Enforcement bit.\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + enforce = FALSE; + } + if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,24 +451,19 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); - if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) + if ( !peer_block || memcmp( peer_block, block, sizeof( *peer_block ) ) ) { status = 
pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); if ( status == IB_SUCCESS ) ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,10 +473,10 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) + if ( (ret_val == TRUE) && + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_log, OSM_LOG_VERBOSE, + osm_log( p_log, OSM_LOG_DEBUG, "pkey_mgr_update_peer_port: " "pkey table was updated for node 0x%016" PRIx64 " port %u\n", @@ -298,82 +489,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for 
node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - -/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,8 +498,7 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; - + osm_node_t *p_node; CL_ASSERT( p_osm ); OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); @@ -394,32 +508,25 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - 
osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = &p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,8 +535,10 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + p_node = osm_port_get_parent_node( p_port ); + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + pkey_mgr_update_peer_port( + &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) signal = OSM_SIGNAL_DONE_PENDING; From bill at Princeton.EDU Mon Jun 19 08:11:12 2006 From: bill at Princeton.EDU (Bill Wichser) Date: Mon, 19 Jun 2006 11:11:12 -0400 Subject: [openib-general] Problem with mca_mpool_openib_register - Cannot allocate memory Message-ID: <4496BE90.40607@princeton.edu> Running the openib stack from Redhat on a 2.6.9-34.ELsmp kernel, dual Xeon. Running with openmpi v1.0.2 compiled w/gcc. 
While we still have the problem with btl_openib_endpoint.c returning 0 bytes for max inline data (we realize that another IB stack addresses this), another problem pops up when running across more than a single host, generating huge amounts of error messages. The errors go something like this: mca_mpool_openib_register: ibv_reg_mr(0x2ac2622000,1052672) failed with error: Cannot allocate memory [0,1,1][btl_openib.c:496:mca_btl_openib_prepare_dst] mpool_register(0x2ac2622040,1048576) failed: base 0x2ac2222040 lb 0 offset 4194304 We fixed the /etc/security/limits.conf problem but I don't know what to do about this one. The job seems to complete without error on 2 nodes (4 processors), but scaling any larger just generates megabyte-sized files of these error messages. Any insights into this problem? All searches lead me to limits.conf, which we have set to 8192. These are 8G machines, if that makes any difference. Thanks, Bill From sashak at voltaire.com Mon Jun 19 08:20:04 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 18:20:04 +0300 Subject: [openib-general] [PATCH TRIVIAL] opensm: fix typo in the usage Message-ID: <20060619152004.GD5521@sashak.voltaire.com> Hi Hal, This fixes a typo in the usage.
Signed-off-by: Sasha Khapyorsky --- osm/opensm/main.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/osm/opensm/main.c b/osm/opensm/main.c index dfb2aec..4382fdb 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -180,7 +180,7 @@ show_usage(void) printf( "-U\n" "--ucast_file \n" " This option specifies name of the unicast dump file\n" - " from where switch forwarding tables will be loaded.\nn"); + " from where switch forwarding tables will be loaded.\n\n"); printf ("-a\n" "--add_guid_file \n" " Set the root nodes for the Up/Down routing algorithm\n" From eitan at mellanox.co.il Mon Jun 19 08:24:40 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 19 Jun 2006 18:24:40 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <20060619145030.GC5521@sashak.voltaire.com> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> <20060619145030.GC5521@sashak.voltaire.com> Message-ID: <4496C1B8.40200@mellanox.co.il> Hi Sasha, Thanks. This is yet another bug. The fix is trivial and is noted below. Please let me know when you are done reviewing and I will post a new patch. EZ Sasha Khapyorsky wrote: > On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > >>Another one is the handling of switch limited partition cap by >>clearing the switch enforcement bit (on the specific port). > > > Some comment about this too. See below. 
> > >>+ib_api_status_t >>+osm_pkey_tbl_set_new_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t block_idx, >>+ IN uint8_t pkey_idx, >>+ IN uint16_t pkey) >>+{ >>+ ib_pkey_table_t *p_old_block; >>+ ib_pkey_table_t *p_new_block; >>+ >>+ if (osm_pkey_tbl_make_block_pair( >>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) >>+ return( IB_ERROR ); >>+ >>+ p_new_block->pkey_entry[pkey_idx] = pkey; >>+ if (p_pkey_tbl->used_blocks < block_idx) >>+ p_pkey_tbl->used_blocks = block_idx; Fix: if (p_pkey_tbl->used_blocks <= block_idx) p_pkey_tbl->used_blocks = block_idx + 1; >>+ >>+ return( IB_SUCCESS ); >>+} > > > p_pkey_tbl->used_blocks is updated as block index in range 0,1,2.... > > >>@@ -242,10 +421,26 @@ pkey_mgr_update_peer_port( >> if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) >> return FALSE; >> >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); >>+ num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) >>+ { > > > But compared with total number of blocks (ranged 1,2,3,...). In case > where switch supports N pkey blocks and CA - N+1, switch's ports will be > updated and partitioning enforced. > > Sasha > > >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_update_peer_port: ERR 0508: " >>+ "not enough entries (%u < %u) on switch 0x%016" PRIx64 >>+ " port %u. 
Clearing Enforcement bit.\n", >>+ peer_max_blocks, num_of_blocks, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( peer ) ); >>+ enforce = FALSE; >>+ } >>+ > > From halr at voltaire.com Mon Jun 19 08:24:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jun 2006 11:24:41 -0400 Subject: [openib-general] [PATCH TRIVIAL] opensm: fix typo in the usage In-Reply-To: <20060619152004.GD5521@sashak.voltaire.com> References: <20060619152004.GD5521@sashak.voltaire.com> Message-ID: <1150730675.4391.67038.camel@hal.voltaire.com> On Mon, 2006-06-19 at 11:20, Sasha Khapyorsky wrote: > Hi Hal, > > This fixes a typo in the usage. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From bugzilla-daemon at openib.org Mon Jun 19 08:53:25 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 19 Jun 2006 08:53:25 -0700 (PDT) Subject: [openib-general] [Bug 145] New: IB Core unable to communicate IPoIB on Fedora Core 4 Message-ID: <20060619155325.75BDD228735@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=145 Summary: IB Core unable to communicate IPoIB on Fedora Core 4 Product: OpenFabrics Linux Version: 1.0rc5 Platform: X86-64 OS/Version: Other Status: NEW Severity: major Priority: P2 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: smarsh at analogic.com I have installed OFED 1.0rc5 on a dual-core Intel X86-64 system with Fedora Core 4 (2.6.11-1.1369) installed. I installed using the "everything" option and typed "no" for mpi_osu with the gcc install. Everything compiles without error. After the install and a reboot, the ib0 and ib1 connections are apparent. I can ping over the TCP/IP stack but cannot ibping (I suspect I have issues with SDP; the daemon seems to be running, though). I receive the following message in verbose mode: "ibwarn: [3494: ibping to Lid 0xc failed". Any help would be greatly appreciated.
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sashak at voltaire.com Mon Jun 19 09:25:45 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 19:25:45 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <4496C1B8.40200@mellanox.co.il> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> <20060619145030.GC5521@sashak.voltaire.com> <4496C1B8.40200@mellanox.co.il> Message-ID: <20060619162545.GE5521@sashak.voltaire.com> On 18:24 Mon 19 Jun , Eitan Zahavi wrote: > Hi Sasha, > > Thanks. This is yet another bug. > The fix is trivial and is noted below. > > Please let me know when you are done reviewing and I will post a new patch. I'm done. Did some running, enforcement works as expected now. Sasha > > EZ > Sasha Khapyorsky wrote: > >On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > > > >>Another one is the handling of switch limited partition cap by > >>clearing the switch enforcement bit (on the specific port). > > > > > >Some comment about this too. See below. > > > > > >>+ib_api_status_t > >>+osm_pkey_tbl_set_new_entry( > >>+ IN osm_pkey_tbl_t *p_pkey_tbl, > >>+ IN uint16_t block_idx, > >>+ IN uint8_t pkey_idx, > >>+ IN uint16_t pkey) > >>+{ > >>+ ib_pkey_table_t *p_old_block; > >>+ ib_pkey_table_t *p_new_block; > >>+ > >>+ if (osm_pkey_tbl_make_block_pair( > >>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > >>+ return( IB_ERROR ); > >>+ > >>+ p_new_block->pkey_entry[pkey_idx] = pkey; > >>+ if (p_pkey_tbl->used_blocks < block_idx) > >>+ p_pkey_tbl->used_blocks = block_idx; > Fix: > if (p_pkey_tbl->used_blocks <= block_idx) > p_pkey_tbl->used_blocks = block_idx + 1; > >>+ > >>+ return( IB_SUCCESS ); > >>+} > > > > > >p_pkey_tbl->used_blocks is updated as block index in range 0,1,2.... 
> > > > > >>@@ -242,10 +421,26 @@ pkey_mgr_update_peer_port( > >> if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || > >> !p_si->enforce_cap) > >> return FALSE; > >> > >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > >>+ num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > >>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); > >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) > >>+ { > > > > > >But compared with total number of blocks (ranged 1,2,3,...). In case > >where switch supports N pkey blocks and CA - N+1, switch's ports will be > >updated and partitioning enforced. > > > >Sasha > > > > > >>+ osm_log( p_log, OSM_LOG_ERROR, > >>+ "pkey_mgr_update_peer_port: ERR > >>0508: " > >>+ "not enough entries (%u < %u) on > >>switch 0x%016" PRIx64 > >>+ " port %u. Clearing Enforcement > >>bit.\n", > >>+ peer_max_blocks, num_of_blocks, > >>+ cl_ntoh64( osm_node_get_node_guid( > >>p_node ) ), > >>+ osm_physp_get_port_num( peer ) ); > >>+ enforce = FALSE; > >>+ } > >>+ > > > > > From halr at voltaire.com Mon Jun 19 09:39:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jun 2006 12:39:54 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_sa_link_record.c: Fix LMC > 0 handling Message-ID: <1150735193.4391.69996.camel@hal.voltaire.com> OpenSM/osm_sa_link_record.c: Fix LMC > 0 handling In osm_sa_link_record.c, properly handle non base LID requests per C15-0.1.11: Query responses shall contain a port's base LID in any LID component of a RID. So when LMC is non 0, the only records that appear are those with the base LID and not with any masked LIDs. Furthermore, if a query comes in on a non base LID, the LID in the RID returned is only with the base LID. To do this, added new routine osm_get_port_by_base_lid in osm_port.c for use by other SA records. Also, fixed some error handling for SA GetTable LinkRecord requests. 
Also, added more SA LinkRecord test cases to osmtest/osmtest.c Signed-off-by: Hal Rosenstock Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 8108) +++ include/opensm/osm_port.h (working copy) @@ -1737,6 +1737,42 @@ osm_port_get_lid_range_ho( * Port *********/ +/****f* OpenSM: Port/osm_get_port_by_base_lid +* NAME +* osm_get_port_by_base_lid +* +* DESCRIPTION +* Returns a status on whether a Port was able to be +* determined based on the LID supplied and if so, return the Port. +* +* SYNOPSIS +*/ +ib_api_status_t +osm_get_port_by_base_lid( + IN const osm_subn_t* const p_subn, + IN const ib_net16_t lid, + IN OUT const osm_port_t** const pp_port ); +/* +* PARAMETERS +* p_subn +* [in] Pointer to the subnet data structure. +* +* lid +* [in] LID requested. +* +* pp_port +* [in][out] Pointer to pointer to Port object. +* +* RETURN VALUES +* IB_SUCCESS +* IB_NOT_FOUND +* +* NOTES +* +* SEE ALSO +* Port +*********/ + /****f* OpenSM: Port/osm_port_add_new_physp * NAME * osm_port_add_new_physp Index: opensm/osm_port.c =================================================================== --- opensm/osm_port.c (revision 8108) +++ opensm/osm_port.c (working copy) @@ -266,6 +266,44 @@ osm_port_get_lid_range_ho( /********************************************************************** **********************************************************************/ +ib_api_status_t +osm_get_port_by_base_lid( + IN const osm_subn_t* const p_subn, + IN const ib_net16_t lid, + IN OUT const osm_port_t** const pp_port ) +{ + ib_api_status_t status; + uint16_t base_lid; + uint8_t lmc; + + *pp_port = NULL; + + /* Loop on lmc from 0 up through max LMC */ + for (lmc = 0; lmc <= IB_PORT_LMC_MAX; lmc++) + { + /* Calculate a base LID assuming this is the real LMC */ + base_lid = (cl_ntoh16(lid) & ~(1 << lmc)); + + /* Look for a match */ + status = cl_ptr_vector_at( &p_subn->port_lid_tbl, + base_lid, + 
(void**)pp_port ); + if ((status == CL_SUCCESS) && (*pp_port != NULL)) + { + /* Determine if base LID "tested" is the real base LID */ + /* This is true if the LMC "tested" is the port's actual LMC */ + if (lmc == osm_port_get_lmc( *pp_port ) ) + goto Found; + } + } + status = IB_NOT_FOUND; + + Found: + return status; +} + +/********************************************************************** + **********************************************************************/ void osm_port_add_new_physp( IN osm_port_t* const p_port, Index: opensm/osm_sa_link_record.c =================================================================== --- opensm/osm_sa_link_record.c (revision 8108) +++ opensm/osm_sa_link_record.c (working copy) @@ -209,7 +209,6 @@ __osm_lr_rcv_get_physp_link( ib_net16_t from_max_lid_ho; ib_net16_t to_max_lid_ho; ib_net16_t to_base_lid_ho; - uint16_t i, j; OSM_LOG_ENTER( p_rcv->p_log, __osm_lr_rcv_get_physp_link ); @@ -313,30 +312,12 @@ __osm_lr_rcv_get_physp_link( dest_port_num ); } - if( comp_mask & IB_LR_COMPMASK_FROM_LID ) - { - from_max_lid_ho = from_base_lid_ho = cl_ntoh16(p_lr->from_lid); - } - else - { - __get_lid_range(p_src_physp, &from_base_lid_ho, &from_max_lid_ho); - } + __get_lid_range(p_src_physp, &from_base_lid_ho, &from_max_lid_ho); + __get_lid_range(p_dest_physp, &to_base_lid_ho, &to_max_lid_ho); - if( comp_mask & IB_LR_COMPMASK_TO_LID ) - { - to_max_lid_ho = to_base_lid_ho = cl_ntoh16(p_lr->to_lid); - } - else - { - __get_lid_range(p_dest_physp, &to_base_lid_ho, &to_max_lid_ho); - } - - for (i = from_base_lid_ho; i <= from_max_lid_ho; i++) - { - for(j = to_base_lid_ho; j <= to_max_lid_ho; j++) - __osm_lr_rcv_build_physp_link(p_rcv, cl_ntoh16(i), cl_ntoh16(j), - src_port_num, dest_port_num, p_list); - } + __osm_lr_rcv_build_physp_link(p_rcv, cl_ntoh16(from_base_lid_ho), + cl_ntoh16(to_base_lid_ho), + src_port_num, dest_port_num, p_list); Exit: OSM_LOG_EXIT( p_rcv->p_log ); @@ -515,12 +496,11 @@ __osm_lr_rcv_get_end_points( if( 
p_sa_mad->comp_mask & IB_LR_COMPMASK_FROM_LID ) { - status = cl_ptr_vector_at( &p_rcv->p_subn->port_lid_tbl, - cl_ntoh16(p_lr->from_lid), - (void**)pp_src_port ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, + p_lr->from_lid, + pp_src_port ); - if( ( (status != CL_SUCCESS) || (*pp_src_port == NULL) ) && - (p_sa_mad->method == IB_MAD_METHOD_GET) ) + if( (status != CL_SUCCESS) || (*pp_src_port == NULL) ) { /* This 'error' is the client's fault (bad lid) so @@ -539,12 +519,11 @@ __osm_lr_rcv_get_end_points( if( p_sa_mad->comp_mask & IB_LR_COMPMASK_TO_LID ) { - status = cl_ptr_vector_at( &p_rcv->p_subn->port_lid_tbl, - cl_ntoh16(p_lr->to_lid), - (void**)pp_dest_port ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, + p_lr->to_lid, + pp_dest_port ); - if( ( (status != CL_SUCCESS) || (*pp_dest_port == NULL) ) && - (p_sa_mad->method == IB_MAD_METHOD_GET) ) + if( (status != CL_SUCCESS) || (*pp_dest_port == NULL) ) { /* This 'error' is the client's fault (bad lid) so @@ -732,8 +711,8 @@ osm_lr_rcv_process( { const ib_link_record_t* p_lr; const ib_sa_mad_t* p_sa_mad; - const osm_port_t* p_src_port = NULL; - const osm_port_t* p_dest_port = NULL; + const osm_port_t* p_src_port; + const osm_port_t* p_dest_port; cl_qlist_t lr_list; ib_net16_t sa_status; osm_physp_t* p_req_physp; @@ -784,16 +763,12 @@ osm_lr_rcv_process( sa_status = __osm_lr_rcv_get_end_points( p_rcv, p_madw, &p_src_port, &p_dest_port ); - if( sa_status != IB_SA_MAD_STATUS_SUCCESS ) + if( sa_status == IB_SA_MAD_STATUS_SUCCESS ) { - cl_plock_release( p_rcv->p_lock ); - osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status ); - goto Exit; + __osm_lr_rcv_get_port_links( p_rcv, p_lr, p_src_port, p_dest_port, + p_sa_mad->comp_mask, &lr_list, p_req_physp ); } - __osm_lr_rcv_get_port_links( p_rcv, p_lr, p_src_port, p_dest_port, - p_sa_mad->comp_mask, &lr_list, p_req_physp ); - cl_plock_release( p_rcv->p_lock ); if( (cl_qlist_count( &lr_list ) == 0) && Index: osmtest/osmtest.c 
=================================================================== --- osmtest/osmtest.c (revision 8109) +++ osmtest/osmtest.c (working copy) @@ -4309,6 +4309,99 @@ osmtest_validate_all_path_recs( IN osmte OSM_LOG_EXIT( &p_osmt->log ); return ( status ); } + +/********************************************************************** + * Get link record by LID + **********************************************************************/ +ib_api_status_t +osmtest_get_link_rec_by_lid( IN osmtest_t * const p_osmt, + IN ib_net16_t const from_lid, + IN ib_net16_t const to_lid, + IN OUT osmtest_req_context_t * const p_context ) +{ + ib_api_status_t status = IB_SUCCESS; + osmv_user_query_t user; + osmv_query_req_t req; + ib_link_record_t record; + ib_mad_t *p_mad; + + OSM_LOG_ENTER( &p_osmt->log, osmtest_get_link_rec_by_lid ); + + if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) + { + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_get_link_rec_by_lid: " + "Getting link record from LID 0x%02X to LID 0x%02X\n", + cl_ntoh16( from_lid ), cl_ntoh16( to_lid ) ); + } + + /* + * Do a blocking query for this record in the subnet. + * The result is returned in the result field of the caller's + * context structure. + * + * The query structures are locals. 
+ */ + memset( &req, 0, sizeof( req ) ); + memset( &user, 0, sizeof( user ) ); + memset( &record, 0, sizeof( record ) ); + + record.from_lid = from_lid; + record.to_lid = to_lid; + p_context->p_osmt = p_osmt; + if (from_lid) + user.comp_mask |= IB_LR_COMPMASK_FROM_LID; + if (to_lid) + user.comp_mask |= IB_LR_COMPMASK_TO_LID; + user.attr_id = IB_MAD_ATTR_LINK_RECORD; + user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) ); + user.p_attr = &record; + + req.query_type = OSMV_QUERY_USER_DEFINED; + req.timeout_ms = p_osmt->opt.transaction_timeout; + req.retry_cnt = p_osmt->opt.retry_count; + req.flags = OSM_SA_FLAGS_SYNC; + req.query_context = p_context; + req.pfn_query_cb = osmtest_query_res_cb; + req.p_query_input = &user; + req.sm_key = 0; + + status = osmv_query_sa( p_osmt->h_bind, &req ); + if( status != IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_link_rec_by_lid: ERR 007A: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + goto Exit; + } + + status = p_context->result.status; + + if( status != IB_SUCCESS ) + { + if (status != IB_INVALID_PARAMETER) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_link_rec_by_lid: ERR 007B: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + } + if( status == IB_REMOTE_ERROR ) + { + p_mad = osm_madw_get_mad_ptr( p_context->result.p_result_madw ); + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_link_rec_by_lid: " + "Remote error = %s\n", + ib_get_mad_status_str( p_mad )); + + status = (ib_net16_t) (p_mad->status & IB_SMP_STATUS_MASK ); + } + goto Exit; + } + + Exit: + OSM_LOG_EXIT( &p_osmt->log ); + return ( status ); +} #endif /********************************************************************** @@ -4891,9 +4984,10 @@ osmtest_validate_against_db( IN osmtest_ { ib_api_status_t status = IB_SUCCESS; #ifdef VENDOR_RMPP_SUPPORT + ib_net16_t test_lid; uint8_t lmc; -#ifdef DUAL_SIDED_RMPP osmtest_req_context_t context; +#ifdef DUAL_SIDED_RMPP 
osmv_multipath_req_t request; #endif #endif @@ -5003,6 +5097,7 @@ osmtest_validate_against_db( IN osmtest_ #endif #ifdef VENDOR_RMPP_SUPPORT + /* GUIDInfoRecords */ status = osmtest_validate_all_guidinfo_recs( p_osmt ); if( status != IB_SUCCESS ) goto Exit; @@ -5019,6 +5114,43 @@ osmtest_validate_against_db( IN osmtest_ goto Exit; } + /* Some LinkRecord tests */ + test_lid = cl_ntoh16( p_osmt->local_port.lid ); + /* FromLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, test_lid, 0, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* ToLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, 0, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* FromLID & ToLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, test_lid, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + if (lmc != 0) + { + test_lid = cl_ntoh16( p_osmt->local_port.lid + 1 ); + /* FromLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, test_lid, 0, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* ToLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, 0, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + } + + /* PathRecords */ if (! p_osmt->opt.ignore_path_records) { status = osmtest_validate_all_path_recs( p_osmt ); From jlentini at netapp.com Mon Jun 19 10:19:59 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 19 Jun 2006 13:19:59 -0400 (EDT) Subject: [openib-general] trunk's udapl does not compile In-Reply-To: References: Message-ID: On Mon, 19 Jun 2006, Or Gerlitz wrote: > I've just noted an inconsistency with librdmacm of udapl calling > rdma_create_id without providing the PS param. > > This is the trivial patch i was using to fix the compilation. Yup. 
The RDMA CM update on Friday afternoon broke uDAPL. Fixed in revision 8112. From jlentini at netapp.com Mon Jun 19 10:23:39 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 19 Jun 2006 13:23:39 -0400 (EDT) Subject: [openib-general] dapltest gets segfaulted in librdmacm init In-Reply-To: References: Message-ID: I don't see this. The gdb sharedlibrary output looks suspicious. /usr/local/ib isn't a standard path for our binaries. Are you sure everything is up-to-date on your system? Is the provided library that you have configured to handle IA "OpenIB-cma" the latest and greatest? On Mon, 19 Jun 2006, Or Gerlitz wrote: > After fixing the ucma/port space issue with the calls to rdma_create_id i > am now trying to run > > $ ./Target/dapltest -T S -D OpenIB-cma > > and getting an immediate segfault with the below trace, any idea? > > Or. > > #0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 > 128 context = device->ops.alloc_context(device, cmd_fd); > (gdb) where > #0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 > #1 0x00002af6d3cc4076 in ucma_init () at cma.c:220 > #2 0x00002af6d3cc4182 in rdma_create_event_channel () at cma.c:257 > #3 0x00002af6d3bb20e3 in dapls_ib_open_hca (hca_name=0x534430 "ib0", hca_ptr=0x532870) at dapl_ib_util.c:222 > #4 0x00002af6d3bab454 in dapl_ia_open (name=0x530028 "OpenIB-cma", async_evd_qlen=8, async_evd_handle_ptr=0x52e690, > ia_handle_ptr=0x52e660) at dapl_ia_open.c:145 > #5 0x00002af6d352e422 in dat_ia_openv (name=0x530028 "OpenIB-cma", async_event_qlen=8, async_event_handle=0x52e690, > ia_handle=0x52e660, dapl_major=1, dapl_minor=2, thread_safety=DAT_FALSE) at udat.c:229 > #6 0x000000000041461f in DT_cs_Server (params_ptr=0x530020) at dapl_server.c:105 > #7 0x0000000000407aa2 in DT_Execute_Test (params_ptr=0x530020) at dapl_execute.c:55 > #8 0x000000000041e9d9 in DT_Tdep_Execute_Test (params_ptr=0x530020) at udapl_tdep.c:48 > #9 0x0000000000403669 in dapltest (argc=5, 
argv=0x7fffd7693748) at dapl_main.c:95 > #10 0x00000000004035bb in main (argc=5, argv=0x7fffd7693748) at dapl_main.c:37 > (gdb) info sharedlibrary > >From To Syms Read Shared Object Library > 0x00002af6d352e0e0 0x00002af6d3533e38 Yes /usr/local/ib/lib/libdat.so.1 > 0x00002af6d365d470 0x00002af6d3664d48 Yes /lib64/tls/libpthread.so.0 > 0x00002af6d37888b0 0x00002af6d3852ce0 Yes /lib64/tls/libc.so.6 > 0x00002af6d398f450 0x00002af6d3990128 Yes /lib64/libdl.so.2 > 0x00002af6d3a94690 0x00002af6d3a99aa8 Yes /usr/local/ib/lib/libibverbs.so.2 > 0x00002af6d3415cf0 0x00002af6d3426ab7 Yes /lib64/ld-linux-x86-64.so.2 > 0x00002af6d3b9ffc0 0x00002af6d3bb7028 Yes /usr/local/ib/lib/libdaplcma.so > 0x00002af6d3cc3ca0 0x00002af6d3cc6d18 Yes /usr/local/ib/lib/librdmacm.so > 0x00002af6d3deb200 0x00002af6d3df2348 Yes /usr/local/lib/libsysfs.so.1 > 0x00002af6d3ef5b50 0x00002af6d3efc138 Yes /usr/local/ib/lib/infiniband/mthca.so > 0x00002af6d40006c0 0x00002af6d4005838 Yes /usr/local/ib/lib/libibverbs.so.1 > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Mon Jun 19 10:27:18 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 Jun 2006 12:27:18 -0500 Subject: [openib-general] MVAPICH and librdmacm Message-ID: <1150738038.26165.5.camel@stevo-desktop> Hello, Anybody working on porting the MVAPICH code to use the RDMA CM for connection setup? Just wondering how much work is needed to make MVAPICH run on the iwarp devices. Thanks, Steve. 
From bugzilla-daemon at openib.org Mon Jun 19 10:32:29 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 19 Jun 2006 10:32:29 -0700 (PDT) Subject: [openib-general] [Bug 145] IB Core unable to communicate IPoIB on Fedora Core 4 Message-ID: <20060619173229.EB5CA228738@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=145 ------- Comment #1 from halr at voltaire.com 2006-06-19 10:32 ------- If I understand what you wrote correctly, IPoIB is running fine but ibping reports some error. What is LID 0xC (and how was this determined) ? Is the ibping kernel module running or the user space daemon for ibping running on LID 0xC ? This may or may not be separate from whatever SDP issue you may have. Can you do an ibnetdiscover and attach the output ? Can you do an /sbin/lsmod | grep ib_ on the remote node (LID 0xC) ? ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From ardavis at ichips.intel.com Mon Jun 19 11:16:15 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 19 Jun 2006 11:16:15 -0700 Subject: [openib-general] dapltest gets segfaulted in librdmacm init In-Reply-To: References: Message-ID: <4496E9EF.1090607@ichips.intel.com> Or Gerlitz wrote: >After fixing the ucma/port space issue with the calls to rdma_create_id i >am now trying to run > > $ ./Target/dapltest -T S -D OpenIB-cma > >and getting an immediate segfault with the below trace, any idea? > > Hmm, no idea. I just updated to 8112 and everything runs fine for me (2.6.17). >Or. 
> >#0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 >128 context = device->ops.alloc_context(device, cmd_fd); >(gdb) where >#0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 >#1 0x00002af6d3cc4076 in ucma_init () at cma.c:220 >#2 0x00002af6d3cc4182 in rdma_create_event_channel () at cma.c:257 >#3 0x00002af6d3bb20e3 in dapls_ib_open_hca (hca_name=0x534430 "ib0", hca_ptr=0x532870) at dapl_ib_util.c:222 >#4 0x00002af6d3bab454 in dapl_ia_open (name=0x530028 "OpenIB-cma", async_evd_qlen=8, async_evd_handle_ptr=0x52e690, > ia_handle_ptr=0x52e660) at dapl_ia_open.c:145 >#5 0x00002af6d352e422 in dat_ia_openv (name=0x530028 "OpenIB-cma", async_event_qlen=8, async_event_handle=0x52e690, > ia_handle=0x52e660, dapl_major=1, dapl_minor=2, thread_safety=DAT_FALSE) at udat.c:229 >#6 0x000000000041461f in DT_cs_Server (params_ptr=0x530020) at dapl_server.c:105 >#7 0x0000000000407aa2 in DT_Execute_Test (params_ptr=0x530020) at dapl_execute.c:55 >#8 0x000000000041e9d9 in DT_Tdep_Execute_Test (params_ptr=0x530020) at udapl_tdep.c:48 >#9 0x0000000000403669 in dapltest (argc=5, argv=0x7fffd7693748) at dapl_main.c:95 >#10 0x00000000004035bb in main (argc=5, argv=0x7fffd7693748) at dapl_main.c:37 >(gdb) info sharedlibrary >>From To Syms Read Shared Object Library >0x00002af6d352e0e0 0x00002af6d3533e38 Yes /usr/local/ib/lib/libdat.so.1 >0x00002af6d365d470 0x00002af6d3664d48 Yes /lib64/tls/libpthread.so.0 >0x00002af6d37888b0 0x00002af6d3852ce0 Yes /lib64/tls/libc.so.6 >0x00002af6d398f450 0x00002af6d3990128 Yes /lib64/libdl.so.2 >0x00002af6d3a94690 0x00002af6d3a99aa8 Yes /usr/local/ib/lib/libibverbs.so.2 >0x00002af6d3415cf0 0x00002af6d3426ab7 Yes /lib64/ld-linux-x86-64.so.2 >0x00002af6d3b9ffc0 0x00002af6d3bb7028 Yes /usr/local/ib/lib/libdaplcma.so >0x00002af6d3cc3ca0 0x00002af6d3cc6d18 Yes /usr/local/ib/lib/librdmacm.so >0x00002af6d3deb200 0x00002af6d3df2348 Yes /usr/local/lib/libsysfs.so.1 >0x00002af6d3ef5b50 0x00002af6d3efc138 Yes 
/usr/local/ib/lib/infiniband/mthca.so >0x00002af6d40006c0 0x00002af6d4005838 Yes /usr/local/ib/lib/libibverbs.so.1 > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From sashak at voltaire.com Mon Jun 19 11:30:46 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 21:30:46 +0300 Subject: [openib-general] [PATCH TRIVIAL] opensm: libibmad: fix umad retry counter Message-ID: <20060619183046.GF5521@sashak.voltaire.com> Hi Hal, This fixes the umad send/recv retry counter in the error report. Signed-off-by: Sasha Khapyorsky --- libibmad/src/rpc.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index a3b29c9..e929ba4 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -132,7 +132,7 @@ _do_madrpc(void *umad, int agentid, int for (retries = 0; retries < madrpc_retries; retries++) { if (retries) { - ERRS("retry %d (timeout %d ms)", retries + 1, timeout); + ERRS("retry %d (timeout %d ms)", retries, timeout); /* Restore user MAD header */ memcpy(&mad->addr, &addr, sizeof addr); } From eitan at mellanox.co.il Mon Jun 19 12:05:11 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 19 Jun 2006 22:05:11 +0300 Subject: [openib-general] [PATCHv5] osm: partition manager force policy Message-ID: <86d5d5ge54.fsf@mtl066.yok.mtl.com> Hi Hal, This is the 5th take, incorporating a fix for Sasha's last reported bug: a bad assignment of used_blocks. This code was run again through my verification flow, and Sasha has run some tests as well.
Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 8113) +++ include/opensm/osm_port.h (working copy) @@ -591,6 +591,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. +* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. + */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. 
+* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 8113) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; + uint16_t max_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used during pkey mgr algorithm only +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* +* max_blocks +* The maximal number of blocks this partition table might hold +* this value is based on node_info (for port 0 or CA) or switch_info +* updated on receiving the node_info or switch_info GetResp +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This objects stores temporary information on pkeys their target block and index +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct _osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous table extracted from the device +* +* index +* The index of the pky within the block +* +* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * 
osm_pkey_tbl_construct @@ -142,7 +190,8 @@ void osm_pkey_tbl_construct( * * SYNOPSIS */ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -209,8 +258,8 @@ osm_pkey_tbl_get_num_blocks( static inline ib_pkey_table_t *osm_pkey_tbl_block_get( const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) { - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); }; /* * p_pkey_tbl @@ -244,16 +293,117 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ -/****f* OpenSM: osm_pkey_tbl_sync_new_blocks + +/****f* OpenSM: osm_pkey_tbl_make_block_pair +* NAME +* osm_pkey_tbl_make_block_pair +* +* DESCRIPTION +* Find or create a pair of "old" and "new" blocks for the +* given block index +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pp_old_block +* [out] Pointer to the old block pointer arg +* +* pp_new_block +* [out] Pointer to the new block pointer arg +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_set_new_entry * NAME -* osm_pkey_tbl_sync_new_blocks +* osm_pkey_tbl_set_new_entry * * DESCRIPTION -* Syncs new_blocks vector content with current pkey table blocks +* stores the given pkey in the "new" blocks array and update +* the "map" to show that on the "old" blocks * * SYNOPSIS */ -void osm_pkey_tbl_sync_new_blocks( +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] 
The block index to use +* +* pkey_idx +* [in] The index within the block +* +* pkey +* [in] PKey to store +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_find_next_free_entry +* NAME +* osm_pkey_find_next_free_entry +* +* DESCRIPTION +* Find the next free entry in the PKey table. Starting at the given +* index and block number. The user should increment pkey_idx before +* next call +* Inspect the "new" blocks array for empty space. +* +* SYNOPSIS +*/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* p_block_idx +* [out] The block index to use +* +* p_pkey_idx +* [out] The index within the block to use +* +* RETURN VALUES +* TRUE if found FALSE if did not find +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_init_new_blocks +* NAME +* osm_pkey_tbl_init_new_blocks +* +* DESCRIPTION +* Initializes new_blocks vector content (clear and allocate) +* +* SYNOPSIS +*/ +void osm_pkey_tbl_init_new_blocks( const osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -263,6 +413,41 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* set the block index and pkey index the given +* pkey is found in. return IB_NOT_FOUND if cound not find +* it, IB_SUCCESS if OK +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. 
+* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set @@ -272,7 +457,8 @@ void osm_pkey_tbl_sync_new_blocks( * * SYNOPSIS */ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl); Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 8113) +++ opensm/osm_pkey.c (working copy) @@ -94,18 +94,22 @@ void osm_pkey_tbl_destroy( /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl) { cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + p_pkey_tbl->used_blocks = 0; + p_pkey_tbl->max_blocks = 0; return(IB_SUCCESS); } /********************************************************************** **********************************************************************/ -void osm_pkey_tbl_sync_new_blocks( +void osm_pkey_tbl_init_new_blocks( IN const osm_pkey_tbl_t *p_pkey_tbl) { ib_pkey_table_t *p_block, *p_new_block; @@ -123,16 +127,31 @@ void osm_pkey_tbl_sync_new_blocks( p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, + b, p_new_block); + } + memset(p_new_block, 0, sizeof(*p_new_block)); - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); +} + 
+/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl) @@ -203,7 +222,138 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ -static boolean_t __osm_match_pkey ( +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block) +{ + if (block_idx >= p_pkey_tbl->max_blocks) return(IB_ERROR); + + if (pp_old_block) + { + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); + if (! *pp_old_block) + { + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_old_block) return(IB_ERROR); + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); + } + } + + if (pp_new_block) + { + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); + if (! 
*pp_new_block) + { + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_new_block) return(IB_ERROR); + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); + } + } + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +/* + store the given pkey in the "new" blocks array + also makes sure the regular block exists. +*/ +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey) +{ + ib_pkey_table_t *p_old_block; + ib_pkey_table_t *p_new_block; + + if (osm_pkey_tbl_make_block_pair( + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) + return( IB_ERROR ); + + p_new_block->pkey_entry[pkey_idx] = pkey; + if (p_pkey_tbl->used_blocks <= block_idx) + p_pkey_tbl->used_blocks = block_idx + 1; + + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx) +{ + ib_pkey_table_t *p_new_block; + + CL_ASSERT(p_block_idx); + CL_ASSERT(p_pkey_idx); + + while ( *p_block_idx < p_pkey_tbl->max_blocks) + { + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) + { + *p_pkey_idx = 0; + (*p_block_idx)++; + if (*p_block_idx >= p_pkey_tbl->max_blocks) + return FALSE; + } + + p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); + + if ( !p_new_block || + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) + return TRUE; + else + (*p_pkey_idx)++; + } + return FALSE; +} + +/********************************************************************** + **********************************************************************/ 
+ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_idx != NULL ); + + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return( IB_SUCCESS ); + } + } + return( IB_NOT_FOUND ); +} + +/********************************************************************** + **********************************************************************/ +static boolean_t +__osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -306,7 +456,8 @@ osm_physp_share_pkey( if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) return TRUE; - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); + return + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); } /********************************************************************** @@ -322,7 +473,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -330,7 +482,8 @@ osm_port_share_pkey( p_physp1 = osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8113) +++ opensm/osm_pkey_mgr.c (working copy) 
@@ -62,6 +62,131 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int +pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. + */ +static void +pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! 
p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey ) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + CL_ASSERT( ib_pkey_get_base(*p_orig_pkey) == ib_pkey_get_base(pkey) ); + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx( + p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index) != IB_SUCCESS) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + **********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = + full ? 
&p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -114,7 +239,8 @@ pkey_mgr_enforce_partition( p_pi->state_info2 = 0; ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.node_guid = + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pi_context.set_method = TRUE; context.pi_context.update_master_sm_base_lid = FALSE; @@ -131,80 +257,132 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block; + osm_pkey_tbl_t *p_pkey_tbl; uint16_t block_index; + uint8_t pkey_index; + uint16_t last_free_block_index = 0; + uint8_t last_free_pkey_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_node = osm_physp_get_node_ptr( p_physp ); + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + osm_log( p_log, OSM_LOG_INFO, + "pkey_mgr_update_port: " + "Max number of blocks reduced from %u to %u " + "for node 0x%016" PRIx64 " port %u\n", + p_pkey_tbl->max_blocks, max_num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } + p_pkey_tbl->max_blocks = max_num_of_blocks; + + 
osm_pkey_tbl_init_new_blocks( p_pkey_tbl ); + p_pkey_tbl->used_blocks = 0; + + /* + process every pending pkey in order - + first must be "updated" last are "new" + */ + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) + { + if (p_pending->is_new == FALSE) + { + block_index = p_pending->block; + pkey_index = p_pending->index; + found = TRUE; + } + else { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + found = osm_pkey_find_next_free_entry(p_pkey_tbl, + &last_free_block_index, + &last_free_pkey_index); + if ( !found ) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to find empty space for new pkey 0x%04x " + "of node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + block_index = last_free_block_index; + pkey_index = last_free_pkey_index++; } } + + if (found) + { + if ( IB_SUCCESS != osm_pkey_tbl_set_new_entry( + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0505: " + "failed to set PKey 0x%04x in block %u idx %u " + "of node 0x%016" PRIx64 " port %u\n", + p_pending->pkey, block_index, pkey_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } - else if ( *p_orig_pkey != pkey ) - { + } + + free( p_pending ); + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } + + /* now look for changes and store */ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - /* we need real block (not just new_block) in order - * to 
resolve block/pkey indices */ block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; - } - } - } + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + if (block && + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) + continue; + + status = pkey_mgr_update_pkey_entry( + p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +395,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( 
peer ) ) return FALSE; p_node = osm_physp_get_node_ptr( peer ); @@ -242,10 +422,26 @@ pkey_mgr_update_peer_port( if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) return FALSE; + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u. Clearing Enforcement bit.\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + enforce = FALSE; + } + if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,24 +451,19 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); - if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) + if ( !peer_block || memcmp( peer_block, block, sizeof( *peer_block ) ) ) { status = 
pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); if ( status == IB_SUCCESS ) ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,10 +473,10 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) + if ( (ret_val == TRUE) && + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_log, OSM_LOG_VERBOSE, + osm_log( p_log, OSM_LOG_DEBUG, "pkey_mgr_update_peer_port: " "pkey table was updated for node 0x%016" PRIx64 " port %u\n", @@ -298,82 +489,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for 
node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - -/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,8 +498,7 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; - + osm_node_t *p_node; CL_ASSERT( p_osm ); OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); @@ -394,32 +508,25 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - 
osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = &p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,8 +535,10 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + p_node = osm_port_get_parent_node( p_port ); + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + pkey_mgr_update_peer_port( + &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) signal = OSM_SIGNAL_DONE_PENDING; From eitan at mellanox.co.il Mon Jun 19 12:12:07 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 19 Jun 2006 22:12:07 +0300 Subject: [openib-general] [PATCH] osm: fix segfault due to unprotected access to InformInfo DB Message-ID: <86bqspgdtk.fsf@mtl066.yok.mtl.com> Hi Hal I have added InformInfo requests to the osmStress simulator flow. Running it overnight exposed a bug as OpenSM segfaulted during osm_report_notice. Some debug shows the following two flows were missing a lock. 
Such that under stress the InformInfo DB was altered while being accessed by the code in osm_report_notice. I have verified the other flows calling osm_report_notice are under a lock. The fixed code is running for a while with no crash so far. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_state_mgr.c =================================================================== --- opensm/osm_state_mgr.c (revision 8113) +++ opensm/osm_state_mgr.c (working copy) @@ -1709,6 +1709,7 @@ __osm_state_mgr_report_new_ports( OSM_LOG_ENTER( p_mgr->p_log, __osm_state_mgr_report_new_ports ); + CL_PLOCK_ACQUIRE( p_mgr->p_lock ); p_port = ( osm_port_t * ) ( cl_list_remove_head( &p_mgr->p_subn->new_ports_list ) ); @@ -1759,6 +1760,7 @@ __osm_state_mgr_report_new_ports( ( osm_port_t * ) ( cl_list_remove_head( &p_mgr->p_subn->new_ports_list ) ); } + CL_PLOCK_RELEASE( p_mgr->p_lock ); OSM_LOG_EXIT( p_mgr->p_log ); } Index: opensm/osm_trap_rcv.c =================================================================== --- opensm/osm_trap_rcv.c (revision 8113) +++ opensm/osm_trap_rcv.c (working copy) @@ -652,7 +652,10 @@ __osm_trap_rcv_process_request( p_ntci->issuer_gid.unicast.interface_id = p_port->guid; } + /* we need a lock here as the InformInfo DB must be stable */ + CL_PLOCK_ACQUIRE( p_rcv->p_lock ); status = osm_report_notice(p_rcv->p_log, p_rcv->p_subn, p_ntci); + CL_PLOCK_RELEASE( p_rcv->p_lock ); if( status != IB_SUCCESS ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, From eitan at mellanox.co.il Mon Jun 19 12:24:41 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 19 Jun 2006 22:24:41 +0300 Subject: [openib-general] A few questions about IBMgtSim In-Reply-To: <44968BEF.9030401@simula.no> References: <44968BEF.9030401@simula.no> Message-ID: <4496F9F9.90101@mellanox.co.il> Hi Sven, Please see my response below: Eitan Sven-Arne Reinemo wrote: > Hi, > > After some testing of IBMgtSim I have a few questions: > > 1) If I try to build topologies using the MTS14400.ibnl as a building > 
> block my simulation fails with a "child process exited abnormally"
> message. I guess this is related to ibdmchk since the ibdmchk log
> contains lots of errors like the following:
>
> -I- Tracing all CA to CA paths for Credit Loops potential ...
> -E- Potential Credit Loop on Path from:H-1/U1/1 to:H-11/U1/1
> Going:Down from:node:0002c9000000007d to:node:0002c9000000006a
> Going:Up from:node:0002c9000000006a to:node:0002c90000000076

This error indicates exactly what it says: the resulting routing has a potential credit loop, as it does not follow an up/down routing scheme. Credit loops can really be generated by the OpenSM on some topologies and can be avoided by adding the -R updn flag, and possibly also --add_guid_file if the SM is not able to recognize the root nodes automatically (e.g. if the topology is highly asymmetric).

> -I- Generating non blocking full link coverage plan
> into:/tmp/ibdmchk.non_block_
> all_links
> -E- After 32 stages some switch ports are still not covered:
> -E- Fail to cover port:system:0002c90000000054/node:0002c90000000054/P15

This means that there is no route that goes through that port, i.e. if you trace the paths from every HCA to every other HCA, you never go through that port.

> I have included two topology files. One that works and one that fails,
> the only difference is that the number of hosts are increased from 18 to
> 20. Also, if I create my own simple ibnl file for a switch with 144 (or
> other sizes) ports I am able to run simulations. Any suggestions to what
> the problem might be?

As described above, the reason is the credit loop potential of the specific topology and routing algorithm used. Please try the -R updn and --add_guid_file options. You can scan the ibmgtsim.guids.txt file to learn the GUIDs assigned to the spine switches.

> 2) The included example ibmgtsim/tests/RhinoBased10K.topo never finishes
> (at least not in 24 hours). Does this work for anyone else? All other
> examples work fine.

I was able to simulate it by: 1.
Decreasing the verbosity, and 2. running the simulator on one machine and the OpenSM on another.

> 3) If I would like to use IBMgtSim with my own (simplified) SM would it
> be straightforward? It looks to me like RunSimTest talks to any SM
> given the correct path, node and port number for location of the SM.

You can use libibmscli.so/.a to integrate your SM with ibmgtsim. This library's API is provided in ibms_client_api.h. It mainly enables connecting to the ibmgtsim server TCP/IP port, declaring the port the SM is attached to, registering to receive some MAD classes/attributes, and sending and receiving MADs.

From halr at voltaire.com  Mon Jun 19 13:22:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Jun 2006 16:22:40 -0400
Subject: [openib-general] [PATCH TRIVIAL] opensm: libibmad: fix umad retry counter
In-Reply-To: <20060619183046.GF5521@sashak.voltaire.com>
References: <20060619183046.GF5521@sashak.voltaire.com>
Message-ID: <1150748559.4391.78544.camel@hal.voltaire.com>

Hi Sasha,

On Mon, 2006-06-19 at 14:30, Sasha Khapyorsky wrote:
> Hi Hal,
>
> This fixes umad send/recv retry counter in error report.
>
> Signed-off-by: Sasha Khapyorsky
> ---

Thanks. Applied.

-- Hal

From halr at voltaire.com  Mon Jun 19 13:39:09 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Jun 2006 16:39:09 -0400
Subject: [openib-general] [PATCH] osm: fix segfault due to unprotected access to InformInfo DB
In-Reply-To: <86bqspgdtk.fsf@mtl066.yok.mtl.com>
References: <86bqspgdtk.fsf@mtl066.yok.mtl.com>
Message-ID: <1150749203.4391.78989.camel@hal.voltaire.com>

Hi Eitan,

On Mon, 2006-06-19 at 15:12, Eitan Zahavi wrote:
> Hi Hal
>
> I have added InformInfo requests to the osmStress simulator flow.
> Running it overnight exposed a bug as OpenSM segfaulted during
> osm_report_notice. Some debug shows the following two flows were
> missing a lock. Such that under stress the InformInfo DB was altered
> while being accessed by the code in osm_report_notice.
> > I have verified the other flows calling osm_report_notice are under a > lock. > > The fixed code is running for a while with no crash so far. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied. -- Hal From ralphc at pathscale.com Mon Jun 19 16:37:30 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 19 Jun 2006 16:37:30 -0700 Subject: [openib-general] [PATCH 1/4] ipath mmaped CQs, QPs, SRQs Message-ID: <1150760250.32252.158.camel@brick.pathscale.com> Here is a set of patches which adds mmapped completion queues and receive queues for the InfiniPath HCA. This required changing some of the core code in order to return HW specific data for the ibv_resize_cq(), ibv_modify_qp(), and ibv_modify_srq(). I have included the minimal changes to mthca and ehca to match the function signature changes and incorporated Roland's review comments on the earlier code posted. The first patch contains the core changes, the second contains mthca and ehca specific changes, the third contains libipathverbs changes, the fourth contains ib_ipath changes. 
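The "HW specific data" plumbing in this series works by extending the generic response structs with a trailing driver area, and by having each driver declare an extended response whose first member is the common struct, so one buffer serves both layers: the core fills the generic fields, the driver reads its private trailing ones. A self-contained toy sketch of that layout convention (base_resize_cq_resp, toy_resize_cq_resp, and core_resize_cq are illustrative stand-ins, not the real ibverbs types):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for ibv_resize_cq_resp: the generic fields the core knows. */
struct base_resize_cq_resp {
	uint32_t cqe;
	uint32_t reserved;
};

/* Stand-in for a driver's extended response (cf. ipath_resize_cq_resp):
 * the base struct must be the first member so the same pointer is valid
 * for both the core and the driver view of the buffer. */
struct toy_resize_cq_resp {
	struct base_resize_cq_resp base;
	uint64_t offset;	/* driver-private, e.g. an mmap offset */
};

/* "Core" layer: fills the generic fields, then copies whatever
 * driver-private payload fits into the remaining bytes of the buffer,
 * mirroring how the kernel writes past the common response header. */
static void core_resize_cq(struct base_resize_cq_resp *resp, size_t resp_size,
			   uint32_t new_cqe, const void *drv, size_t drv_size)
{
	resp->cqe = new_cqe;
	resp->reserved = 0;
	if (resp_size >= sizeof(*resp) + drv_size)
		memcpy((char *)resp + sizeof(*resp), drv, drv_size);
}
```

This is why ibv_cmd_resize_cq() grows a resp/resp_size pair in the patch: the caller (the driver) now owns the buffer and can size it for its extended struct instead of the core allocating only the base one.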
Signed-off-by: Ralph Campbell Index: src/userspace/libibverbs/include/infiniband/driver.h =================================================================== --- src/userspace/libibverbs/include/infiniband/driver.h (revision 8021) +++ src/userspace/libibverbs/include/infiniband/driver.h (working copy) @@ -95,7 +95,8 @@ int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size); + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size); int ibv_cmd_destroy_cq(struct ibv_cq *cq); int ibv_cmd_create_srq(struct ibv_pd *pd, Index: src/userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- src/userspace/libibverbs/include/infiniband/kern-abi.h (revision 8021) +++ src/userspace/libibverbs/include/infiniband/kern-abi.h (working copy) @@ -355,6 +355,8 @@ struct ibv_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ibv_destroy_cq { Index: src/userspace/libibverbs/src/cmd.c =================================================================== --- src/userspace/libibverbs/src/cmd.c (revision 8021) +++ src/userspace/libibverbs/src/cmd.c (working copy) @@ -368,18 +368,18 @@ } int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size) + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size) { - struct ibv_resize_cq_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, &resp, sizeof resp); + IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, resp, resp_size); cmd->cq_handle = cq->handle; cmd->cqe = cqe; if (write(cq->context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; - cq->cqe = resp.cqe; + cq->cqe = resp->cqe; return 0; } Index: src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h 
=================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -275,6 +275,8 @@ struct ib_uverbs_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ib_uverbs_poll_cq { Index: src/linux-kernel/infiniband/include/rdma/ib_verbs.h =================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/include/rdma/ib_verbs.h (working copy) @@ -911,7 +911,8 @@ struct ib_udata *udata); int (*modify_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr, - enum ib_srq_attr_mask srq_attr_mask); + enum ib_srq_attr_mask srq_attr_mask, + struct ib_udata *udata); int (*query_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int (*destroy_srq)(struct ib_srq *srq); @@ -923,7 +924,8 @@ struct ib_udata *udata); int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask); + int qp_attr_mask, + struct ib_udata *udata); int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, Index: src/linux-kernel/infiniband/core/verbs.c =================================================================== --- src/linux-kernel/infiniband/core/verbs.c (revision 8021) +++ src/linux-kernel/infiniband/core/verbs.c (working copy) @@ -231,7 +231,7 @@ struct ib_srq_attr *srq_attr, enum ib_srq_attr_mask srq_attr_mask) { - return srq->device->modify_srq(srq, srq_attr, srq_attr_mask); + return srq->device->modify_srq(srq, srq_attr, srq_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_srq); @@ -547,7 +547,7 @@ struct ib_qp_attr *qp_attr, int qp_attr_mask) { - return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_qp); Index: src/linux-kernel/infiniband/core/uverbs_cmd.c 
=================================================================== --- src/linux-kernel/infiniband/core/uverbs_cmd.c (revision 8021) +++ src/linux-kernel/infiniband/core/uverbs_cmd.c (working copy) @@ -1258,6 +1258,7 @@ int out_len) { struct ib_uverbs_modify_qp cmd; + struct ib_udata udata; struct ib_qp *qp; struct ib_qp_attr *attr; int ret; @@ -1265,6 +1266,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) return -ENOMEM; @@ -1321,7 +1325,7 @@ attr->alt_ah_attr.ah_flags = cmd.alt_dest.is_global ? IB_AH_GRH : 0; attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; - ret = ib_modify_qp(qp, attr, cmd.attr_mask); + ret = qp->device->modify_qp(qp, attr, cmd.attr_mask, &udata); put_qp_read(qp); @@ -1773,6 +1777,7 @@ } ah->uobject = uobj; + uobj->object = ah; ret = idr_add_uobj(&ib_uverbs_ah_idr, uobj); if (ret) goto err_destroy; @@ -2031,6 +2036,7 @@ int out_len) { struct ib_uverbs_modify_srq cmd; + struct ib_udata udata; struct ib_srq *srq; struct ib_srq_attr attr; int ret; @@ -2038,6 +2044,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + srq = idr_read_srq(cmd.srq_handle, file->ucontext); if (!srq) return -EINVAL; @@ -2045,7 +2054,7 @@ attr.max_wr = cmd.max_wr; attr.srq_limit = cmd.srq_limit; - ret = ib_modify_srq(srq, &attr, cmd.attr_mask); + ret = srq->device->modify_srq(srq, &attr, cmd.attr_mask, &udata); put_srq_read(srq); -- Ralph Campbell From ralphc at pathscale.com Mon Jun 19 16:41:51 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 19 Jun 2006 16:41:51 -0700 Subject: [openib-general] [PATCH 2/4] ipath mmaped CQs, QPs, SRQs Message-ID: <1150760512.32252.164.camel@brick.pathscale.com> This patch contains the mthca and ehca specific changes. 
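The change these driver updates track is purely a signature extension: modify_qp()/modify_srq() gain a struct ib_udata * through which the uverbs layer can hand the user's trailing command bytes to the driver, while in-kernel callers (ib_modify_qp, ib_modify_srq) pass NULL and drivers with nothing user-visible to exchange simply ignore it. A minimal sketch of that convention, with toy stand-in names (toy_udata and toy_modify_qp are illustrations, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct ib_udata: the opaque user-space buffer the
 * uverbs layer now threads through to the driver's modify hooks. */
struct toy_udata {
	const void *inbuf;
	void *outbuf;
	size_t inlen;
	size_t outlen;
};

/* Driver hook with the extended signature. A driver that has nothing
 * to report back (as mthca/ehca here) can ignore udata entirely;
 * udata is also NULL for purely in-kernel callers. */
static int toy_modify_qp(int *qp_state, int new_state,
			 struct toy_udata *udata)
{
	*qp_state = new_state;
	if (udata && udata->outlen >= sizeof(uint32_t)) {
		/* hypothetical HW-specific reply, e.g. an mmap key */
		uint32_t key = 0x1234;
		memcpy(udata->outbuf, &key, sizeof key);
	}
	return 0;
}
```

The design choice matches the uverbs_cmd.c hunk in patch 1/4: rather than inventing a new entry point, the existing driver hook is widened and the uverbs path calls qp->device->modify_qp() directly with a populated udata.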
Signed-off-by: Ralph Campbell Index: src/userspace/libmthca/src/verbs.c =================================================================== --- src/userspace/libmthca/src/verbs.c (revision 8021) +++ src/userspace/libmthca/src/verbs.c (working copy) @@ -259,6 +259,7 @@ { struct mthca_cq *cq = to_mcq(ibcq); struct mthca_resize_cq cmd; + struct ibv_resize_cq_resp resp; struct ibv_mr *mr; void *buf; int old_cqe; @@ -292,7 +293,8 @@ old_cqe = ibcq->cqe; cmd.lkey = mr->lkey; - ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd); + ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd, + &resp, sizeof resp); if (ret) { mthca_dereg_mr(mr); free(buf); Index: src/linux-kernel/infiniband/hw/mthca/mthca_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (working copy) @@ -357,7 +357,7 @@ } int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibsrq->device); struct mthca_srq *srq = to_msrq(ibsrq); Index: src/linux-kernel/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -506,7 +506,7 @@ struct ib_srq_attr *attr, struct mthca_srq *srq); void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq); int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata); int mthca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int mthca_max_srq_sge(struct mthca_dev *dev); void mthca_srq_event(struct mthca_dev *dev, u32 srqn, @@ -521,7 +521,8 @@ enum ib_event_type event_type); int 
mthca_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata); int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr); int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, Index: src/linux-kernel/infiniband/hw/mthca/mthca_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (working copy) @@ -522,7 +522,8 @@ return 0; } -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); Index: src/linux-kernel/infiniband/hw/ehca/ehca_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/ehca/ehca_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ehca/ehca_qp.c (working copy) @@ -1288,7 +1288,8 @@ return ret; } -int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata) { int ret = 0; struct ehca_qp *my_qp = NULL; Index: src/linux-kernel/infiniband/hw/ehca/ehca_iverbs.h =================================================================== --- src/linux-kernel/infiniband/hw/ehca/ehca_iverbs.h (revision 8021) +++ src/linux-kernel/infiniband/hw/ehca/ehca_iverbs.h (working copy) @@ -143,7 +143,8 @@ int ehca_destroy_qp(struct ib_qp *qp); -int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int ehca_modify_qp(struct ib_qp 
*ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata); int ehca_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); -- Ralph Campbell From ralphc at pathscale.com Mon Jun 19 16:43:33 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 19 Jun 2006 16:43:33 -0700 Subject: [openib-general] [PATCH 3/4] ipath mmaped CQs, QPs, SRQs Message-ID: <1150760613.32252.166.camel@brick.pathscale.com> This patch contains the libipathverbs specific changes. Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- src/userspace/libipathverbs/src/verbs.c (revision 8021) +++ src/userspace/libipathverbs/src/verbs.c (working copy) @@ -40,11 +40,14 @@ #include #include -#include +#include #include #include +#include +#include #include "ipathverbs.h" +#include "ipath-abi.h" int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr) @@ -83,11 +86,11 @@ struct ibv_pd *pd; pd = malloc(sizeof *pd); - if(!pd) + if (!pd) return NULL; - if(ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, - &resp, sizeof resp)) { + if (ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, + &resp, sizeof resp)) { free(pd); return NULL; } @@ -142,57 +145,159 @@ struct ibv_comp_channel *channel, int comp_vector) { - struct ibv_cq *cq; - struct ibv_create_cq cmd; - struct ibv_create_cq_resp resp; - int ret; + struct ipath_cq *cq; + struct ibv_create_cq cmd; + struct ipath_create_cq_resp resp; + int ret; + size_t size; cq = malloc(sizeof *cq); if (!cq) return NULL; - ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, cq, - &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, + &cq->ibv_cq, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(cq); return NULL; } - return cq; + size = sizeof(struct ipath_cq_wc) + sizeof(struct ipath_wc) * cqe; + cq->queue 
= mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + context->cmd_fd, resp.offset); + if ((void *) cq->queue == MAP_FAILED) { + free(cq); + return NULL; + } + + pthread_spin_init(&cq->lock, PTHREAD_PROCESS_PRIVATE); + return &cq->ibv_cq; } -int ipath_destroy_cq(struct ibv_cq *cq) +int ipath_resize_cq(struct ibv_cq *ibcq, int cqe) { + struct ipath_cq *cq = to_icq(ibcq); + struct ibv_resize_cq cmd; + struct ipath_resize_cq_resp resp; + size_t size; + int ret; + + pthread_spin_lock(&cq->lock); + /* Save the old size so we can unmmap the queue. */ + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + ret = ibv_cmd_resize_cq(ibcq, cqe, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) { + pthread_spin_unlock(&cq->lock); + return ret; + } + (void) munmap(cq->queue, size); + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + cq->queue = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + ibcq->context->cmd_fd, resp.offset); + ret = errno; + pthread_spin_unlock(&cq->lock); + if ((void *) cq->queue == MAP_FAILED) + return ret; + return 0; +} + +int ipath_destroy_cq(struct ibv_cq *ibcq) +{ + struct ipath_cq *cq = to_icq(ibcq); int ret; - ret = ibv_cmd_destroy_cq(cq); + ret = ibv_cmd_destroy_cq(ibcq); if (ret) return ret; + (void) munmap(cq->queue, sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe)); free(cq); return 0; } +int ipath_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +{ + struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *q; + int npolled; + uint32_t tail; + + pthread_spin_lock(&cq->lock); + q = cq->queue; + tail = q->tail; + for (npolled = 0; npolled < ne; ++npolled, ++wc) { + if (tail == q->head) + break; + memcpy(wc, &q->queue[tail], sizeof(*wc)); + if (tail == cq->ibv_cq.cqe) + tail = 0; + else + tail++; + } + q->tail = tail; + pthread_spin_unlock(&cq->lock); + + return npolled; +} + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct 
ibv_qp_init_attr *attr) { - struct ibv_create_qp cmd; - struct ibv_create_qp_resp resp; - struct ibv_qp *qp; - int ret; + struct ibv_create_qp cmd; + struct ipath_create_qp_resp resp; + struct ipath_qp *qp; + int ret; + size_t size; qp = malloc(sizeof *qp); if (!qp) return NULL; - ret = ibv_cmd_create_qp(pd, qp, attr, &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(qp); return NULL; } - return qp; + if (attr->srq) { + qp->rq.size = 0; + qp->rq.max_sge = 0; + qp->rq.rwq = NULL; + } else { + qp->rq.size = attr->cap.max_recv_wr + 1; + qp->rq.max_sge = attr->cap.max_recv_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + qp->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) qp->rq.rwq == MAP_FAILED) { + free(qp); + return NULL; + } + } + + pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &qp->ibv_qp; } +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr) +{ + struct ibv_query_qp cmd; + + return ibv_cmd_query_qp(qp, attr, attr_mask, init_attr, + &cmd, sizeof cmd); +} + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask) { @@ -201,70 +306,196 @@ return ibv_cmd_modify_qp(qp, attr, attr_mask, &cmd, sizeof cmd); } -int ipath_destroy_qp(struct ibv_qp *qp) +int ipath_destroy_qp(struct ibv_qp *ibqp) { + struct ipath_qp *qp = to_iqp(ibqp); int ret; - ret = ibv_cmd_destroy_qp(qp); + ret = ibv_cmd_destroy_qp(ibqp); if (ret) return ret; + if (qp->rq.rwq) { + size_t size; + + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + (void) munmap(qp->rq.rwq, size); + } free(qp); return 0; } +static int 
post_recv(struct ipath_rq *rq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ibv_recv_wr *i; + struct ipath_rwq *rwq; + struct ipath_rwqe *wqe; + uint32_t head; + int n, ret; + + pthread_spin_lock(&rq->lock); + rwq = rq->rwq; + head = rwq->head; + for (i = wr; i; i = i->next) { + if ((unsigned) i->num_sge > rq->max_sge) + goto bad; + wqe = get_rwqe_ptr(rq, head); + if (++head >= rq->size) + head = 0; + if (head == rwq->tail) + goto bad; + wqe->wr_id = i->wr_id; + wqe->num_sge = i->num_sge; + for (n = 0; n < wqe->num_sge; n++) + wqe->sg_list[n] = i->sg_list[n]; + rwq->head = head; + } + ret = 0; + goto done; + +bad: + ret = -ENOMEM; + if (bad_wr) + *bad_wr = i; +done: + pthread_spin_unlock(&rq->lock); + return ret; +} + +int ipath_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_qp *qp = to_iqp(ibqp); + + return post_recv(&qp->rq, wr, bad_wr); +} + struct ibv_srq *ipath_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr) { - struct ibv_srq *srq; + struct ipath_srq *srq; struct ibv_create_srq cmd; - struct ibv_create_srq_resp resp; + struct ipath_create_srq_resp resp; int ret; + size_t size; srq = malloc(sizeof *srq); - if(srq == NULL) + if (srq == NULL) return NULL; - ret = ibv_cmd_create_srq(pd, srq, attr, &cmd, sizeof cmd, - &resp, sizeof resp); + ret = ibv_cmd_create_srq(pd, &srq->ibv_srq, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(srq); return NULL; } - return srq; + srq->rq.size = attr->attr.max_wr + 1; + srq->rq.max_sge = attr->attr.max_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + srq->rq.rwq = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) srq->rq.rwq == MAP_FAILED) { + free(srq); + return NULL; + } + + pthread_spin_init(&srq->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &srq->ibv_srq; } -int 
ipath_modify_srq(struct ibv_srq *srq, +int ipath_modify_srq(struct ibv_srq *ibsrq, struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask) { - struct ibv_modify_srq cmd; + struct ipath_srq *srq = to_isrq(ibsrq); + struct ipath_modify_srq_cmd cmd; + __u64 offset; + size_t size; + int ret; - return ibv_cmd_modify_srq(srq, attr, attr_mask, &cmd, sizeof cmd); + if (attr_mask & IBV_SRQ_MAX_WR) { + pthread_spin_lock(&srq->rq.lock); + /* Save the old size so we can unmmap the queue. */ + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + } + cmd.offset_addr = (__u64) &offset; + ret = ibv_cmd_modify_srq(ibsrq, attr, attr_mask, + &cmd.ibv_cmd, sizeof cmd); + if (ret) { + if (attr_mask & IBV_SRQ_MAX_WR) + pthread_spin_unlock(&srq->rq.lock); + return ret; + } + if (attr_mask & IBV_SRQ_MAX_WR) { + (void) munmap(srq->rq.rwq, size); + srq->rq.size = attr->max_wr + 1; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + srq->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + ibsrq->context->cmd_fd, offset); + pthread_spin_unlock(&srq->rq.lock); + /* XXX Now we have no receive queue. 
*/ + if ((void *) srq->rq.rwq == MAP_FAILED) + return errno; + } + return 0; } -int ipath_destroy_srq(struct ibv_srq *srq) +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr) { + struct ibv_query_srq cmd; + + return ibv_cmd_query_srq(srq, attr, &cmd, sizeof cmd); +} + +int ipath_destroy_srq(struct ibv_srq *ibsrq) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + size_t size; int ret; - ret = ibv_cmd_destroy_srq(srq); + ret = ibv_cmd_destroy_srq(ibsrq); if (ret) return ret; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + (void) munmap(srq->rq.rwq, size); free(srq); return 0; } +int ipath_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + + return post_recv(&srq->rq, wr, bad_wr); +} + struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) { struct ibv_ah *ah; ah = malloc(sizeof *ah); - if(ah == NULL) + if (ah == NULL) return NULL; - if(ibv_cmd_create_ah(pd, ah, attr)) { + if (ibv_cmd_create_ah(pd, ah, attr)) { free(ah); return NULL; } Index: src/userspace/libipathverbs/src/ipathverbs.map =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.map (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.map (working copy) @@ -1,4 +1,4 @@ { - global: openib_driver_init; + global: ibv_driver_init; local: *; }; Index: src/userspace/libipathverbs/src/ipath-abi.h =================================================================== --- src/userspace/libipathverbs/src/ipath-abi.h (revision 0) +++ src/userspace/libipathverbs/src/ipath-abi.h (revision 0) @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2006. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. 
+ */ + +#ifndef IPATH_ABI_H +#define IPATH_ABI_H + +#include + +struct ipath_create_cq_resp { + struct ibv_create_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_resize_cq_resp { + struct ibv_resize_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_create_qp_resp { + struct ibv_create_qp_resp ibv_resp; + __u64 offset; +}; + +struct ipath_create_srq_resp { + struct ibv_create_srq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_modify_srq_cmd { + struct ibv_modify_srq ibv_cmd; + __u64 offset_addr; +}; + +#endif /* IPATH_ABI_H */ Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -86,22 +86,25 @@ .dereg_mr = ipath_dereg_mr, .create_cq = ipath_create_cq, - .poll_cq = ibv_cmd_poll_cq, + .poll_cq = ipath_poll_cq, .req_notify_cq = ibv_cmd_req_notify_cq, .cq_event = NULL, + .resize_cq = ipath_resize_cq, .destroy_cq = ipath_destroy_cq, .create_srq = ipath_create_srq, .modify_srq = ipath_modify_srq, + .query_srq = ipath_query_srq, .destroy_srq = ipath_destroy_srq, - .post_srq_recv = ibv_cmd_post_srq_recv, + .post_srq_recv = ipath_post_srq_recv, .create_qp = ipath_create_qp, + .query_qp = ipath_query_qp, .modify_qp = ipath_modify_qp, .destroy_qp = ipath_destroy_qp, .post_send = ibv_cmd_post_send, - .post_recv = ibv_cmd_post_recv, + .post_recv = ipath_post_recv, .create_ah = ipath_create_ah, .destroy_ah = ipath_destroy_ah, @@ -145,30 +148,24 @@ .free_context = ipath_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { - struct sysfs_device *pcidev; - struct sysfs_attribute *attr; + char value[8]; struct ipath_device *dev; - unsigned vendor, device; - int i; + unsigned vendor, device; + int i; - pcidev = sysfs_get_classdev_device(sysdev); - if 
(!pcidev) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; + sscanf(value, "%i", &vendor); - attr = sysfs_get_device_attr(pcidev, "vendor"); - if (!attr) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) return NULL; - sscanf(attr->value, "%i", &vendor); - sysfs_close_attribute(attr); + sscanf(value, "%i", &device); - attr = sysfs_get_device_attr(pcidev, "device"); - if (!attr) - return NULL; - sscanf(attr->value, "%i", &device); - sysfs_close_attribute(attr); - for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) if (vendor == hca_table[i].vendor && device == hca_table[i].device) @@ -180,13 +177,12 @@ dev = malloc(sizeof *dev); if (!dev) { fprintf(stderr, PFX "Fatal: couldn't allocate device for %s\n", - sysdev->name); - abort(); + uverbs_sys_path); + return NULL; } dev->ibv_dev.ops = ipath_dev_ops; dev->hca_type = hca_table[i].type; - dev->page_size = sysconf(_SC_PAGESIZE); return &dev->ibv_dev; } Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.h (working copy) @@ -39,6 +39,7 @@ #include #include +#include #include #include @@ -57,13 +58,87 @@ struct ipath_device { struct ibv_device ibv_dev; enum ipath_hca_type hca_type; - int page_size; }; struct ipath_context { struct ibv_context ibv_ctx; }; +/* + * This structure needs to have the same size and offsets as + * the kernel's ib_wc structure since it is memory mapped. 
+ */ +struct ipath_wc { + uint64_t wr_id; + enum ibv_wc_status status; + enum ibv_wc_opcode opcode; + uint32_t vendor_err; + uint32_t byte_len; + uint32_t imm_data; /* in network byte order */ + uint32_t qp_num; + uint32_t src_qp; + enum ibv_wc_flags wc_flags; + uint16_t pkey_index; + uint16_t slid; + uint8_t sl; + uint8_t dlid_path_bits; + uint8_t port_num; +}; + +struct ipath_cq_wc { + uint32_t head; + uint32_t tail; + struct ipath_wc queue[1]; +}; + +struct ipath_cq { + struct ibv_cq ibv_cq; + struct ipath_cq_wc *queue; + pthread_spinlock_t lock; +}; + +/* + * Receive work request queue entry. + * The size of the sg_list is determined when the QP is created and stored + * in qp->r_max_sge. + */ +struct ipath_rwqe { + uint64_t wr_id; + uint8_t num_sge; + struct ibv_sge sg_list[0]; +}; + +/* + * This struture is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct ipath_rwq { + uint32_t head; /* new requests posted to the head */ + uint32_t tail; /* receives pull requests from here. 
*/ + struct ipath_rwqe wq[0]; +}; + +struct ipath_rq { + struct ipath_rwq *rwq; + pthread_spinlock_t lock; + uint32_t size; + uint32_t max_sge; +}; + +struct ipath_qp { + struct ibv_qp ibv_qp; + struct ipath_rq rq; +}; + +struct ipath_srq { + struct ibv_srq ibv_srq; + struct ipath_rq rq; +}; + #define to_ixxx(xxx, type) \ ((struct ipath_##type *) \ ((void *) ib##xxx - offsetof(struct ipath_##type, ibv_##xxx))) @@ -73,6 +148,34 @@ return to_ixxx(ctx, context); } +static inline struct ipath_cq *to_icq(struct ibv_cq *ibcq) +{ + return to_ixxx(cq, cq); +} + +static inline struct ipath_qp *to_iqp(struct ibv_qp *ibqp) +{ + return to_ixxx(qp, qp); +} + +static inline struct ipath_srq *to_isrq(struct ibv_srq *ibsrq) +{ + return to_ixxx(srq, srq); +} + +/* + * Since struct ipath_rwqe is not a fixed size, we can't simply index into + * struct ipath_rq.wq. This function does the array index computation. + */ +static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, + unsigned n) +{ + return (struct ipath_rwqe *) + ((char *) rq->rwq->wq + + (sizeof(struct ipath_rwqe) + + rq->max_sge * sizeof(struct ibv_sge)) * n); +} + extern int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr); @@ -92,11 +195,19 @@ struct ibv_comp_channel *channel, int comp_vector); +int ipath_resize_cq(struct ibv_cq *cq, int cqe); + int ipath_destroy_cq(struct ibv_cq *cq); +int ipath_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr); + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); @@ -115,8 +226,12 @@ struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask); +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr); + int ipath_destroy_srq(struct ibv_srq *srq); +int 
ipath_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); -- Ralph Campbell From ralphc at pathscale.com Mon Jun 19 16:45:46 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 19 Jun 2006 16:45:46 -0700 Subject: [openib-general] [PATCH 4/4] ipath mmaped CQs, QPs, SRQs Message-ID: <1150760746.32252.169.camel@brick.pathscale.com> This patch contains the ib_ipath kernel driver specific changes. Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/hw/ipath/ipath_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (working copy) @@ -354,8 +354,10 @@ qp->s_last = 0; qp->s_ssn = 1; qp->s_lsn = 0; - qp->r_rq.head = 0; - qp->r_rq.tail = 0; + if (qp->r_rq.wq) { + qp->r_rq.wq->head = 0; + qp->r_rq.wq->tail = 0; + } qp->r_reuse_sge = 0; } @@ -364,7 +366,7 @@ * @qp: the QP to put into an error state * * Flushes both send and receive work queues. - * QP s_lock should be held. + * QP s_lock should be held and interrupts disabled. 
*/ void ipath_error_qp(struct ipath_qp *qp) @@ -409,15 +411,32 @@ qp->s_hdrwords = 0; qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - wc.opcode = IB_WC_RECV; - spin_lock(&qp->r_rq.lock); - while (qp->r_rq.tail != qp->r_rq.head) { - wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; - if (++qp->r_rq.tail >= qp->r_rq.size) - qp->r_rq.tail = 0; - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + if (qp->r_rq.wq) { + struct ipath_rwq *wq; + u32 head; + u32 tail; + + spin_lock(&qp->r_rq.lock); + + /* sanity check pointers before trusting them */ + wq = qp->r_rq.wq; + head = wq->head; + if (head >= qp->r_rq.size) + head = 0; + tail = wq->tail; + if (tail >= qp->r_rq.size) + tail = 0; + wc.opcode = IB_WC_RECV; + while (tail != head) { + wc.wr_id = get_rwqe_ptr(&qp->r_rq, tail)->wr_id; + if (++tail >= qp->r_rq.size) + tail = 0; + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + } + wq->tail = tail; + + spin_unlock(&qp->r_rq.lock); } - spin_unlock(&qp->r_rq.lock); } /** @@ -425,11 +444,12 @@ * @ibqp: the queue pair who's attributes we're modifying * @attr: the new attributes * @attr_mask: the mask of attributes to modify + * @udata: user data for ipathverbs.so * * Returns 0 on success, otherwise returns an errno. */ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask) + int attr_mask, struct ib_udata *udata) { struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); @@ -542,7 +562,7 @@ attr->dest_qp_num = qp->remote_qpn; attr->qp_access_flags = qp->qp_access_flags; attr->cap.max_send_wr = qp->s_size - 1; - attr->cap.max_recv_wr = qp->r_rq.size - 1; + attr->cap.max_recv_wr = qp->ibqp.srq ? 
0 : qp->r_rq.size - 1; attr->cap.max_send_sge = qp->s_max_sge; attr->cap.max_recv_sge = qp->r_rq.max_sge; attr->cap.max_inline_data = 0; @@ -595,13 +615,23 @@ } else { u32 min, max, x; u32 credits; + struct ipath_rwq *wq = qp->r_rq.wq; + u32 head; + u32 tail; + /* sanity check pointers before trusting them */ + head = wq->head; + if (head >= qp->r_rq.size) + head = 0; + tail = wq->tail; + if (tail >= qp->r_rq.size) + tail = 0; /* * Compute the number of credits available (RWQEs). * XXX Not holding the r_rq.lock here so there is a small * chance that the pair of reads are not atomic. */ - credits = qp->r_rq.head - qp->r_rq.tail; + credits = head - tail; if ((int)credits < 0) credits += qp->r_rq.size; /* @@ -678,27 +708,32 @@ case IB_QPT_UD: case IB_QPT_SMI: case IB_QPT_GSI: - qp = kmalloc(sizeof(*qp), GFP_KERNEL); + sz = sizeof(*qp); + if (!init_attr->srq) + sz += sizeof(*qp->r_sg_list) * + init_attr->cap.max_recv_sge; + qp = kmalloc(sz, GFP_KERNEL); if (!qp) { - vfree(swq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_swq; } if (init_attr->srq) { + sz = 0; qp->r_rq.size = 0; qp->r_rq.max_sge = 0; qp->r_rq.wq = NULL; + init_attr->cap.max_recv_wr = 0; + init_attr->cap.max_recv_sge = 0; } else { qp->r_rq.size = init_attr->cap.max_recv_wr + 1; qp->r_rq.max_sge = init_attr->cap.max_recv_sge; - sz = (sizeof(struct ipath_sge) * qp->r_rq.max_sge) + + sz = (sizeof(struct ib_sge) * qp->r_rq.max_sge) + sizeof(struct ipath_rwqe); - qp->r_rq.wq = vmalloc(qp->r_rq.size * sz); + qp->r_rq.wq = vmalloc(sizeof(struct ipath_rwq) + + qp->r_rq.size * sz); if (!qp->r_rq.wq) { - kfree(qp); - vfree(swq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_qp; } } @@ -724,16 +759,14 @@ err = ipath_alloc_qpn(&dev->qp_table, qp, init_attr->qp_type); if (err) { - vfree(swq); - vfree(qp->r_rq.wq); - kfree(qp); ret = ERR_PTR(err); - goto bail; + goto free_rwq; } + qp->ip = NULL; ipath_reset_qp(qp); /* Tell the core driver that the kernel SMA is present. 
*/ - if (qp->ibqp.qp_type == IB_QPT_SMI) + if (init_attr->qp_type == IB_QPT_SMI) ipath_layer_set_verbs_flags(dev->dd, IPATH_VERBS_KERNEL_SMA); break; @@ -746,8 +779,51 @@ init_attr->cap.max_inline_data = 0; + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) qp->r_rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_rwq; + } + + if (qp->r_rq.wq) { + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_rwq; + } + qp->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = qp->r_rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + qp->r_rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + } + ret = &qp->ibqp; + goto bail; +free_rwq: + vfree(qp->r_rq.wq); +free_qp: + kfree(qp); +free_swq: + vfree(swq); bail: return ret; } @@ -771,11 +847,9 @@ if (qp->ibqp.qp_type == IB_QPT_SMI) ipath_layer_set_verbs_flags(dev->dd, 0); - spin_lock_irqsave(&qp->r_rq.lock, flags); - spin_lock(&qp->s_lock); + spin_lock_irqsave(&qp->s_lock, flags); qp->state = IB_QPS_ERR; - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); + spin_unlock_irqrestore(&qp->s_lock, flags); /* Stop the sending tasklet. 
*/ tasklet_kill(&qp->s_task); @@ -796,8 +870,11 @@ if (atomic_read(&qp->refcount) != 0) ipath_free_qp(&dev->qp_table, qp); + if (qp->ip) + kref_put(&qp->ip->ref, ipath_release_mmap_info); + else + vfree(qp->r_rq.wq); vfree(qp->s_wq); - vfree(qp->r_rq.wq); kfree(qp); return 0; } Index: src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c (working copy) @@ -105,6 +105,54 @@ spin_unlock_irqrestore(&dev->pending_lock, flags); } +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + qp->r_len = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + &qp->r_sg_list[j], &wqe->sg_list[i], + IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + qp->r_len += wqe->sg_list[i].length; + j++; + } + qp->r_sge.sge = qp->r_sg_list[0]; + qp->r_sge.sg_list = qp->r_sg_list + 1; + qp->r_sge.num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. 
*/ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} + /** * ipath_get_rwqe - copy the next RWQE into the QP's RWQE * @qp: the QP @@ -118,73 +166,69 @@ { unsigned long flags; struct ipath_rq *rq; + struct ipath_rwq *wq; struct ipath_srq *srq; struct ipath_rwqe *wqe; + void (*handler)(struct ib_event *, void *); + u32 tail; int ret; - if (!qp->ibqp.srq) { + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; + rq = &srq->rq; + } else { + srq = NULL; + handler = NULL; rq = &qp->r_rq; - spin_lock_irqsave(&rq->lock, flags); + } - if (unlikely(rq->tail == rq->head)) { + spin_lock_irqsave(&rq->lock, flags); + wq = rq->wq; + tail = wq->tail; + do { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); ret = 0; goto bail; } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - goto done; - } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + } while (!wr_id_only && !init_sge(qp, wqe)); + qp->r_wr_id = wqe->wr_id; + wq->tail = tail; - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - spin_lock_irqsave(&rq->lock, flags); - - if (unlikely(rq->tail == rq->head)) { - ret = 0; - goto bail; - } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq->ibsrq.event_handler) { - struct ib_event ev; + ret = 1; + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { + struct ib_event ev; + srq->limit = 0; spin_unlock_irqrestore(&rq->lock, flags); ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); - spin_lock_irqsave(&rq->lock, flags); + handler(&ev, srq->ibsrq.srq_context); + goto bail; } } -done: - ret = 1; + spin_unlock_irqrestore(&rq->lock, flags); bail: - spin_unlock_irqrestore(&rq->lock, flags); return ret; } Index: src/linux-kernel/infiniband/hw/ipath/Makefile =================================================================== --- src/linux-kernel/infiniband/hw/ipath/Makefile (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/Makefile (working copy) @@ -25,6 +25,7 @@ ipath_cq.o \ ipath_keys.o \ ipath_mad.o \ + ipath_mmap.o \ ipath_mr.o \ ipath_qp.o \ ipath_rc.o \ Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c (working copy) @@ -280,11 +280,12 @@ struct ib_recv_wr **bad_wr) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_rwq *wq = qp->r_rq.wq; unsigned long flags; int ret; /* Check that state is OK to post receive. 
*/ - if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK) || !wq) { *bad_wr = wr; ret = -EINVAL; goto bail; @@ -293,59 +294,31 @@ for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; + int i; - if (wr->num_sge > qp->r_rq.max_sge) { + if ((unsigned) wr->num_sge > qp->r_rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&qp->r_rq.lock, flags); - next = qp->r_rq.head + 1; + next = wq->head + 1; if (next >= qp->r_rq.size) next = 0; - if (next == qp->r_rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&qp->r_rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head); + wqe = get_rwqe_ptr(&qp->r_rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(qp->ibqp.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok( - &to_idev(qp->ibqp.device)->lk_table, - &wqe->sg_list[j], &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - qp->r_rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&qp->r_rq.lock, flags); } ret = 0; @@ -694,7 +667,7 @@ ipath_layer_get_lastibcstat(dev->dd) & 0xf]; props->port_cap_flags = dev->port_cap_flags; props->gid_tbl_len = 1; - props->max_msg_sz = 4096; + props->max_msg_sz = 0x80000000; props->pkey_tbl_len = ipath_layer_get_npkeys(dev->dd); props->bad_pkey_cntr = 
ipath_layer_get_cr_errpkey(dev->dd) - dev->z_pkey_violations; @@ -871,7 +844,7 @@ goto bail; } - if (ah_attr->port_num != 1 || + if (ah_attr->port_num < 1 || ah_attr->port_num > pd->device->phys_port_cnt) { ret = ERR_PTR(-EINVAL); goto bail; @@ -883,6 +856,8 @@ goto bail; } + dev->n_ahs_allocated++; + /* ib_create_ah() will initialize ah->ibah. */ ah->attr = *ah_attr; @@ -1137,6 +1112,7 @@ dev->attach_mcast = ipath_multicast_attach; dev->detach_mcast = ipath_multicast_detach; dev->process_mad = ipath_process_mad; + dev->mmap = ipath_mmap; snprintf(dev->node_desc, sizeof(dev->node_desc), IPATH_IDSTR " %s kernel_SMA", system_utsname.nodename); Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (working copy) @@ -37,6 +37,7 @@ #include #include #include +#include #include #include "ipath_layer.h" @@ -177,58 +178,41 @@ }; /* - * Quick description of our CQ/QP locking scheme: - * - * We have one global lock that protects dev->cq/qp_table. Each - * struct ipath_cq/qp also has its own lock. An individual qp lock - * may be taken inside of an individual cq lock. Both cqs attached to - * a qp may be locked, with the send cq locked first. No other - * nesting should be done. - * - * Each struct ipath_cq/qp also has an atomic_t ref count. The - * pointer from the cq/qp_table to the struct counts as one reference. - * This reference also is good for access through the consumer API, so - * modifying the CQ/QP etc doesn't need to take another reference. - * Access because of a completion being polled does need a reference. - * - * Finally, each struct ipath_cq/qp has a wait_queue_head_t for the - * destroy function to sleep on. - * - * This means that access from the consumer API requires nothing but - * taking the struct's lock. 
- * - * Access because of a completion event should go as follows: - * - lock cq/qp_table and look up struct - * - increment ref count in struct - * - drop cq/qp_table lock - * - lock struct, do your thing, and unlock struct - * - decrement ref count; if zero, wake up waiters - * - * To destroy a CQ/QP, we can do the following: - * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock - * - decrement ref count - * - wait_event until ref count is zero - * - * It is the consumer's responsibilty to make sure that no QP - * operations (WQE posting or state modification) are pending when the - * QP is destroyed. Also, the consumer must make sure that calls to - * qp_modify are serialized. - * - * Possible optimizations (wait for profile data to see if/where we - * have locks bouncing between CPUs): - * - split cq/qp table lock into n separate (cache-aligned) locks, - * indexed (say) by the page in the table + * This structure is used by ipath_mmap() to validate an offset + * when an mmap() request is made. The vm_area_struct then uses + * this as its vm_private_data. */ +struct ipath_mmap_info { + struct ipath_mmap_info *next; + struct ib_ucontext *context; + void *obj; + struct kref ref; + unsigned size; + unsigned mmap_cnt; +}; +/* + * This struture is used to contain the head pointer, tail pointer, + * and completion queue entries as a single memory allocation so + * it can be mmap'ed into user space. + */ +struct ipath_cq_wc { + u32 head; /* index of next entry to fill */ + u32 tail; /* index of next ib_poll_cq() entry */ + struct ib_wc queue[1]; /* this is actually size ibcq.cqe + 1 */ +}; + +/* + * The completion queue structure. + */ struct ipath_cq { struct ib_cq ibcq; struct tasklet_struct comptask; spinlock_t lock; u8 notify; u8 triggered; - u32 head; /* new records added to the head */ - u32 tail; /* poll_cq() reads from here. 
*/ - struct ib_wc *queue; /* this is actually ibcq.cqe + 1 */ + struct ipath_cq_wc *queue; + struct ipath_mmap_info *ip; }; /* @@ -247,28 +231,40 @@ /* * Receive work request queue entry. - * The size of the sg_list is determined when the QP is created and stored - * in qp->r_max_sge. + * The size of the sg_list is determined when the QP (or SRQ) is created + * and stored in qp->r_rq.max_sge (or srq->rq.max_sge). */ struct ipath_rwqe { u64 wr_id; - u32 length; /* total length of data in sg_list */ u8 num_sge; - struct ipath_sge sg_list[0]; + struct ib_sge sg_list[0]; }; +/* + * This struture is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct ipath_rwq { + u32 head; /* new work requests posted to the head */ + u32 tail; /* receives pull requests from here. */ + struct ipath_rwqe wq[0]; +}; + struct ipath_rq { + struct ipath_rwq *wq; spinlock_t lock; - u32 head; /* new work requests posted to the head */ - u32 tail; /* receives pull requests from here. 
*/ u32 size; /* size of RWQE array */ u8 max_sge; - struct ipath_rwqe *wq; /* RWQE array */ }; struct ipath_srq { struct ib_srq ibsrq; struct ipath_rq rq; + struct ipath_mmap_info *ip; /* send signal when number of RWQEs < limit */ u32 limit; }; @@ -292,6 +288,7 @@ atomic_t refcount; wait_queue_head_t wait; struct tasklet_struct s_task; + struct ipath_mmap_info *ip; struct ipath_sge_state *s_cur_sge; struct ipath_sge_state s_sge; /* current send request data */ /* current RDMA read send data */ @@ -343,7 +340,8 @@ u32 s_ssn; /* SSN of tail entry */ u32 s_lsn; /* limit sequence number (credit) */ struct ipath_swqe *s_wq; /* send work queue */ - struct ipath_rq r_rq; /* receive work queue */ + struct ipath_rq r_rq; /* receive work queue */ + struct ipath_sge r_sg_list[0]; /* verified SGEs */ }; /* @@ -367,15 +365,15 @@ /* * Since struct ipath_rwqe is not a fixed size, we can't simply index into - * struct ipath_rq.wq. This function does the array index computation. + * struct ipath_rwq.wq. This function does the array index computation. 
*/ static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, unsigned n) { return (struct ipath_rwqe *) - ((char *) rq->wq + + ((char *) rq->wq->wq + (sizeof(struct ipath_rwqe) + - rq->max_sge * sizeof(struct ipath_sge)) * n); + rq->max_sge * sizeof(struct ib_sge)) * n); } /* @@ -415,6 +413,7 @@ struct ib_device ibdev; struct list_head dev_list; struct ipath_devdata *dd; + struct ipath_mmap_info *pending_mmaps; int ib_unit; /* This is the device number */ u16 sm_lid; /* in host order */ u8 sm_sl; @@ -577,7 +576,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask); + int attr_mask, struct ib_udata *udata); int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); @@ -636,7 +635,8 @@ struct ib_udata *udata); int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata); int ipath_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr); @@ -678,6 +678,10 @@ int ipath_dealloc_fmr(struct ib_fmr *ibfmr); +void ipath_release_mmap_info(struct kref *ref); + +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); + void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev); void ipath_insert_rnr_queue(struct ipath_qp *qp); Index: src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c (revision 0) +++ src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c (revision 0) @@ -0,0 +1,147 @@ +/* + * Copyright (c) 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include + +#include "ipath_verbs.h" + +/** + * ipath_release_mmap_info - free mmap info structure + * @ref: a pointer to the kref within struct ipath_mmap_info + */ +void ipath_release_mmap_info(struct kref *ref) +{ + struct ipath_mmap_info *ip = + container_of(ref, struct ipath_mmap_info, ref); + + vfree(ip->obj); + kfree(ip); +} + +/* + * open and close keep track of how many times the CQ is mapped, + * to avoid releasing it. 
+ */ +static void ipath_vma_open(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + kref_get(&ip->ref); + ip->mmap_cnt++; +} + +static void ipath_vma_close(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + ip->mmap_cnt--; + kref_put(&ip->ref, ipath_release_mmap_info); +} + +/* + * ipath_vma_nopage - handle a VMA page fault. + */ +static struct page *ipath_vma_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + unsigned long offset = address - vma->vm_start; + struct page *page = NOPAGE_SIGBUS; + void *pageptr; + + if (offset >= ip->size) + goto out; /* out of range */ + + /* + * Convert the vmalloc address into a struct page. + */ + pageptr = (void *)(offset + (vma->vm_pgoff << PAGE_SHIFT)); + page = vmalloc_to_page(pageptr); + + /* Increment the reference count. */ + get_page(page); + if (type) + *type = VM_FAULT_MINOR; +out: + return page; +} + +static struct vm_operations_struct ipath_vm_ops = { + .open = ipath_vma_open, + .close = ipath_vma_close, + .nopage = ipath_vma_nopage, +}; + +/** + * ipath_mmap - create a new mmap region + * @context: the IB user context of the process making the mmap() call + * @vma: the VMA to be initialized + * Return zero if the mmap is OK. Otherwise, return an errno. + */ +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + struct ipath_ibdev *dev = to_idev(context->device); + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; + unsigned long size = vma->vm_end - vma->vm_start; + struct ipath_mmap_info *ip, **pp; + + /* + * Search the device's list of objects waiting for a mmap call. + * Normally, this list is very short since a call to create a + * CQ, QP, or SRQ is soon followed by a call to mmap(). 
+	 */
+	spin_lock_irq(&dev->pending_lock);
+	for (pp = &dev->pending_mmaps; (ip = *pp); pp = &ip->next) {
+		/* Only the creator is allowed to mmap the object */
+		if (context != ip->context || (void *) offset != ip->obj)
+			continue;
+		/* Don't allow a mmap larger than the object. */
+		if (size > ip->size)
+			break;
+
+		*pp = ip->next;
+		spin_unlock_irq(&dev->pending_lock);
+
+		vma->vm_ops = &ipath_vm_ops;
+		vma->vm_flags |= VM_RESERVED;
+		vma->vm_private_data = ip;
+		ipath_vma_open(vma);
+		return 0;
+	}
+	spin_unlock_irq(&dev->pending_lock);
+	return -EINVAL;
+}
Index: src/linux-kernel/infiniband/hw/ipath/ipath_cq.c
===================================================================
--- src/linux-kernel/infiniband/hw/ipath/ipath_cq.c	(revision 8021)
+++ src/linux-kernel/infiniband/hw/ipath/ipath_cq.c	(working copy)
@@ -41,20 +41,28 @@
  * @entry: work completion entry to add
  * @sig: true if @entry is a solicited entry
  *
- * This may be called with one of the qp->s_lock or qp->r_rq.lock held.
+ * This may be called with qp->s_lock held.
  */
 void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited)
 {
+	struct ipath_cq_wc *wc = cq->queue;
 	unsigned long flags;
+	u32 head;
 	u32 next;
 
 	spin_lock_irqsave(&cq->lock, flags);
 
-	if (cq->head == cq->ibcq.cqe)
+	/*
+	 * Note that the head pointer might be writable by user processes.
+	 * Take care to verify it is a sane value.
+ */ + head = wc->head; + if (head >= (unsigned) cq->ibcq.cqe) { + head = cq->ibcq.cqe; next = 0; - else - next = cq->head + 1; - if (unlikely(next == cq->tail)) { + } else + next = head + 1; + if (unlikely(next == wc->tail)) { spin_unlock_irqrestore(&cq->lock, flags); if (cq->ibcq.event_handler) { struct ib_event ev; @@ -66,8 +74,8 @@ } return; } - cq->queue[cq->head] = *entry; - cq->head = next; + wc->queue[head] = *entry; + wc->head = next; if (cq->notify == IB_CQ_NEXT_COMP || (cq->notify == IB_CQ_SOLICITED && solicited)) { @@ -100,19 +108,20 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) { struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; int npolled; spin_lock_irqsave(&cq->lock, flags); for (npolled = 0; npolled < num_entries; ++npolled, ++entry) { - if (cq->tail == cq->head) + if (wc->tail == wc->head) break; - *entry = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + *entry = wc->queue[wc->tail]; + if (wc->tail >= cq->ibcq.cqe) + wc->tail = 0; else - cq->tail++; + wc->tail++; } spin_unlock_irqrestore(&cq->lock, flags); @@ -159,7 +168,7 @@ { struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_cq *cq; - struct ib_wc *wc; + struct ipath_cq_wc *wc; struct ib_cq *ret; if (entries > ib_ipath_max_cqes) { @@ -172,10 +181,7 @@ goto bail; } - /* - * Need to use vmalloc() if we want to support large #s of - * entries. - */ + /* Allocate the completion queue structure. */ cq = kmalloc(sizeof(*cq), GFP_KERNEL); if (!cq) { ret = ERR_PTR(-ENOMEM); @@ -183,15 +189,54 @@ } /* - * Need to use vmalloc() if we want to support large #s of entries. + * Allocate the completion queue entries and head/tail pointers. + * This is allocated separately so that it can be resized and + * also mapped into user space. + * We need to use vmalloc() in order to support mmap and large + * numbers of entries. 
*/ - wc = vmalloc(sizeof(*wc) * (entries + 1)); + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * entries); if (!wc) { - kfree(cq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_cq; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) wc; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_wc; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_wc; + } + cq->ip = ip; + ip->context = context; + ip->obj = wc; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * entries); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + cq->ip = NULL; + + /* * ib_create_cq() will initialize cq->ibcq except for cq->ibcq.cqe. * The number of entries should be >= the number requested or return * an error. @@ -201,14 +246,18 @@ cq->triggered = 0; spin_lock_init(&cq->lock); tasklet_init(&cq->comptask, send_complete, (unsigned long)cq); - cq->head = 0; - cq->tail = 0; + wc->head = 0; + wc->tail = 0; cq->queue = wc; ret = &cq->ibcq; - dev->n_cqs_allocated++; + goto bail; +free_wc: + vfree(wc); +free_cq: + kfree(cq); bail: return ret; } @@ -228,7 +277,10 @@ tasklet_kill(&cq->comptask); dev->n_cqs_allocated--; - vfree(cq->queue); + if (cq->ip) + kref_put(&cq->ip->ref, ipath_release_mmap_info); + else + vfree(cq->queue); kfree(cq); return 0; @@ -252,7 +304,7 @@ spin_lock_irqsave(&cq->lock, flags); /* * Don't change IB_CQ_NEXT_COMP to IB_CQ_SOLICITED but allow - * any other transitions. + * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2). 
*/ if (cq->notify != IB_CQ_NEXT_COMP) cq->notify = notify; @@ -263,46 +315,81 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); - struct ib_wc *wc, *old_wc; - u32 n; + struct ipath_cq_wc *old_wc = cq->queue; + struct ipath_cq_wc *wc; + u32 head, tail, n; int ret; /* * Need to use vmalloc() if we want to support large #s of entries. */ - wc = vmalloc(sizeof(*wc) * (cqe + 1)); + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * cqe); if (!wc) { ret = -ENOMEM; goto bail; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + __u64 offset = (__u64) wc; + + ret = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (ret) + goto bail; + } + spin_lock_irq(&cq->lock); - if (cq->head < cq->tail) - n = cq->ibcq.cqe + 1 + cq->head - cq->tail; + /* + * Make sure head and tail are sane since they + * might be user writable. + */ + head = old_wc->head; + if (head > (u32) cq->ibcq.cqe) + head = (u32) cq->ibcq.cqe; + tail = old_wc->tail; + if (tail > (u32) cq->ibcq.cqe) + tail = (u32) cq->ibcq.cqe; + if (head < tail) + n = cq->ibcq.cqe + 1 + head - tail; else - n = cq->head - cq->tail; + n = head - tail; if (unlikely((u32)cqe < n)) { spin_unlock_irq(&cq->lock); vfree(wc); ret = -EOVERFLOW; goto bail; } - for (n = 0; cq->tail != cq->head; n++) { - wc[n] = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + for (n = 0; tail != head; n++) { + wc->queue[n] = old_wc->queue[tail]; + if (tail == (u32) cq->ibcq.cqe) + tail = 0; else - cq->tail++; + tail++; } cq->ibcq.cqe = cqe; - cq->head = n; - cq->tail = 0; - old_wc = cq->queue; + wc->head = n; + wc->tail = 0; cq->queue = wc; spin_unlock_irq(&cq->lock); vfree(old_wc); + if (cq->ip) { + struct ipath_ibdev *dev = to_idev(ibcq->device); + struct ipath_mmap_info *ip = cq->ip; + + ip->obj = wc; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * cqe); + 
spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + ret = 0; bail: Index: src/linux-kernel/infiniband/hw/ipath/ipath_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (working copy) @@ -47,66 +47,38 @@ struct ib_recv_wr **bad_wr) { struct ipath_srq *srq = to_isrq(ibsrq); - struct ipath_ibdev *dev = to_idev(ibsrq->device); + struct ipath_rwq *wq; unsigned long flags; int ret; for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; + int i; - if (wr->num_sge > srq->rq.max_sge) { + if ((unsigned) wr->num_sge > srq->rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&srq->rq.lock, flags); - next = srq->rq.head + 1; + wq = srq->rq.wq; + next = wq->head + 1; if (next >= srq->rq.size) next = 0; - if (next == srq->rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&srq->rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&srq->rq, srq->rq.head); + wqe = get_rwqe_ptr(&srq->rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(srq->ibsrq.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok(&dev->lk_table, - &wqe->sg_list[j], - &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - srq->rq.head = next; + for (i = 0; i < wr->num_sge; i++) + 
wqe->sg_list[i] = wr->sg_list[i]; + wqe->num_sge = wr->num_sge; + wq->head = next; spin_unlock_irqrestore(&srq->rq.lock, flags); } ret = 0; @@ -156,28 +128,67 @@ * Need to use vmalloc() if we want to support large #s of entries. */ srq->rq.size = srq_init_attr->attr.max_wr + 1; - sz = sizeof(struct ipath_sge) * srq_init_attr->attr.max_sge + + srq->rq.max_sge = srq_init_attr->attr.max_sge; + sz = sizeof(struct ib_sge) * srq->rq.max_sge + sizeof(struct ipath_rwqe); - srq->rq.wq = vmalloc(srq->rq.size * sz); + srq->rq.wq = vmalloc(sizeof(struct ipath_rwq) + srq->rq.size * sz); if (!srq->rq.wq) { - kfree(srq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_srq; } /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) srq->rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_rwq; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_rwq; + } + srq->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = srq->rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + srq->rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + srq->ip = NULL; + + /* * ib_create_srq() will initialize srq->ibsrq. 
*/ spin_lock_init(&srq->rq.lock); - srq->rq.head = 0; - srq->rq.tail = 0; - srq->rq.max_sge = srq_init_attr->attr.max_sge; + srq->rq.wq->head = 0; + srq->rq.wq->tail = 0; srq->limit = srq_init_attr->attr.srq_limit; + dev->n_srqs_allocated++; + ret = &srq->ibsrq; + goto bail; - dev->n_srqs_allocated++; - +free_rwq: + vfree(srq->rq.wq); +free_srq: + kfree(srq); bail: return ret; } @@ -187,83 +198,137 @@ * @ibsrq: the SRQ to modify * @attr: the new attributes of the SRQ * @attr_mask: indicates which attributes to modify + * @udata: user data for ipathverbs.so */ int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata) { struct ipath_srq *srq = to_isrq(ibsrq); - unsigned long flags; - int ret; + int ret = 0; - if (attr_mask & IB_SRQ_MAX_WR) + if (attr_mask & IB_SRQ_MAX_WR) { + struct ipath_rwq *owq; + struct ipath_rwq *wq; + struct ipath_rwqe *p; + u32 sz, size, n, head, tail; + + /* + * Check that the requested sizes are below the limits + * and that user/kernel SRQs are only resized by the + * user/kernel. 
+ */ if ((attr->max_wr > ib_ipath_max_srq_wrs) || - (attr->max_sge > srq->rq.max_sge)) { + (!udata != !srq->ip) || + ((attr_mask & IB_SRQ_LIMIT) && + attr->srq_limit > attr->max_wr) || + (!(attr_mask & IB_SRQ_LIMIT) && + srq->limit > attr->max_wr)) { ret = -EINVAL; goto bail; } - if (attr_mask & IB_SRQ_LIMIT) - if (attr->srq_limit >= srq->rq.size) { - ret = -EINVAL; - goto bail; - } - - if (attr_mask & IB_SRQ_MAX_WR) { - struct ipath_rwqe *wq, *p; - u32 sz, size, n; - sz = sizeof(struct ipath_rwqe) + - attr->max_sge * sizeof(struct ipath_sge); + srq->rq.max_sge * sizeof(struct ib_sge); size = attr->max_wr + 1; - wq = vmalloc(size * sz); + wq = vmalloc(sizeof(struct ipath_rwq) + size * sz); if (!wq) { ret = -ENOMEM; goto bail; } - spin_lock_irqsave(&srq->rq.lock, flags); - if (srq->rq.head < srq->rq.tail) - n = srq->rq.size + srq->rq.head - srq->rq.tail; + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + __u64 offset_addr; + __u64 offset = (__u64) wq; + + ret = ib_copy_from_udata(&offset_addr, udata, + sizeof(offset_addr)); + if (ret) { + vfree(wq); + goto bail; + } + udata->outbuf = (void __user *) offset_addr; + ret = ib_copy_to_udata(udata, &offset, + sizeof(offset)); + if (ret) { + vfree(wq); + goto bail; + } + } + + spin_lock_irq(&srq->rq.lock); + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + owq = srq->rq.wq; + head = owq->head; + if (head >= srq->rq.size) + head = 0; + tail = owq->tail; + if (tail >= srq->rq.size) + tail = 0; + n = head; + if (n < tail) + n += srq->rq.size - tail; else - n = srq->rq.head - srq->rq.tail; - if (size <= n || size <= srq->limit) { - spin_unlock_irqrestore(&srq->rq.lock, flags); + n -= tail; + if (size <= n) { + spin_unlock_irq(&srq->rq.lock); vfree(wq); ret = -EINVAL; goto bail; } n = 0; - p = wq; - while (srq->rq.tail != srq->rq.head) { + p = wq->wq; + while (tail != head) { struct ipath_rwqe *wqe; int i; - wqe = get_rwqe_ptr(&srq->rq, srq->rq.tail); + wqe = get_rwqe_ptr(&srq->rq, tail); p->wr_id = wqe->wr_id; - p->length = wqe->length; p->num_sge = wqe->num_sge; for (i = 0; i < wqe->num_sge; i++) p->sg_list[i] = wqe->sg_list[i]; n++; p = (struct ipath_rwqe *)((char *) p + sz); - if (++srq->rq.tail >= srq->rq.size) - srq->rq.tail = 0; + if (++tail >= srq->rq.size) + tail = 0; } - vfree(srq->rq.wq); srq->rq.wq = wq; srq->rq.size = size; - srq->rq.head = n; - srq->rq.tail = 0; - srq->rq.max_sge = attr->max_sge; - spin_unlock_irqrestore(&srq->rq.lock, flags); - } + wq->head = n; + wq->tail = 0; + if (attr_mask & IB_SRQ_LIMIT) + srq->limit = attr->srq_limit; + spin_unlock_irq(&srq->rq.lock); - if (attr_mask & IB_SRQ_LIMIT) { - spin_lock_irqsave(&srq->rq.lock, flags); - srq->limit = attr->srq_limit; - spin_unlock_irqrestore(&srq->rq.lock, flags); + vfree(owq); + + if (srq->ip) { + struct ipath_mmap_info *ip = srq->ip; + struct ipath_ibdev *dev = to_idev(srq->ibsrq.device); + + ip->obj = wq; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + } else if (attr_mask & IB_SRQ_LIMIT) { + spin_lock_irq(&srq->rq.lock); + if (attr->srq_limit >= srq->rq.size) + ret = -EINVAL; + else + srq->limit = attr->srq_limit; + spin_unlock_irq(&srq->rq.lock); } - ret = 0; bail: return 
ret; @@ -289,7 +354,10 @@ struct ipath_ibdev *dev = to_idev(ibsrq->device); dev->n_srqs_allocated--; - vfree(srq->rq.wq); + if (srq->ip) + kref_put(&srq->ip->ref, ipath_release_mmap_info); + else + vfree(srq->rq.wq); kfree(srq); return 0; Index: src/linux-kernel/infiniband/hw/ipath/ipath_ud.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_ud.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_ud.c (working copy) @@ -35,6 +35,53 @@ #include "ipath_verbs.h" #include "ips_common.h" +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe, + u32 *lengthp, struct ipath_sge_state *ss) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + *lengthp = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + j ? &ss->sg_list[j - 1] : &ss->sge, + &wqe->sg_list[i], IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + *lengthp += wqe->sg_list[i].length; + j++; + } + ss->num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} + /** * ipath_ud_loopback - handle send on loopback QPs * @sqp: the QP @@ -45,6 +92,8 @@ * * This is called from ipath_post_ud_send() to forward a WQE addressed * to the same HCA. + * Note that the receive interrupt handler may be calling ipath_ud_rcv() + * while this is being called. 
*/ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, @@ -59,7 +108,11 @@ struct ipath_srq *srq; struct ipath_sge_state rsge; struct ipath_sge *sge; + struct ipath_rwq *wq; struct ipath_rwqe *wqe; + void (*handler)(struct ib_event *, void *); + u32 tail; + u32 rlen; qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn); if (!qp) @@ -93,6 +146,13 @@ wc->imm_data = 0; } + if (wr->num_sge > 1) { + rsge.sg_list = kmalloc((wr->num_sge - 1) * + sizeof(struct ipath_sge), + GFP_ATOMIC); + } else + rsge.sg_list = NULL; + /* * Get the next work request entry to find where to put the data. * Note that it is safe to drop the lock after changing rq->tail @@ -100,37 +160,52 @@ */ if (qp->ibqp.srq) { srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; rq = &srq->rq; } else { srq = NULL; + handler = NULL; rq = &qp->r_rq; } + spin_lock_irqsave(&rq->lock, flags); - if (rq->tail == rq->head) { - spin_unlock_irqrestore(&rq->lock, flags); - dev->n_pkt_drops++; - goto done; + wq = rq->wq; + tail = wq->tail; + while (1) { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto free_sge; + } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + if (init_sge(qp, wqe, &rlen, &rsge)) + break; + wq->tail = tail; } /* Silently drop packets which are too big. */ - wqe = get_rwqe_ptr(rq, rq->tail); - if (wc->byte_len > wqe->length) { + if (wc->byte_len > rlen) { spin_unlock_irqrestore(&rq->lock, flags); dev->n_pkt_drops++; - goto done; + goto free_sge; } + wq->tail = tail; wc->wr_id = wqe->wr_id; - rsge.sge = wqe->sg_list[0]; - rsge.sg_list = wqe->sg_list + 1; - rsge.num_sge = wqe->num_sge; - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq && srq->ibsrq.event_handler) { + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { struct ib_event ev; @@ -139,12 +214,12 @@ ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); + handler(&ev, srq->ibsrq.srq_context); } else spin_unlock_irqrestore(&rq->lock, flags); } else spin_unlock_irqrestore(&rq->lock, flags); + ah_attr = &to_iah(wr->wr.ud.ah)->attr; if (ah_attr->ah_flags & IB_AH_GRH) { ipath_copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh)); @@ -195,6 +270,8 @@ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, wr->send_flags & IB_SEND_SOLICITED); +free_sge: + kfree(rsge.sg_list); done: if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); @@ -432,13 +509,9 @@ int opcode; u32 hdrsize; u32 pad; - unsigned long flags; struct ib_wc wc; u32 qkey; u32 src_qp; - struct ipath_rq *rq; - struct ipath_srq *srq; - struct ipath_rwqe *wqe; u16 dlid; int header_in_data; @@ -546,19 +619,10 @@ /* * Get the next work request entry to find where to put the data. - * Note that it is safe to drop the lock after changing rq->tail - * since ipath_post_receive() won't fill the empty slot. */ - if (qp->ibqp.srq) { - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - } else { - srq = NULL; - rq = &qp->r_rq; - } - spin_lock_irqsave(&rq->lock, flags); - if (rq->tail == rq->head) { - spin_unlock_irqrestore(&rq->lock, flags); + if (qp->r_reuse_sge) + qp->r_reuse_sge = 0; + else if (!ipath_get_rwqe(qp, 0)) { /* * Count VL15 packets dropped due to no receive buffer. * Otherwise, count them as buffer overruns since usually, @@ -572,39 +636,11 @@ goto bail; } /* Silently drop packets which are too big. 
*/ - wqe = get_rwqe_ptr(rq, rq->tail); - if (wc.byte_len > wqe->length) { - spin_unlock_irqrestore(&rq->lock, flags); + if (wc.byte_len > qp->r_len) { + qp->r_reuse_sge = 1; dev->n_pkt_drops++; goto bail; } - wc.wr_id = wqe->wr_id; - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq && srq->ibsrq.event_handler) { - u32 n; - - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; - else - n = rq->head - rq->tail; - if (n < srq->limit) { - struct ib_event ev; - - srq->limit = 0; - spin_unlock_irqrestore(&rq->lock, flags); - ev.device = qp->ibqp.device; - ev.element.srq = qp->ibqp.srq; - ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); - } else - spin_unlock_irqrestore(&rq->lock, flags); - } else - spin_unlock_irqrestore(&rq->lock, flags); if (has_grh) { ipath_copy_sge(&qp->r_sge, &hdr->u.l.grh, sizeof(struct ib_grh)); @@ -613,6 +649,7 @@ ipath_skip_sge(&qp->r_sge, sizeof(struct ib_grh)); ipath_copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh)); + wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; wc.vendor_err = 0; -- Ralph Campbell From amit_byron at yahoo.com Mon Jun 19 17:36:46 2006 From: amit_byron at yahoo.com (Amit Byron) Date: Tue, 20 Jun 2006 00:36:46 +0000 (UTC) Subject: [openib-general] =?utf-8?q?ib=5Fgid_lookup?= Message-ID: hello, i'm trying to find whether i can do a lookup of ib_gid by either node name or node's ip address. is this information available from the subnet manager? thanks, Amit. 
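Stepping back to Ralph Campbell's ipath patch above: once the CQ ring and its head/tail indices live in memory that is mmap()ed writable into user space, the kernel can no longer trust those indices, so ipath_cq_enter() and ipath_resize_cq() clamp them before use. A minimal user-space sketch of that pattern follows; the helper names are hypothetical, not part of the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an index read from untrusted shared memory to the valid range.
 * 'cqe' is the largest legal index; the ring holds cqe + 1 slots. */
static uint32_t sanitize_index(uint32_t idx, uint32_t cqe)
{
    return idx > cqe ? cqe : idx;
}

/* Advance a ring index, wrapping from cqe back to slot 0, as the
 * patched ipath_cq_enter() does when computing 'next'. */
static uint32_t advance_index(uint32_t idx, uint32_t cqe)
{
    return idx >= cqe ? 0 : idx + 1;
}
```

The property this buys is that a hostile value in the shared head or tail can at worst drop or overwrite a completion entry; it can never steer a store outside the ring's bounds in kernel memory.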
From rjwalsh at pathscale.com Mon Jun 19 18:34:44 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 19 Jun 2006 18:34:44 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: <20060613051149.GE4621@mellanox.co.il> <1150223140.11881.2.camel@hematite.internal.keyresearch.com> Message-ID: <1150767284.15618.59.camel@hematite.internal.keyresearch.com> On Fri, 2006-06-16 at 15:07 -0700, Roland Dreier wrote: > Robert, can you confirm that the new uverbs locking scheme helps the > performance problems you're having? Yup - that was a big help. Thanks! Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Mon Jun 19 20:46:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 Jun 2006 20:46:23 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: <1150767284.15618.59.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Mon, 19 Jun 2006 18:34:44 -0700") References: <20060613051149.GE4621@mellanox.co.il> <1150223140.11881.2.camel@hematite.internal.keyresearch.com> <1150767284.15618.59.camel@hematite.internal.keyresearch.com> Message-ID: > > Robert, can you confirm that the new uverbs locking scheme helps the > > performance problems you're having? > Yup - that was a big help. Thanks! Good, because it's upstream now... 
From panda at cse.ohio-state.edu Mon Jun 19 21:06:33 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue, 20 Jun 2006 00:06:33 -0400 (EDT) Subject: [openib-general] MVAPICH and librdmacm In-Reply-To: <1150738038.26165.5.camel@stevo-desktop> from "Steve Wise" at Jun 19, 2006 12:27:18 PM Message-ID: <200606200406.k5K46XcM029580@xi.cse.ohio-state.edu> Steve, > Anybody working on porting the MVAPICH code to use the RDMA CM for > connection setup? Just wondering how much work is needed to make > MVAPICH run on the iwarp devices. Thanks for your note. We are taking a look at these issues. During the next few weeks, we will have a complete understanding of these issues and plan to make some concrete progress. We will keep you and the OpenIB community updated about our progress. Thanks, DK > Thanks, > > > Steve. > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Mon Jun 19 21:27:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 Jun 2006 21:27:09 -0700 Subject: [openib-general] iSER updates Message-ID: Linus pulled my for-2.6.18 branch today, so the RDMA CM is upstream now. He still has not pulled scsi-misc-2.6 so AFAIK there is still more required before we can merge iSER. However, I had to add the patch below to keep iSER compiling in my tree, since <scsi/scsi_request.h> is now gone from the upstream kernel. Also, I notice that there seem to be a few changes to the svn tree that you have not sent to me for my git tree (r7784 and r7924 at least). Please send me a patch if there are updates to my git tree that you would like before it gets merged... - R. 
diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c index c051b65..7099e94 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.c +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c @@ -68,7 +68,6 @@ #include #include #include #include -#include #include #include #include From ogerlitz at voltaire.com Mon Jun 19 21:54:25 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 07:54:25 +0300 Subject: [openib-general] iSER updates In-Reply-To: References: Message-ID: <44977F81.9080206@voltaire.com> Roland Dreier wrote: > Linus pulled my for-2.6.18 branch today, so the RDMA CM is upstream > now. He still has not pulled scsi-misc-2.6 so AFAIK there is still > more required before we can merge iSER. > > However, I had to add the patch below to keep iSER compiling in my > tree, since <scsi/scsi_request.h> is now gone from the upstream > kernel. Also, I notice that there seem to be a few changes to the svn > tree that you have not sent to me for my git tree (r7784 and r7924 at > least). Please send me a patch if there are updates to my git tree > that you would like before it gets merged... I was aware that scsi/scsi_request.h was killed by Christoph, but as iSER kept compiling under my copy of your tree with James's tree pulled into it, I had not noticed the breakage... I guess the reason for that was an update by James which I missed. Anyway, thanks for catching that; I will update the SVN. As for the two updates in the SVN since my last patches were sent to you, these are two bug fixes which can go to 2.6.18-rc2, but as I understand it's fine with you to add them into what's pushed for 2.6.18-rc1, I will send them today. Or. 
> diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c > b/drivers/infiniband/ulp/iser/iscsi_iser.c > index c051b65..7099e94 100644 > --- a/drivers/infiniband/ulp/iser/iscsi_iser.c > +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c > @@ -68,7 +68,6 @@ #include > #include > #include > #include > -#include > #include > #include > #include > From krkumar2 at in.ibm.com Mon Jun 19 22:22:19 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 20 Jun 2006 10:52:19 +0530 Subject: [openib-general] [PATCH] Remove redundant uninitialized warning Message-ID: This removes a compile warning : "is_ud might be used uninitialized in this function". Signed-off-by: Krishna Kumar --- diff -ruNp 1/core/uverbs_cmd.c 2/core/uverbs_cmd.c --- 1/core/uverbs_cmd.c 2006-06-20 10:14:46.000000000 +0530 +++ 2/core/uverbs_cmd.c 2006-06-20 10:23:50.000000000 +0530 @@ -1530,7 +1530,6 @@ ssize_t ib_uverbs_post_send(struct ib_uv out_put: put_qp_read(qp); -out: while (wr) { if (is_ud && wr->wr.ud.ah) put_ah_read(wr->wr.ud.ah); @@ -1539,6 +1538,7 @@ out: wr = next; } +out: kfree(user_wr); return ret ? ret : in_len; -------------- next part -------------- A non-text attachment was scrubbed... Name: diff. 
Type: application/octet-stream Size: 449 bytes Desc: not available URL: From rdreier at cisco.com Mon Jun 19 23:31:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 Jun 2006 23:31:58 -0700 Subject: [openib-general] [PATCH] Remove redundant uninitialized warning In-Reply-To: (Krishna Kumar2's message of "Tue, 20 Jun 2006 10:52:19 +0530") References: Message-ID: Thanks, applied and queued for 2.6.18 From ogerlitz at voltaire.com Tue Jun 20 00:09:08 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 10:09:08 +0300 Subject: [openib-general] iSER updates In-Reply-To: <44977F81.9080206@voltaire.com> References: <44977F81.9080206@voltaire.com> Message-ID: <44979F14.9000905@voltaire.com> Or Gerlitz wrote: > Roland Dreier wrote: >> However, I had to add the patch below to keep iSER compiling in my >> tree, since <scsi/scsi_request.h> is now gone from the upstream >> kernel. I see that the patch is applied at the for-mm branch but not at the iser branch, is it fine? Or. From ogerlitz at voltaire.com Tue Jun 20 00:23:25 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 10:23:25 +0300 Subject: [openib-general] dapltest gets segfaulted in librdmacm init In-Reply-To: <4496E9EF.1090607@ichips.intel.com> References: <4496E9EF.1090607@ichips.intel.com> Message-ID: <4497A26D.9090606@voltaire.com> Arlin Davis wrote: > Or Gerlitz wrote: > >> After fixing the ucma/port space issue with the calls to rdma_create_id i >> am now trying to run >> >> $ ./Target/dapltest -T S -D OpenIB-cma >> >> and getting an immediate segfault with the below trace, any idea? >> >> > Hmm, no idea. I just updated to 8112 and everything runs fine for me > (2.6.17). OK, sorry, I suspect I had some inconsistency between libibverbs and libmthca; after recompiling & installing them, things now work fine. Or. 
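The uverbs post_send patch above works by moving the `out:` label below the cleanup loop, so that the early error path (taken before `is_ud` is assigned) jumps past the loop instead of running it with an uninitialized flag. A hypothetical stand-alone sketch of that control-flow shape, with invented names, might look like:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the label reordering: on the early exit path, execution
 * must bypass the cleanup loop, because that loop reads state (is_ud
 * here) that is only meaningful on the normal path. */
struct fake_wr { struct fake_wr *next; int has_ah; };

static int drain_list(struct fake_wr *list, int is_ud)
{
    int puts = 0;

    if (list == NULL)
        goto out;               /* early exit: the loop never runs */

    /* cleanup loop, analogous to the put_ah_read() loop in post_send */
    while (list) {
        if (is_ud && list->has_ah)
            puts++;
        list = list->next;
    }
out:
    return puts;
}
```

With the label placed after the loop, the early `goto out` can no longer reach code that depends on `is_ud`, which is exactly why the compiler warning disappears without initializing the variable.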
From ogerlitz at voltaire.com Tue Jun 20 00:25:55 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 10:25:55 +0300 Subject: [openib-general] dapltest gets segfaulted in librdmacm init In-Reply-To: References: Message-ID: <4497A303.8090806@voltaire.com> James Lentini wrote: > I don't see this. > > The gdb sharedlibrary output looks suspicious. /usr/local/ib isn't a > standard path for our binaries. I have added --prefix=/usr/local/ib to the configure input; we do it all the time to test multiple things. Or. From tziporet at mellanox.co.il Tue Jun 20 01:47:48 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 20 Jun 2006 11:47:48 +0300 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: References: <1150324203.10676.17.camel@chalcedony.pathscale.com> Message-ID: <4497B634.2070704@mellanox.co.il> Paul wrote: > Michael, > I performed the same work-around in bash (not so good with perl > these days) and it gets past the prior point. Thanks. Should something > that takes care of this be included in the build.sh or build_env.sh > scripts? We would certainly need it covered in the docs at least. > > Now the build is dying on some undefined references. (log attached) > > Regards. > I will ask Vlad to look into it. 
Tziporet From tziporet at mellanox.co.il Tue Jun 20 01:57:08 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 20 Jun 2006 11:57:08 +0300 Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 In-Reply-To: <44930BAA.6030300@sgi.com> References: <44930BAA.6030300@sgi.com> Message-ID: <4497B864.7040107@mellanox.co.il> John Partridge wrote: > I am trying to run the example from MPI_README.txt (and other MPI apps > like pallas), but I keep getting a Couldn't modify SRQ limit error > message :- > > mig129:~/OFED-1.0-pre1 # > /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/bin/mpirun_rsh -rsh -np 2 > -hostfile /root/cluster > /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/tests/osutests-1.0/bw 1000 16 > [1] Abort: Couldn't modify SRQ limit > at line 995 in file viainit.c > mpirun_rsh: Abort signaled from [1] > [0] Abort: [mig125:0] Got completion with error, code=12 > at line 2143 in file viacheck.c > done. > > I am using OFED-1.0-pre1 (kernel modules are from OFED-1.0-pre1 also) > OS is SLES10 SUSE Linux Enterprise Server 10 (ia64) VERSION = 10 > > HW is SGI Altix ia64 > > Can anyone help please ? > > Thanks > John > > I guess you use older FW version. See in osu_mpi release notes: - For users of Mellanox Technologies firmware fw-23108 or fw-25208 only: OSU MPI may fail in its default configuration if your HCA is burnt with an fw-23108 version that is earlier than 3.4.000, or with an fw-25208 version 4.7.400 or earlier. Workaround: Option 1 - Update the firmware. 
Option 2 - In mvapich.conf, set VIADEV_SRQ_ENABLE=0 Tziporet From ogerlitz at voltaire.com Tue Jun 20 02:33:49 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 12:33:49 +0300 (IDT) Subject: [openib-general] [PATCH 1/2] IB/iser: don't access sc->request_buffer when sc->request_bufflen is zero Message-ID: calling scsi_init_one on sc->request_buffer when sc->request_bufflen is zero is unsafe Signed-off-by: Or Gerlitz Index: infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c =================================================================== --- infiniband-git.orig/drivers/infiniband/ulp/iser/iser_initiator.c 2006-06-20 12:26:17.000000000 +0300 +++ infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c 2006-06-20 12:27:42.000000000 +0300 @@ -391,7 +391,8 @@ if (sc->use_sg) { /* using a scatter list */ data_buf->buf = sc->request_buffer; data_buf->size = sc->use_sg; - } else { /* using a single buffer - convert it into one entry SG */ + } else if (sc->request_bufflen) { + /* using a single buffer - convert it into one entry SG */ sg_init_one(&data_buf->sg_single, sc->request_buffer, sc->request_bufflen); data_buf->buf = &data_buf->sg_single; From ogerlitz at voltaire.com Tue Jun 20 02:35:51 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 12:35:51 +0300 (IDT) Subject: [openib-general] [PATCH 2/2] IB/iser: bugfix for the reconnect flow In-Reply-To: References: Message-ID: for iscsi reconnect flow the sequence of calls would be conn stop/bind/start i.e conn create is not called; fixed the post receive code to take that into account, also moved setting conn->recv_lock into conn bind which is called for both connect and reconnect flows. 
Signed-off-by: Erez Zilber Signed-off-by: Or Gerlitz Index: infiniband-git/drivers/infiniband/ulp/iser/iscsi_iser.c =================================================================== --- infiniband-git.orig/drivers/infiniband/ulp/iser/iscsi_iser.c 2006-06-20 12:27:42.000000000 +0300 +++ infiniband-git/drivers/infiniband/ulp/iser/iscsi_iser.c 2006-06-20 12:28:14.000000000 +0300 @@ -311,8 +311,6 @@ /* currently this is the only field which need to be initiated */ rwlock_init(&iser_conn->lock); - conn->recv_lock = &iser_conn->lock; - conn->dd_data = iser_conn; iser_conn->iscsi_conn = conn; @@ -363,6 +361,8 @@ ib_conn->iser_conn = iser_conn; iser_conn->ib_conn = ib_conn; + conn->recv_lock = &iser_conn->lock; + return 0; } Index: infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c =================================================================== --- infiniband-git.orig/drivers/infiniband/ulp/iser/iser_initiator.c 2006-06-20 12:27:42.000000000 +0300 +++ infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c 2006-06-20 12:28:14.000000000 +0300 @@ -232,8 +232,11 @@ } rx_desc->type = ISCSI_RX; - /* for the login sequence we must support rx of upto 8K */ - if (conn->c_stage == ISCSI_CONN_INITIAL_STAGE) + /* for the login sequence we must support rx of upto 8K; login is done + * after conn create/bind (connect) and conn stop/bind (reconnect), + * what's common for both schemes is that the connection is not started + */ + if (conn->c_stage != ISCSI_CONN_STARTED) rx_data_size = DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH; else /* FIXME till user space sets conn->max_recv_dlength correctly */ rx_data_size = 128; From ogerlitz at voltaire.com Tue Jun 20 02:39:22 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 12:39:22 +0300 (IDT) Subject: [openib-general] resend [PATCH 1/2] IB/iser: don't access sc->request_buffer when sc->request_bufflen is zero In-Reply-To: References: Message-ID: Roland, there was an error in the changelog comment, here it's 
again.

calling sg_init_one on sc->request_buffer when sc->request_bufflen
is zero is unsafe

Signed-off-by: Or Gerlitz

Index: infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c
===================================================================
--- infiniband-git.orig/drivers/infiniband/ulp/iser/iser_initiator.c	2006-06-20 12:26:17.000000000 +0300
+++ infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c	2006-06-20 12:27:42.000000000 +0300
@@ -391,7 +391,8 @@
 	if (sc->use_sg) { /* using a scatter list */
 		data_buf->buf  = sc->request_buffer;
 		data_buf->size = sc->use_sg;
-	} else { /* using a single buffer - convert it into one entry SG */
+	} else if (sc->request_bufflen) {
+		/* using a single buffer - convert it into one entry SG */
 		sg_init_one(&data_buf->sg_single,
 			    sc->request_buffer, sc->request_bufflen);
 		data_buf->buf  = &data_buf->sg_single;

From halr at voltaire.com Tue Jun 20 03:08:32 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Jun 2006 06:08:32 -0400
Subject: [openib-general] ib_gid lookup
In-Reply-To:
References:
Message-ID: <1150798111.4391.111384.camel@hal.voltaire.com>

Hi Amit,

On Mon, 2006-06-19 at 20:36, Amit Byron wrote:
> hello,
> i'm trying to find whether i can do a lookup of ib_gid by either
> node name or node's ip address. is this information available from
> the subnet manager?

The SM doesn't know the node name, but you might be able to do this by
NodeDescription, depending on how the subnet was set up (the
NodeDescriptions would need to be made unique on each node; a script for
this was supplied for mthca; there is also a current standards issue,
being worked on, with the SM detecting that these had changed). If that
were done, the SA could be queried by NodeDescription, which would
return a NodeRecord containing the NodeInfo, which includes the NodeGUID
and PortGUID. Note it also returns the base LID as well.
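[Editor's note: the NodeDescription -> NodeRecord -> GUID chain described above can be sketched as below. The record layout and the in-memory table are simplified stand-ins for illustration, not the real SA wire formats or openib headers; a real implementation would issue an SA query instead of scanning a local array.]

```c
/* Rough sketch of the lookup chain: resolve a (unique) NodeDescription
 * to a NodeRecord, which carries the NodeGUID, PortGUID and base LID.
 * The types here are simplified stand-ins, NOT the real SA structures. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct node_record {            /* stand-in for an SA NodeRecord */
	char     node_desc[64]; /* NodeDescription; must be unique per node */
	uint64_t node_guid;
	uint64_t port_guid;     /* port GID = subnet prefix | PortGUID */
	uint16_t base_lid;
};

/* Pretend SA database; a real lookup would query the SA by
 * NodeDescription rather than scan a local table. */
static const struct node_record sa_table[] = {
	{ "node-a HCA-1", 0x0002c90200001111ULL, 0x0002c90200001112ULL, 4 },
	{ "node-b HCA-1", 0x0002c90200002221ULL, 0x0002c90200002222ULL, 7 },
};

/* Return the NodeRecord matching desc, or NULL if none matches. */
static const struct node_record *lookup_by_desc(const char *desc)
{
	size_t i;

	for (i = 0; i < sizeof(sa_table) / sizeof(sa_table[0]); i++)
		if (!strcmp(sa_table[i].node_desc, desc))
			return &sa_table[i];
	return NULL;
}
```

As Hal notes, this only works if the NodeDescriptions have been made unique on each node; with the default firmware strings every HCA would match the same description.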
The SM does not know the IP addresses unless they are registered by DAPL
(via ServiceRecords), but I'm not sure that is done anymore or whether
DAPL runs in your environment.

-- Hal

> thanks,
> Amit.
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From ogerlitz at voltaire.com Tue Jun 20 03:25:23 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 20 Jun 2006 13:25:23 +0300
Subject: [openib-general] IB/iser upstream push for 2.6.18 awaiting for the SCSI/iscsi updates
Message-ID: <4497CD13.6080200@voltaire.com>

Hi James,

Roland is ready to push iSER for 2.6.18 through his tree, but it can't
be done before the 2.6.18 iscsi updates on which iSER depends (libiscsi
etc.) are pushed (and pulled by Linus), so ... just wondering when you
plan to push the iscsi updates?

Or.

From halr at voltaire.com Tue Jun 20 06:06:43 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Jun 2006 09:06:43 -0400
Subject: [openib-general] [PATCHv5] osm: partition manager force policy
In-Reply-To: <86d5d5ge54.fsf@mtl066.yok.mtl.com>
References: <86d5d5ge54.fsf@mtl066.yok.mtl.com>
Message-ID: <1150808795.4391.118133.camel@hal.voltaire.com>

On Mon, 2006-06-19 at 15:05, Eitan Zahavi wrote:
> Hi Hal
>
> This is a 5th take after incorporating Sasha's last reported bug
> on bad assignment of the used_blocks.
>
> This code was run again through my verification flow and also Sasha
> had run some tests too.
>
> Eitan
>
> Signed-off-by: Eitan Zahavi

[snip...]
> Index: opensm/osm_pkey.c > =================================================================== > --- opensm/osm_pkey.c (revision 8113) > +++ opensm/osm_pkey.c (working copy) > @@ -94,18 +94,22 @@ void osm_pkey_tbl_destroy( > > /********************************************************************** > **********************************************************************/ > -int osm_pkey_tbl_init( > +ib_api_status_t > +osm_pkey_tbl_init( > IN osm_pkey_tbl_t *p_pkey_tbl) > { > cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); > cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); > cl_map_init( &p_pkey_tbl->keys, 1 ); > + cl_qlist_init( &p_pkey_tbl->pending ); > + p_pkey_tbl->used_blocks = 0; > + p_pkey_tbl->max_blocks = 0; > return(IB_SUCCESS); > } > > /********************************************************************** > **********************************************************************/ > -void osm_pkey_tbl_sync_new_blocks( > +void osm_pkey_tbl_init_new_blocks( > IN const osm_pkey_tbl_t *p_pkey_tbl) > { > ib_pkey_table_t *p_block, *p_new_block; > @@ -123,16 +127,31 @@ void osm_pkey_tbl_sync_new_blocks( > p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); > if (!p_new_block) > break; > + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, > + b, p_new_block); > + } > + > memset(p_new_block, 0, sizeof(*p_new_block)); > - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); > } > - memcpy(p_new_block, p_block, sizeof(*p_new_block)); > +} > + > +/********************************************************************** > + **********************************************************************/ > +void osm_pkey_tbl_cleanup_pending( > + IN osm_pkey_tbl_t *p_pkey_tbl) > +{ > + cl_list_item_t *p_item; > + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); > + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) > + { > + free( (osm_pending_pkey_t *)p_item ); > } > } > > 
/********************************************************************** > **********************************************************************/ > -int osm_pkey_tbl_set( > +ib_api_status_t > +osm_pkey_tbl_set( > IN osm_pkey_tbl_t *p_pkey_tbl, > IN uint16_t block, > IN ib_pkey_table_t *p_tbl) > @@ -203,7 +222,138 @@ int osm_pkey_tbl_set( > > /********************************************************************** > **********************************************************************/ > -static boolean_t __osm_match_pkey ( > +ib_api_status_t > +osm_pkey_tbl_make_block_pair( > + osm_pkey_tbl_t *p_pkey_tbl, > + uint16_t block_idx, > + ib_pkey_table_t **pp_old_block, > + ib_pkey_table_t **pp_new_block) > +{ > + if (block_idx >= p_pkey_tbl->max_blocks) return(IB_ERROR); > + > + if (pp_old_block) > + { > + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); > + if (! *pp_old_block) > + { > + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); > + if (!*pp_old_block) return(IB_ERROR); > + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); > + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); > + } > + } > + > + if (pp_new_block) > + { > + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); > + if (! *pp_new_block) > + { > + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); > + if (!*pp_new_block) return(IB_ERROR); > + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); > + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); > + } > + } > + return( IB_SUCCESS ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +/* > + store the given pkey in the "new" blocks array > + also makes sure the regular block exists. 
> +*/ > +ib_api_status_t > +osm_pkey_tbl_set_new_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t block_idx, > + IN uint8_t pkey_idx, > + IN uint16_t pkey) > +{ > + ib_pkey_table_t *p_old_block; > + ib_pkey_table_t *p_new_block; > + > + if (osm_pkey_tbl_make_block_pair( > + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > + return( IB_ERROR ); > + > + p_new_block->pkey_entry[pkey_idx] = pkey; > + if (p_pkey_tbl->used_blocks <= block_idx) > + p_pkey_tbl->used_blocks = block_idx + 1; > + > + return( IB_SUCCESS ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +boolean_t > +osm_pkey_find_next_free_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + OUT uint16_t *p_block_idx, > + OUT uint8_t *p_pkey_idx) > +{ > + ib_pkey_table_t *p_new_block; > + > + CL_ASSERT(p_block_idx); > + CL_ASSERT(p_pkey_idx); > + > + while ( *p_block_idx < p_pkey_tbl->max_blocks) > + { > + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) > + { > + *p_pkey_idx = 0; > + (*p_block_idx)++; > + if (*p_block_idx >= p_pkey_tbl->max_blocks) > + return FALSE; > + } > + > + p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); > + > + if ( !p_new_block || > + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) > + return TRUE; > + else > + (*p_pkey_idx)++; > + } > + return FALSE; > +} > + > +/********************************************************************** > + **********************************************************************/ > +ib_api_status_t > +osm_pkey_tbl_get_block_and_idx( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t *p_pkey, > + OUT uint32_t *p_block_idx, > + OUT uint8_t *p_pkey_index) > +{ > + uint32_t num_of_blocks; > + uint32_t block_index; > + ib_pkey_table_t *block; > + > + CL_ASSERT( p_pkey_tbl ); Should the other routines also assert on this or should this be consistent with the others ? 
> + CL_ASSERT( p_block_idx != NULL ); > + CL_ASSERT( p_pkey_idx != NULL ); There is no p_pkey_idx parameter. I presume this should be p_pkey_index. Also, should there be: CL_ASSERT( p_pkey ); as well ? > + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); > + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + { > + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > + if ( ( block->pkey_entry <= p_pkey ) && > + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) > + { > + *p_block_idx = block_index; > + *p_pkey_index = p_pkey - block->pkey_entry; > + return( IB_SUCCESS ); > + } > + } > + return( IB_NOT_FOUND ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +static boolean_t > +__osm_match_pkey ( > IN const ib_net16_t *pkey1, > IN const ib_net16_t *pkey2 ) { > > @@ -306,7 +456,8 @@ osm_physp_share_pkey( > if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) > return TRUE; > > - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); > + return > + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); > } > > /********************************************************************** [snip...] Also, two things about osm_pkey_mgr.c: Was there a need to reorder the routines ? This broke the diff so it had to be done largely by hand. Also, it would have been nice not to mix the format changes with the substantive changes. Try to keep it to "one thought per patch". This patch has been applied with cosmetic changes. We will go from here... 
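[Editor's note: the search reviewed above, osm_pkey_tbl_get_block_and_idx(), recovers a block index and within-block index from a pointer into one of the pkey blocks. The standalone sketch below illustrates that pointer-range scan; the block type is a simplified stand-in for the osm structures, not the real opensm code.]

```c
/* Standalone sketch of the block/index recovery pattern: given a
 * pointer to a pkey entry somewhere in the table, find which block it
 * lives in and its index inside that block.  Simplified stand-in types. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PKEY_ELEMENTS_IN_BLOCK 32   /* entries per pkey block in IB */

struct pkey_block {
	uint16_t pkey_entry[NUM_PKEY_ELEMENTS_IN_BLOCK];
};

/* Fill *block_idx / *pkey_idx if p_pkey points into one of the blocks;
 * return 0 on success, -1 if not found (cf. IB_NOT_FOUND). */
static int get_block_and_idx(const struct pkey_block *blocks,
			     size_t num_blocks, const uint16_t *p_pkey,
			     size_t *block_idx, size_t *pkey_idx)
{
	size_t b;

	for (b = 0; b < num_blocks; b++) {
		const uint16_t *base = blocks[b].pkey_entry;

		if (p_pkey >= base &&
		    p_pkey < base + NUM_PKEY_ELEMENTS_IN_BLOCK) {
			*block_idx = b;
			*pkey_idx  = (size_t)(p_pkey - base);
			return 0;
		}
	}
	return -1;
}

/* Test helper: pack the result as block*1000 + idx, or -1 if absent. */
static long locate(const struct pkey_block *blocks, size_t n,
		   const uint16_t *p)
{
	size_t b, i;

	if (get_block_and_idx(blocks, n, p, &b, &i))
		return -1;
	return (long)(b * 1000 + i);
}

static struct pkey_block tbl[2];
```

Note that, as in the patch, the caller gets both IB_SUCCESS/IB_NOT_FOUND style status and the two out-parameters; asserting on non-NULL out-pointers up front (Hal's review point) is cheap and catches misuse early.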
-- Hal From dotanb at mellanox.co.il Tue Jun 20 07:30:44 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 20 Jun 2006 17:30:44 +0300 Subject: [openib-general] [librdmacm] check return value in operations of rping Message-ID: <200606201730.45253.dotanb@mellanox.co.il> Added checks to the return values of all of the functions that may fail (in order to add this test to the regression system). Signed-off-by: Dotan Barak Index: last_stable/src/userspace/librdmacm/examples/rping.c =================================================================== --- last_stable.orig/src/userspace/librdmacm/examples/rping.c 2006-06-20 14:41:47.000000000 +0300 +++ last_stable/src/userspace/librdmacm/examples/rping.c 2006-06-20 14:42:12.000000000 +0300 @@ -157,10 +157,10 @@ struct rping_cb { struct rdma_cm_id *child_cm_id; /* connection on server side */ }; -static void rping_cma_event_handler(struct rdma_cm_id *cma_id, +static int rping_cma_event_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) { - int ret; + int ret = 0; struct rping_cb *cb = cma_id->context; DEBUG_LOG("cma_event type %d cma_id %p (%s)\n", event->event, cma_id, @@ -209,6 +209,7 @@ static void rping_cma_event_handler(stru fprintf(stderr, "cma event %d, error %d\n", event->event, event->status); sem_post(&cb->sem); + ret = -1; break; case RDMA_CM_EVENT_DISCONNECTED: @@ -218,13 +219,17 @@ static void rping_cma_event_handler(stru case RDMA_CM_EVENT_DEVICE_REMOVAL: fprintf(stderr, "cma detected device removal!!!!\n"); + ret = -1; break; default: fprintf(stderr, "oof bad type!\n"); sem_post(&cb->sem); + ret = -1; break; } + + return ret; } static int server_recv(struct rping_cb *cb, struct ibv_wc *wc) @@ -263,16 +268,20 @@ static int client_recv(struct rping_cb * return 0; } -static void rping_cq_event_handler(struct rping_cb *cb) +static int rping_cq_event_handler(struct rping_cb *cb) { struct ibv_wc wc; struct ibv_recv_wr *bad_wr; int ret; while ((ret = ibv_poll_cq(cb->cq, 1, &wc)) == 1) { + ret 
= 0; + if (wc.status) { fprintf(stderr, "cq completion failed status %d\n", wc.status); + if (wc.status != IBV_WC_WR_FLUSH_ERR) + ret = -1; goto error; } @@ -312,6 +321,7 @@ static void rping_cq_event_handler(struc default: DEBUG_LOG("unknown!!!!! completion\n"); + ret = -1; goto error; } } @@ -319,11 +329,12 @@ static void rping_cq_event_handler(struc fprintf(stderr, "poll error %d\n", ret); goto error; } - return; + return 0; error: cb->state = ERROR; sem_post(&cb->sem); + return ret; } static int rping_accept(struct rping_cb *cb) @@ -560,7 +571,9 @@ static void *cm_thread(void *arg) fprintf(stderr, "rdma_get_cm_event err %d\n", ret); exit(ret); } - rping_cma_event_handler(event->id, event); + ret = rping_cma_event_handler(event->id, event); + if (ret) + exit(ret); rdma_ack_cm_event(event); } } @@ -589,7 +602,9 @@ static void *cq_thread(void *arg) fprintf(stderr, "Failed to set notify!\n"); exit(ret); } - rping_cq_event_handler(cb); + ret = rping_cq_event_handler(cb); + if (ret) + exit(ret); ibv_ack_cq_events(cb->cq, 1); } } @@ -606,7 +621,7 @@ static void rping_format_send(struct rpi info->buf, info->rkey, info->size); } -static void rping_test_server(struct rping_cb *cb) +static int rping_test_server(struct rping_cb *cb) { struct ibv_send_wr *bad_wr; int ret; @@ -617,6 +632,7 @@ static void rping_test_server(struct rpi if (cb->state != RDMA_READ_ADV) { fprintf(stderr, "wait for RDMA_READ_ADV state %d\n", cb->state); + ret = -1; break; } @@ -640,6 +656,7 @@ static void rping_test_server(struct rpi if (cb->state != RDMA_READ_COMPLETE) { fprintf(stderr, "wait for RDMA_READ_COMPLETE state %d\n", cb->state); + ret = -1; break; } DEBUG_LOG("server received read complete\n"); @@ -661,6 +678,7 @@ static void rping_test_server(struct rpi if (cb->state != RDMA_WRITE_ADV) { fprintf(stderr, "wait for RDMA_WRITE_ADV state %d\n", cb->state); + ret = -1; break; } DEBUG_LOG("server received sink adv\n"); @@ -686,6 +704,7 @@ static void rping_test_server(struct rpi if 
(cb->state != RDMA_WRITE_COMPLETE) { fprintf(stderr, "wait for RDMA_WRITE_COMPLETE state %d\n", cb->state); + ret = -1; break; } DEBUG_LOG("server rdma write complete \n"); @@ -698,6 +717,8 @@ static void rping_test_server(struct rpi } DEBUG_LOG("server posted go ahead\n"); } + + return ret; } static int rping_bind_server(struct rping_cb *cb) @@ -734,19 +755,19 @@ static int rping_bind_server(struct rpin return 0; } -static void rping_run_server(struct rping_cb *cb) +static int rping_run_server(struct rping_cb *cb) { struct ibv_recv_wr *bad_wr; int ret; ret = rping_bind_server(cb); if (ret) - return; + return ret; ret = rping_setup_qp(cb, cb->child_cm_id); if (ret) { fprintf(stderr, "setup_qp failed: %d\n", ret); - return; + return ret; } ret = rping_setup_buffers(cb); @@ -776,11 +797,13 @@ err2: rping_free_buffers(cb); err1: rping_free_qp(cb); + + return ret; } -static void rping_test_client(struct rping_cb *cb) +static int rping_test_client(struct rping_cb *cb) { - int ping, start, cc, i, ret; + int ping, start, cc, i, ret = 0; struct ibv_send_wr *bad_wr; unsigned char c; @@ -813,6 +836,7 @@ static void rping_test_client(struct rpi if (cb->state != RDMA_WRITE_ADV) { fprintf(stderr, "wait for RDMA_WRITE_ADV state %d\n", cb->state); + ret = -1; break; } @@ -828,18 +852,22 @@ static void rping_test_client(struct rpi if (cb->state != RDMA_WRITE_COMPLETE) { fprintf(stderr, "wait for RDMA_WRITE_COMPLETE state %d\n", cb->state); + ret = -1; break; } if (cb->validate) if (memcmp(cb->start_buf, cb->rdma_buf, cb->size)) { fprintf(stderr, "data mismatch!\n"); + ret = -1; break; } if (cb->verbose) printf("ping data: %s\n", cb->rdma_buf); } + + return ret; } static int rping_connect_client(struct rping_cb *cb) @@ -896,19 +924,19 @@ static int rping_bind_client(struct rpin return 0; } -static void rping_run_client(struct rping_cb *cb) +static int rping_run_client(struct rping_cb *cb) { struct ibv_recv_wr *bad_wr; int ret; ret = rping_bind_client(cb); if (ret) - return; + 
return ret; ret = rping_setup_qp(cb, cb->cm_id); if (ret) { fprintf(stderr, "setup_qp failed: %d\n", ret); - return; + return ret; } ret = rping_setup_buffers(cb); @@ -937,6 +965,8 @@ err2: rping_free_buffers(cb); err1: rping_free_qp(cb); + + return ret; } static void usage(char *name) @@ -1054,9 +1084,9 @@ int main(int argc, char *argv[]) pthread_create(&cb->cmthread, NULL, cm_thread, cb); if (cb->server) - rping_run_server(cb); + ret = rping_run_server(cb); else - rping_run_client(cb); + ret = rping_run_client(cb); DEBUG_LOG("destroy cm_id %p\n", cb->cm_id); rdma_destroy_id(cb->cm_id); From rdreier at cisco.com Tue Jun 20 07:59:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 Jun 2006 07:59:59 -0700 Subject: [openib-general] iSER updates In-Reply-To: <44979F14.9000905@voltaire.com> (Or Gerlitz's message of "Tue, 20 Jun 2006 10:09:08 +0300") References: <44977F81.9080206@voltaire.com> <44979F14.9000905@voltaire.com> Message-ID: Or> I see that the patch is applied at the for-mm branch but not Or> at the iser branch, is it fine? Sorry, I forgot to push out an update the iser branch on master.kernel.org. It should be OK now. - R. From swise at opengridcomputing.com Tue Jun 20 08:23:43 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 10:23:43 -0500 Subject: [openib-general] [librdmacm] check return value in operations of rping In-Reply-To: <200606201730.45253.dotanb@mellanox.co.il> References: <200606201730.45253.dotanb@mellanox.co.il> Message-ID: <1150817023.22519.22.camel@stevo-desktop> This patch is malformed, I think. Did your mailer munge it? On Tue, 2006-06-20 at 17:30 +0300, Dotan Barak wrote: > Added checks to the return values of all of the functions that may fail > (in order to add this test to the regression system). 
>
> Signed-off-by: Dotan Barak

[snip...]

From swise at opengridcomputing.com Tue Jun 20 08:49:16 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 20 Jun 2006 10:49:16 -0500
Subject: [openib-general] [librdmacm] check return value in operations of rping
In-Reply-To: <200606201844.34381.dotanb@mellanox.co.il>
References: <200606201730.45253.dotanb@mellanox.co.il> <1150817023.22519.22.camel@stevo-desktop> <200606201844.34381.dotanb@mellanox.co.il>
Message-ID: <1150818556.22519.28.camel@stevo-desktop>

This is still balled up. It's like the tabs have been converted to
spaces... Go ahead and send it as an attachment and I'll review it...

Stevo.

On Tue, 2006-06-20 at 18:44 +0300, Dotan Barak wrote:
> On Tuesday 20 June 2006 18:23, Steve Wise wrote:
> > This patch is malformed, I think. Did your mailer munge it?
>
> Sorry, i changed the mail client recently and it wasnt' configured properly ...
>
> i hope that this patch looks better ..
>
> Added checks to the return values of all of the functions that may fail
> (in order to add this test to the regression system).
>
> Signed-off-by: Dotan Barak

[snip...]
ibv_recv_wr *bad_wr; > int ret; > > ret = rping_bind_client(cb); > if (ret) > - return; > + return ret; > > ret = rping_setup_qp(cb, cb->cm_id); > if (ret) { > fprintf(stderr, "setup_qp failed: %d\n", ret); > - return; > + return ret; > } > > ret = rping_setup_buffers(cb); > @@ -937,6 +965,8 @@ err2: > rping_free_buffers(cb); > err1: > rping_free_qp(cb); > + > + return ret; > } > > static void usage(char *name) > @@ -1054,9 +1084,9 @@ int main(int argc, char *argv[]) > pthread_create(&cb->cmthread, NULL, cm_thread, cb); > > if (cb->server) > - rping_run_server(cb); > + ret = rping_run_server(cb); > else > - rping_run_client(cb); > + ret = rping_run_client(cb); > > DEBUG_LOG("destroy cm_id %p\n", cb->cm_id); > rdma_destroy_id(cb->cm_id); From bchang at atipa.com Tue Jun 20 08:42:23 2006 From: bchang at atipa.com (Brady Chang) Date: Tue, 20 Jun 2006 10:42:23 -0500 Subject: [openib-general] FW: mvapich xhpl memory usage References: <0D6FBA307D01EA42BAC8715725643AA01EDB53@EXCHG2003.microtech-ks.com> Message-ID: <0D6FBA307D01EA42BAC8715725643AA01EDB54@EXCHG2003.microtech-ks.com> Hello, I installed OFED 1.0 (mvapich 0.97) and compiled the Linpack benchmark. When I run xhpl, the memory usage creeps up with each NB, and as each N changes, the memory allocated is not freed. LAZY_MEM_REGISTER is not defined per http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-March/000057.html I removed it from Make.mvapich.gen2, tarred it back up, and reran the install. Hardware: dual-core Opteron, 4 GB memory. InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev a0) xhpl compiled using libmpich.so thanks -Brady -------------- next part -------------- An HTML attachment was scrubbed...
URL: From rdreier at cisco.com Tue Jun 20 09:29:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 Jun 2006 09:29:20 -0700 Subject: [openib-general] [PATCH 2/2] IB/iser: bugfix for the reconnect flow In-Reply-To: (Or Gerlitz's message of "Tue, 20 Jun 2006 12:35:51 +0300 (IDT)") References: Message-ID: Thanks, I rolled up both of these patches into what I have queued. From amit_byron at yahoo.com Tue Jun 20 09:27:50 2006 From: amit_byron at yahoo.com (amit byron) Date: Tue, 20 Jun 2006 16:27:50 +0000 (UTC) Subject: [openib-general] =?utf-8?q?ib=5Fgid_lookup?= References: <1150798111.4391.111384.camel@hal.voltaire.com> Message-ID: > Hal Rosenstock voltaire.com> writes: > > > Hi Amit, > > On Mon, 2006-06-19 at 20:36, Amit Byron wrote: > > hello, > > i'm trying to find whether i can do a lookup of ib_gid by either > > node name or node's ip address. is this information available from > > the subnet manager? > > The SM doesn't know the node name but you might be able to do this by > NodeDescription depending on how the subnet was setup (the > NodeDescriptions would need to be made unique on each node; a script for > this was supplied for mthca; there is also a current standards issue > with the SM detecting that these had changed which is being worked on). > If that were to be done, the SA could be queried by NodeDescription > which would return a NodeRecord which would obtain the NodeInfo which > includes the NodeGUID and PortGUID. Note it also returns the base LID as > well. hi Hal, thank you very much for your suggestions. do you mean to say setting up subnet through the topology file? are there any examples on how to setup the topology file? also, where can i find the mthca script that you mention above. > > The SM does not know the IP addresses unless they are registered by DAPL > (via ServiceRecords) but I'm not sure that is done anymore or whether > DAPL runs in your environment. 
> if i run DAPL in my environment will it work or this is already made obsolete? thanks again, Amit From halr at voltaire.com Tue Jun 20 09:44:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jun 2006 12:44:34 -0400 Subject: [openib-general] [PATCH] OpenSM/SA: Message-ID: <1150821864.4391.126438.camel@hal.voltaire.com> OpenSM/SA: Properly handle non base LID requests per C15-0.1.11 on remaining SA records where this hasn't been fixed already. C15-0.1.11: Query responses shall contain a port's base LID in any LID component of a RID. So when LMC is non 0, the only records that appear are those with the base LID and not with any masked LIDs. Furthermore, if a query comes in on a non base LID, the LID in the RID returned is only with the base LID. Also, fixed some error handling for SA GetTable requests in these SA records. Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_guidinfo_record.c =================================================================== --- opensm/osm_sa_guidinfo_record.c (revision 8140) +++ opensm/osm_sa_guidinfo_record.c (working copy) @@ -201,12 +201,10 @@ __osm_sa_gir_create_gir( uint8_t port_num; uint8_t num_ports; uint16_t match_lid_ho; - uint16_t lid_ho; ib_net16_t base_lid_ho; ib_net16_t max_lid_ho; uint8_t lmc; ib_net64_t port_guid; - ib_api_status_t status; const ib_port_info_t* p_pi; uint8_t block_num, start_block_num, end_block_num, num_blocks; @@ -276,11 +274,12 @@ __osm_sa_gir_create_gir( } base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); - lmc = osm_physp_get_lmc( p_physp ); - max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); match_lid_ho = cl_ntoh16( match_lid ); if( match_lid_ho ) { + lmc = osm_physp_get_lmc( p_physp ); + max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); + /* We validate that the lid belongs to this node. */ @@ -295,34 +294,15 @@ __osm_sa_gir_create_gir( ); } - if( match_lid_ho <= max_lid_ho && match_lid_ho >= base_lid_ho ) - { - /* - Ignore return code for now. 
- */ - for (block_num = start_block_num; block_num <= end_block_num; block_num++) - __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, - port_guid, match_lid, - p_physp, block_num ); - } - } - else - { - /* - For every lid value create the GUIDInfo record(s). - */ - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) - { - for (block_num = start_block_num; block_num <= end_block_num; block_num++) - { - status = __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, - port_guid, cl_hton16( lid_ho ), - p_physp, block_num ); - if( status != IB_SUCCESS ) - break; - } - } + if ( match_lid_ho < base_lid_ho || match_lid_ho > max_lid_ho ) + continue; } + + for (block_num = start_block_num; block_num <= end_block_num; block_num++) + __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, + port_guid, cl_ntoh16(base_lid_ho), + p_physp, block_num ); + } OSM_LOG_EXIT( p_rcv->p_log ); @@ -496,24 +476,32 @@ osm_gir_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_gir_rcv_process: ERR 5103: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - - /* need to set the mem free ... 
*/ - p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_gir_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5103: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - goto Exit; + /* need to set the mem free ... */ + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_gir_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; Index: opensm/osm_sa_lft_record.c =================================================================== --- opensm/osm_sa_lft_record.c (revision 8140) +++ opensm/osm_sa_lft_record.c (working copy) @@ -199,7 +199,6 @@ __osm_lftr_get_port_by_guid( p_port = (osm_port_t *)cl_qmap_get(&p_rcv->p_subn->port_guid_tbl, port_guid); - if(p_port == (osm_port_t *)cl_qmap_end(&p_rcv->p_subn->port_guid_tbl)) { osm_log( p_rcv->p_log, OSM_LOG_DEBUG, @@ -249,9 +248,6 @@ __osm_lftr_rcv_by_comp_mask( return; } - /* get the port 0 of the switch */ - osm_port_get_lid_range_ho( p_port, &min_lid_ho, &max_lid_ho ); - /* check that the requester physp and the current physp are under the same partition. */ p_physp = osm_port_get_default_phys_ptr( p_port ); @@ -268,6 +264,9 @@ __osm_lftr_rcv_by_comp_mask( if (! 
osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_physp )) return; + /* get the port 0 of the switch */ + osm_port_get_lid_range_ho( p_port, &min_lid_ho, &max_lid_ho ); + /* compare the lids - if required */ if( comp_mask & IB_LFTR_COMPMASK_LID ) { @@ -277,8 +276,8 @@ __osm_lftr_rcv_by_comp_mask( cl_ntoh16( p_rcvd_rec->lid ), min_lid_ho, max_lid_ho ); /* ok we are ready for range check */ - if ((min_lid_ho > cl_ntoh16(p_rcvd_rec->lid)) || - (max_lid_ho < cl_ntoh16(p_rcvd_rec->lid))) + if (min_lid_ho > cl_ntoh16(p_rcvd_rec->lid) || + max_lid_ho < cl_ntoh16(p_rcvd_rec->lid)) return; } @@ -323,7 +322,7 @@ osm_lftr_rcv_process( uint32_t i; osm_lftr_search_ctxt_t context; osm_lftr_item_t* p_rec_item; - ib_api_status_t status; + ib_api_status_t status = IB_SUCCESS; osm_physp_t* p_req_physp; CL_ASSERT( p_rcv ); @@ -382,24 +381,32 @@ osm_lftr_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_lftr_rcv_process: ERR 4409: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS); - - /* need to set the mem free ... 
*/ - p_rec_item = (osm_lftr_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_lftr_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_lftr_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_lftr_rcv_process: ERR 4409: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS); - goto Exit; + /* need to set the mem free ... */ + p_rec_item = (osm_lftr_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_lftr_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_lftr_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; Index: opensm/osm_sa_node_record.c =================================================================== --- opensm/osm_sa_node_record.c (revision 8140) +++ opensm/osm_sa_node_record.c (working copy) @@ -264,15 +264,12 @@ __osm_nr_rcv_create_nr( ); } - if( (match_lid_ho <= max_lid_ho) && (match_lid_ho >= base_lid_ho) ) - { - __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); - } - } - else - { - __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); + if ( match_lid_ho < base_lid_ho || match_lid_ho > max_lid_ho ) + continue; } + + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); + } OSM_LOG_EXIT( p_rcv->p_log ); Index: opensm/osm_sa_path_record.c =================================================================== --- opensm/osm_sa_path_record.c (revision 8140) +++ opensm/osm_sa_path_record.c (working copy) @@ -1027,8 +1027,7 @@ __osm_pr_rcv_get_end_points( status = cl_ptr_vector_at( 
&p_rcv->p_subn->port_lid_tbl, cl_ntoh16(p_pr->slid), (void**)pp_src_port ); - if( ( (status != CL_SUCCESS) || (*pp_src_port == NULL) ) && - (p_sa_mad->method == IB_MAD_METHOD_GET) ) + if( (status != CL_SUCCESS) || (*pp_src_port == NULL) ) { /* This 'error' is the client's fault (bad lid) so @@ -1077,8 +1076,7 @@ __osm_pr_rcv_get_end_points( status = cl_ptr_vector_at( &p_rcv->p_subn->port_lid_tbl, cl_ntoh16(p_pr->dlid), (void**)pp_dest_port ); - if( ( (status != CL_SUCCESS) || (*pp_dest_port == NULL) ) && - (p_sa_mad->method == IB_MAD_METHOD_GET) ) + if( (status != CL_SUCCESS) || (*pp_dest_port == NULL) ) { /* This 'error' is the client's fault (bad lid) so @@ -1521,22 +1519,30 @@ __osm_pr_rcv_respond( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_pr_rcv_respond: ERR 1F13: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - /* need to set the mem free ... */ - p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); - while( p_pr_item != (osm_pr_item_t*)cl_qlist_end( p_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; + } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_respond: ERR 1F13: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); + /* need to set the mem free ... 
*/ p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); + while( p_pr_item != (osm_pr_item_t*)cl_qlist_end( p_list ) ) + { + cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); + p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); + } + goto Exit; } - goto Exit; } pre_trim_num_rec = num_rec; @@ -1704,40 +1710,36 @@ osm_pr_rcv_process( sa_status = __osm_pr_rcv_get_end_points( p_rcv, p_madw, &p_src_port, &p_dest_port ); - if( sa_status != IB_SA_MAD_STATUS_SUCCESS ) + if( sa_status == IB_SA_MAD_STATUS_SUCCESS ) { - cl_plock_release( p_rcv->p_lock ); - osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status ); - goto Exit; - } - - /* - What happens next depends on the type of endpoint information - that was specified.... - */ - if( p_src_port ) - { - if( p_dest_port ) - __osm_pr_rcv_process_pair( p_rcv, p_madw, requester_port, - p_src_port, p_dest_port, - p_sa_mad->comp_mask, &pr_list ); - else - __osm_pr_rcv_process_half( p_rcv, p_madw, requester_port, - p_src_port, NULL, - p_sa_mad->comp_mask, &pr_list ); - } - else - { - if( p_dest_port ) - __osm_pr_rcv_process_half( p_rcv, p_madw, requester_port, - NULL, p_dest_port, - p_sa_mad->comp_mask, &pr_list ); + /* + What happens next depends on the type of endpoint information + that was specified.... + */ + if( p_src_port ) + { + if( p_dest_port ) + __osm_pr_rcv_process_pair( p_rcv, p_madw, requester_port, + p_src_port, p_dest_port, + p_sa_mad->comp_mask, &pr_list ); + else + __osm_pr_rcv_process_half( p_rcv, p_madw, requester_port, + p_src_port, NULL, + p_sa_mad->comp_mask, &pr_list ); + } else - /* - Katie, bar the door! - */ - __osm_pr_rcv_process_world( p_rcv, p_madw, requester_port, - p_sa_mad->comp_mask, &pr_list ); + { + if( p_dest_port ) + __osm_pr_rcv_process_half( p_rcv, p_madw, requester_port, + NULL, p_dest_port, + p_sa_mad->comp_mask, &pr_list ); + else + /* + Katie, bar the door! 
+ */ + __osm_pr_rcv_process_world( p_rcv, p_madw, requester_port, + p_sa_mad->comp_mask, &pr_list ); + } } goto Unlock; Index: opensm/osm_sa_pkey_record.c =================================================================== --- opensm/osm_sa_pkey_record.c (revision 8140) +++ opensm/osm_sa_pkey_record.c (working copy) @@ -332,7 +332,7 @@ osm_pkey_rec_rcv_process( uint32_t i; osm_pkey_search_ctxt_t context; osm_pkey_item_t* p_rec_item; - ib_api_status_t status; + ib_api_status_t status = IB_SUCCESS; ib_net64_t comp_mask; osm_physp_t* p_req_physp; @@ -421,30 +421,38 @@ osm_pkey_rec_rcv_process( if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) { - p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_rcvd_rec->lid) ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) + { + status = IB_NOT_FOUND; + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_pkey_rec_rcv_process: ERR 460B: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); + } } else { /* port out of range */ - cl_plock_release( p_rcv->p_lock ); - + status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_pkey_rec_rcv_process: ERR 4609: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); - osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); - goto Exit; } } - /* if we got a unique port - no need for a port search */ - if( p_port ) - /* this does the loop on all the port phys ports */ - __osm_sa_pkey_by_comp_mask( p_rcv, p_port, &context ); - else - { - cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, - __osm_sa_pkey_by_comp_mask_cb, - &context ); + if ( status == IB_SUCCESS ) + { + /* if we got a unique port - no need for a port search */ + if( p_port ) + /* this does the loop on all the port phys ports */ + __osm_sa_pkey_by_comp_mask( p_rcv, p_port, &context ); + else + { + cl_qmap_apply_func( 
&p_rcv->p_subn->port_guid_tbl, + __osm_sa_pkey_by_comp_mask_cb, + &context ); + } } cl_plock_release( p_rcv->p_lock ); @@ -455,24 +463,32 @@ osm_pkey_rec_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: ERR 460A: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS); - - /* need to set the mem free ... */ - p_rec_item = (osm_pkey_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_pkey_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_pkey_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_pkey_rec_rcv_process: ERR 460A: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS); - goto Exit; + /* need to set the mem free ... 
*/ + p_rec_item = (osm_pkey_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_pkey_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_pkey_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 8140) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -317,7 +317,7 @@ osm_slvl_rec_rcv_process( uint32_t i; osm_slvl_search_ctxt_t context; osm_slvl_item_t* p_rec_item; - ib_api_status_t status; + ib_api_status_t status = IB_SUCCESS; ib_net64_t comp_mask; osm_physp_t* p_req_physp; @@ -389,30 +389,38 @@ osm_slvl_rec_rcv_process( if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) { - p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_rcvd_rec->lid) ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) + { + status = IB_NOT_FOUND; + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_slvl_rec_rcv_process: ERR 2608: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); + } } else { /* port out of range */ - cl_plock_release( p_rcv->p_lock ); - + status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_slvl_rec_rcv_process: ERR 2601: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); - osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); - goto Exit; } } - /* if we have a unique port - no need for a port search */ - if( p_port ) - /* this does the loop on all the port phys ports */ - __osm_sa_slvl_by_comp_mask( p_rcv, p_port, &context ); - else - { - cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, - __osm_sa_slvl_by_comp_mask_cb, - &context ); + if ( status == IB_SUCCESS ) + { + /* if we have a unique port - 
no need for a port search */ + if( p_port ) + /* this does the loop on all the port phys ports */ + __osm_sa_slvl_by_comp_mask( p_rcv, p_port, &context ); + else + { + cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, + __osm_sa_slvl_by_comp_mask_cb, + &context ); + } } cl_plock_release( p_rcv->p_lock ); @@ -423,24 +431,32 @@ osm_slvl_rec_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: ERR 2607: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - - /* need to set the mem free ... */ - p_rec_item = (osm_slvl_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_slvl_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_slvl_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_slvl_rec_rcv_process: ERR 2607: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - goto Exit; + /* need to set the mem free ... 
*/ + p_rec_item = (osm_slvl_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_slvl_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_slvl_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 8140) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -337,7 +337,7 @@ osm_vlarb_rec_rcv_process( uint32_t i; osm_vl_arb_search_ctxt_t context; osm_vl_arb_item_t* p_rec_item; - ib_api_status_t status; + ib_api_status_t status = IB_SUCCESS; ib_net64_t comp_mask; osm_physp_t* p_req_physp; @@ -409,30 +409,38 @@ osm_vlarb_rec_rcv_process( if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) { - p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_rcvd_rec->lid) ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) + { + status = IB_NOT_FOUND; + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_vlarb_rec_rcv_process: ERR 2A09: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); + } } else { /* port out of range */ - cl_plock_release( p_rcv->p_lock ); - + status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_vlarb_rec_rcv_process: ERR 2A01: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); - osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); - goto Exit; } } - /* if we got a unique port - no need for a port search */ - if( p_port ) - /* this does the loop on all the port phys ports */ - __osm_sa_vl_arb_by_comp_mask( p_rcv, p_port, &context ); - else - { - cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, - __osm_sa_vl_arb_by_comp_mask_cb, - &context ); + if ( status == IB_SUCCESS ) + { + /* if we got a 
unique port - no need for a port search */ + if( p_port ) + /* this does the loop on all the port phys ports */ + __osm_sa_vl_arb_by_comp_mask( p_rcv, p_port, &context ); + else + { + cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, + __osm_sa_vl_arb_by_comp_mask_cb, + &context ); + } } cl_plock_release( p_rcv->p_lock ); @@ -443,24 +451,32 @@ osm_vlarb_rec_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: ERR 2A08: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - - /* need to set the mem free ... */ - p_rec_item = (osm_vl_arb_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_vl_arb_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_vl_arb_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_vlarb_rec_rcv_process: ERR 2A08: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - goto Exit; + /* need to set the mem free ... 
*/ + p_rec_item = (osm_vl_arb_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_vl_arb_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_vl_arb_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; From halr at voltaire.com Tue Jun 20 09:50:01 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jun 2006 12:50:01 -0400 Subject: [openib-general] ib_gid lookup In-Reply-To: References: <1150798111.4391.111384.camel@hal.voltaire.com> Message-ID: <1150822037.4391.126581.camel@hal.voltaire.com> Hi again Amit, On Tue, 2006-06-20 at 12:27, amit byron wrote: > > Hal Rosenstock voltaire.com> writes: > > > > > > Hi Amit, > > > > On Mon, 2006-06-19 at 20:36, Amit Byron wrote: > > > hello, > > > i'm trying to find whether i can do a lookup of ib_gid by either > > > node name or node's ip address. is this information available from > > > the subnet manager? > > > > The SM doesn't know the node name but you might be able to do this by > > NodeDescription depending on how the subnet was setup (the > > NodeDescriptions would need to be made unique on each node; a script for > > this was supplied for mthca; there is also a current standards issue > > with the SM detecting that these had changed which is being worked on). > > If that were to be done, the SA could be queried by NodeDescription > > which would return a NodeRecord which would obtain the NodeInfo which > > includes the NodeGUID and PortGUID. Note it also returns the base LID as > > well. > > hi Hal, > thank you very much for your suggestions. > > do you mean to say setting up subnet through the topology file? No (although the topology file does display this information). > are > there any examples on how to setup the topology file? also, where can > i find the mthca script that you mention above. 
management/diags/scripts/set_mthca_nodedesc.sh > > The SM does not know the IP addresses unless they are registered by DAPL > > (via ServiceRecords) but I'm not sure that is done anymore or whether > > DAPL runs in your environment. > > > > if i run DAPL in my environment will it work or this is already made > obsolete? I don't know. James or maybe Arlin would be the ones to answer. You could also look at the code to figure this out. -- Hal > thanks again, > Amit > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From robert.j.woodruff at intel.com Tue Jun 20 09:55:57 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 20 Jun 2006 09:55:57 -0700 Subject: [openib-general] ipath verbs does not compile against the latest SVN trunk verbs Message-ID: <1AC79F16F5C5284499BB9591B33D6F0008057B9B@orsmsx408> When I try to build SVN 8112 I get the following errors trying to build the ipath verbs. 
src/ipathverbs.c:148: warning: its scope is only this definition or declaration, which is probably not what you want src/ipathverbs.c: In function `openib_driver_init': src/ipathverbs.c:156: warning: implicit declaration of function `sysfs_get_classdev_device' src/ipathverbs.c:156: warning: assignment makes pointer from integer without a cast src/ipathverbs.c:160: warning: implicit declaration of function `sysfs_get_device_attr' src/ipathverbs.c:160: warning: assignment makes pointer from integer without a cast src/ipathverbs.c:163: error: dereferencing pointer to incomplete type src/ipathverbs.c:164: warning: implicit declaration of function `sysfs_close_attribute' src/ipathverbs.c:166: warning: assignment makes pointer from integer without a cast src/ipathverbs.c:169: error: dereferencing pointer to incomplete type src/ipathverbs.c:183: error: dereferencing pointer to incomplete type -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Jun 20 10:24:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jun 2006 13:24:28 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_sa_link_record.c: Only need base LID rather than LID range in __osm_lr_rcv_get_physp_link Message-ID: <1150824264.4391.127940.camel@hal.voltaire.com> OpenSM/osm_sa_link_record.c: Only need base LID rather than LID range in __osm_lr_rcv_get_physp_link Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_link_record.c =================================================================== --- opensm/osm_sa_link_record.c (revision 8140) +++ opensm/osm_sa_link_record.c (working copy) @@ -166,13 +166,10 @@ __osm_lr_rcv_build_physp_link( /********************************************************************** **********************************************************************/ static void -__get_lid_range( +__get_base_lid( IN const osm_physp_t* p_physp, - OUT uint16_t * p_base_lid, - OUT uint16_t * p_max_lid ) + OUT uint16_t * p_base_lid ) { - 
uint8_t lmc; - if(p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) { *p_base_lid = @@ -180,14 +177,11 @@ __get_lid_range( osm_physp_get_base_lid( osm_node_get_physp_ptr(p_physp->p_node, 0)) ); - *p_max_lid = *p_base_lid; } else { *p_base_lid = cl_ntoh16(osm_physp_get_base_lid(p_physp)); - lmc = osm_physp_get_lmc( p_physp ); - *p_max_lid = (uint16_t)(*p_base_lid + (1<<lmc)-1); } } OSM_LOG_ENTER( p_rcv->p_log, __osm_lr_rcv_get_physp_link ); @@ -312,8 +304,8 @@ __osm_lr_rcv_get_physp_link( dest_port_num ); } - __get_lid_range(p_src_physp, &from_base_lid_ho, &from_max_lid_ho); - __get_lid_range(p_dest_physp, &to_base_lid_ho, &to_max_lid_ho); + __get_base_lid(p_src_physp, &from_base_lid_ho); + __get_base_lid(p_dest_physp, &to_base_lid_ho); __osm_lr_rcv_build_physp_link(p_rcv, cl_ntoh16(from_base_lid_ho), cl_ntoh16(to_base_lid_ho), From halr at voltaire.com Tue Jun 20 10:42:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jun 2006 13:42:18 -0400 Subject: [openib-general] ib_gid lookup In-Reply-To: <1150798111.4391.111384.camel@hal.voltaire.com> References: <1150798111.4391.111384.camel@hal.voltaire.com> Message-ID: <1150825337.4391.128609.camel@hal.voltaire.com> Hi again Amit, On Tue, 2006-06-20 at 06:08, Hal Rosenstock wrote: > Hi Amit, > > On Mon, 2006-06-19 at 20:36, Amit Byron wrote: > > hello, > > i'm trying to find whether i can do a lookup of ib_gid by either > > node name or node's ip address. is this information available from > > the subnet manager? > > The SM doesn't know the node name but you might be able to do this by > NodeDescription depending on how the subnet was setup (the > NodeDescriptions would need to be made unique on each node; a script for > this was supplied for mthca; there is also a current standards issue > with the SM detecting that these had changed which is being worked on).
> If that were to be done, the SA could be queried by NodeDescription > which would return a NodeRecord which would obtain the NodeInfo which > includes the NodeGUID and PortGUID. Note it also returns the base LID as > well. > > The SM does not know the IP addresses unless they are registered by DAPL > (via ServiceRecords) but I'm not sure that is done anymore or whether > DAPL runs in your environment. Generating an ARP to the IP address could resolve the GID. This API is exposed through the RDMA CM (in both kernel and user space). That might be your best option. -- Hal > -- Hal > > > thanks, > > Amit. > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From koop at cse.ohio-state.edu Tue Jun 20 11:35:09 2006 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Tue, 20 Jun 2006 14:35:09 -0400 (EDT) Subject: [openib-general] [mvapich-discuss] mvapich xhpl memory usage In-Reply-To: <0D6FBA307D01EA42BAC8715725643AA01EDB53@EXCHG2003.microtech-ks.com> Message-ID: Brady, It appears that the OFED 1.0 release uses a script other than the make.mvapich.gen2 script to specify the CFLAGS before building the RPMs. When using the default MVAPICH from our website/svn make.mvapich.gen2 is still correct though. For this reason, your change did not update the compilation flags. To change the OFED 1.0 CFLAGS you will need to edit the "mvapich.make" script (instead of make.mvapich.gen2) in mvapich-0.9.7-mlx2.1.0 and remove "-DLAZY_MEM_UNREGISTER" from line 308. 
Please let us know if you have any other questions or if this does not solve your issue. Thanks, Matthew Koop - Network-Based Computing Laboratory Ohio State University > Hello, I installed OFED 1.0 (mvapich 0.9.7) and compiled the Linpack > benchmark. When I run xhpl, the memory usage creeps up with each NB, > and as each N changes the allocated memory is not freed. > LAZY_MEM_UNREGISTER is not defined per > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-March/000057.html > I removed it from make.mvapich.gen2, tarred it back up, and reran the > install. From swise at opengridcomputing.com Tue Jun 20 13:03:08 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:03:08 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP changes to libibverbs. In-Reply-To: <20060620200304.20092.44110.stgit@stevo-desktop> References: <20060620200304.20092.44110.stgit@stevo-desktop> Message-ID: <20060620200308.20092.76324.stgit@stevo-desktop> Cache the node type (iWARP vs IB) in the ib_device struct to enable transport-dependent logic.
--- libibverbs/include/infiniband/verbs.h | 44 ++++++++++++++++++++++++++++++++- libibverbs/src/device.c | 16 ++++++++++++ 2 files changed, 59 insertions(+), 1 deletions(-) diff --git a/libibverbs/include/infiniband/verbs.h b/libibverbs/include/infiniband/verbs.h index 7679436..0ff97e9 100644 --- a/libibverbs/include/infiniband/verbs.h +++ b/libibverbs/include/infiniband/verbs.h @@ -66,9 +66,17 @@ union ibv_gid { }; enum ibv_node_type { + IBV_NODE_UNKNOWN=-1, IBV_NODE_CA = 1, IBV_NODE_SWITCH, - IBV_NODE_ROUTER + IBV_NODE_ROUTER, + IBV_NODE_RNIC +}; + +enum ibv_transport_type { + IBV_TRANSPORT_UNKNOWN=0, + IBV_TRANSPORT_IB=1, + IBV_TRANSPORT_IWARP=2 }; enum ibv_device_cap_flags { @@ -574,6 +582,7 @@ enum { struct ibv_device { struct ibv_driver *driver; + enum ibv_node_type node_type; struct ibv_device_ops ops; /* Name of underlying kernel IB device, eg "mthca0" */ char name[IBV_SYSFS_NAME_MAX]; @@ -673,6 +682,39 @@ const char *ibv_get_device_name(struct i uint64_t ibv_get_device_guid(struct ibv_device *device); /** + * ibv_get_transport_type - Return device's network transport type + */ +static inline enum ibv_transport_type +ibv_get_transport_type(struct ibv_context *context) +{ + if (!context->device) + return IBV_TRANSPORT_UNKNOWN; + + switch (context->device->node_type) { + case IBV_NODE_CA: + case IBV_NODE_SWITCH: + case IBV_NODE_ROUTER: + return IBV_TRANSPORT_IB; + case IBV_NODE_RNIC: + return IBV_TRANSPORT_IWARP; + default: + return IBV_TRANSPORT_UNKNOWN; + } +} + +/** + * ibv_get_node_type - Return device's node type + */ +static inline enum ibv_node_type +ibv_get_node_type(struct ibv_context *context) +{ + if (!context->device) + return IBV_NODE_UNKNOWN; + + return context->device->node_type; +} + +/** * ibv_open_device - Initialize device for use */ struct ibv_context *ibv_open_device(struct ibv_device *device); diff --git a/libibverbs/src/device.c b/libibverbs/src/device.c index de97d4d..f08059e 100644 --- a/libibverbs/src/device.c +++ 
b/libibverbs/src/device.c @@ -107,6 +107,20 @@ uint64_t ibv_get_device_guid(struct ibv_ return htonll(guid); } +static enum ibv_node_type query_node_type(struct ibv_device *device) +{ + char node_desc[24]; + char node_str[24]; + int node_type; + + if (ibv_read_sysfs_file(device->ibdev_path, "node_type", + node_desc, sizeof(node_desc)) < 0) + return IBV_NODE_UNKNOWN; + + sscanf(node_desc, "%d: %s\n", (int*)&node_type, node_str); + return (enum ibv_node_type) node_type; +} + struct ibv_context *ibv_open_device(struct ibv_device *device) { char *devpath; @@ -125,6 +139,8 @@ struct ibv_context *ibv_open_device(stru if (cmd_fd < 0) return NULL; + device->node_type = query_node_type(device); + context = device->ops.alloc_context(device, cmd_fd); if (!context) goto err; From swise at opengridcomputing.com Tue Jun 20 13:03:04 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:03:04 -0500 Subject: [openib-general] [PATCH v2 0/2] [RFC] iWARP Core Usermode Support Message-ID: <20060620200304.20092.44110.stgit@stevo-desktop> This patchset defines the modifications to the Open Fabrics gen2 userspace tree to support iWARP devices. This is the 2nd review of most of these changes and we have incorporated all comments from the 1st review. We're submitting it for review with the goal for inclusion in the gen2 svn trunk. It is not dependent on the kernel iWARP patchset currently under review, so we could commit this to the svn trunk now if desired. This patchset is based on revision 7620 of the svn trunk. It consists of 2 patches: 1 - Changes to libibverbs/ 2 - Changes to librdmacm/ Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Tue Jun 20 13:03:12 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:03:12 -0500 Subject: [openib-general] [PATCH v2 2/2] iWARP changes to librdmacm. 
In-Reply-To: <20060620200304.20092.44110.stgit@stevo-desktop> References: <20060620200304.20092.44110.stgit@stevo-desktop> Message-ID: <20060620200312.20092.87834.stgit@stevo-desktop> For iWARP, rdma_disconnect() moves the QP to SQD instead of ERR. The iWARP providers map SQD to the RDMAC verbs CLOSING state. --- librdmacm/src/cma.c | 22 +++++++++++++++++++++- 1 files changed, 21 insertions(+), 1 deletions(-) diff --git a/librdmacm/src/cma.c b/librdmacm/src/cma.c index e99d15c..a250f69 100644 --- a/librdmacm/src/cma.c +++ b/librdmacm/src/cma.c @@ -633,6 +633,17 @@ static int ucma_modify_qp_rts(struct rdm return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); } +static int ucma_modify_qp_sqd(struct rdma_cm_id *id) +{ + struct ibv_qp_attr qp_attr; + + if (!id->qp) + return 0; + + qp_attr.qp_state = IBV_QPS_SQD; + return ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE); +} + static int ucma_modify_qp_err(struct rdma_cm_id *id) { struct ibv_qp_attr qp_attr; @@ -881,7 +892,16 @@ int rdma_disconnect(struct rdma_cm_id *i void *msg; int ret, size; - ret = ucma_modify_qp_err(id); + switch (ibv_get_transport_type(id->verbs)) { + case IBV_TRANSPORT_IB: + ret = ucma_modify_qp_err(id); + break; + case IBV_TRANSPORT_IWARP: + ret = ucma_modify_qp_sqd(id); + break; + default: + ret = -EINVAL; + } if (ret) return ret; From swise at opengridcomputing.com Tue Jun 20 13:04:30 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:04:30 -0500 Subject: [openib-general] [PATCH v1 0/2] [RFC] Ammasso 1100 iWARP Library Message-ID: <20060620200430.20732.58792.stgit@stevo-desktop> This patchset implements a user verbs library for the Ammasso 1100 device. We're submitting it for review with the goal for inclusion in the gen2 trunk. 
Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Tue Jun 20 13:04:39 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:04:39 -0500 Subject: [openib-general] [PATCH v1 2/2] AMSO1100 Makefiles. In-Reply-To: <20060620200430.20732.58792.stgit@stevo-desktop> References: <20060620200430.20732.58792.stgit@stevo-desktop> Message-ID: <20060620200439.20732.71569.stgit@stevo-desktop> --- libamso/Makefile.am | 27 +++++++++++++++++++++++ libamso/autogen.sh | 8 +++++++ libamso/configure.in | 41 ++++++++++++++++++++++++++++++++++ libamso/libamso.spec.in | 56 +++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 132 insertions(+), 0 deletions(-) diff --git a/libamso/Makefile.am b/libamso/Makefile.am new file mode 100644 index 0000000..9e2cbc1 --- /dev/null +++ b/libamso/Makefile.am @@ -0,0 +1,27 @@ +# $Id: $ + +amsolibdir = $(libdir)/infiniband + +amsolib_LTLIBRARIES = src/amso.la + +src_amso_la_CFLAGS = -g -Wall -D_GNU_SOURCE + +if HAVE_LD_VERSION_SCRIPT + amso_version_script = -Wl,--version-script=$(srcdir)/src/amso.map +else + amso_version_script = +endif + +src_amso_la_SOURCES = src/cq.c src/amso.c src/qp.c \ + src/verbs.c +src_amso_la_LDFLAGS = -avoid-version -module \ + $(amso_version_script) + +#DEBIAN = debian/changelog debian/compat debian/control debian/copyright \ +# debian/libamso1.install debian/libamso-dev.install debian/rules + +EXTRA_DIST = src/amso.h src/amso-abi.h \ + src/amso.map libamso.spec.in $(DEBIAN) + +dist-hook: libamso.spec + cp libamso.spec $(distdir) diff --git a/libamso/autogen.sh b/libamso/autogen.sh new file mode 100755 index 0000000..fd47839 --- /dev/null +++ b/libamso/autogen.sh @@ -0,0 +1,8 @@ +#! 
/bin/sh + +set -x +aclocal -I config +libtoolize --force --copy +autoheader +automake --foreign --add-missing --copy +autoconf diff --git a/libamso/configure.in b/libamso/configure.in new file mode 100644 index 0000000..4a920c4 --- /dev/null +++ b/libamso/configure.in @@ -0,0 +1,41 @@ +dnl Process this file with autoconf to produce a configure script. + +AC_PREREQ(2.57) +AC_INIT(libamso, 1.0-rc4, openib-general at openib.org) +AC_CONFIG_SRCDIR([src/amso.h]) +AC_CONFIG_AUX_DIR(config) +AM_CONFIG_HEADER(config.h) +AM_INIT_AUTOMAKE(libamso, 1.0-rc4) +AM_PROG_LIBTOOL + +dnl Checks for programs +AC_PROG_CC + +dnl Checks for libraries +AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], + AC_MSG_ERROR([ibv_get_device_list() not found. libamso requires libibverbs.])) + +dnl Checks for header files. +AC_CHECK_HEADERS(sysfs/libsysfs.h) +AC_CHECK_HEADER(infiniband/driver.h, [], + AC_MSG_ERROR([<infiniband/driver.h> not found. Is libibverbs installed?])) +AC_HEADER_STDC + +dnl Checks for typedefs, structures, and compiler characteristics.
+AC_C_CONST +AC_CHECK_SIZEOF(long) + +dnl Checks for library functions +AC_CHECK_FUNCS(ibv_read_sysfs_file) + +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, + if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then + ac_cv_version_script=yes + else + ac_cv_version_script=no + fi) + +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") + +AC_CONFIG_FILES([Makefile libamso.spec]) +AC_OUTPUT diff --git a/libamso/libamso.spec.in b/libamso/libamso.spec.in new file mode 100644 index 0000000..1bbb9cb --- /dev/null +++ b/libamso/libamso.spec.in @@ -0,0 +1,56 @@ +# $Id: $ + +%define ver @VERSION@ + +Name: libamso +Version: 1.0 +Release: 0.2.rc4%{?dist} +Summary: AMSO1100 Userspace Library + +Group: System Environment/Libraries +License: GPL/BSD +Url: http://openib.org/ +Source: http://openib.org/downloads/%{name}-%{ver}.tar.gz +BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) +BuildRequires: libibverbs-devel + +%description +libamso provides a device-specific userspace driver for Ammasso RNICs +for use with the libibverbs library. + +%package devel +Summary: Development files for the libamso driver +Group: System Environment/Libraries +Requires: %{name} = %{version}-%{release} + +%description devel +Static version of libamso that may be linked directly to an +application, which may be useful for debugging.
+ +%prep +%setup -q -n %{name}-%{ver} + +%build +%configure +make %{?_smp_mflags} + +%install +rm -rf $RPM_BUILD_ROOT +%makeinstall +# remove unpackaged files from the buildroot +rm -f $RPM_BUILD_ROOT%{_libdir}/infiniband/*.la + +%clean +rm -rf $RPM_BUILD_ROOT + +%files +%defattr(-,root,root,-) +%{_libdir}/infiniband/amso.so +%doc AUTHORS COPYING ChangeLog README + +%files devel +%defattr(-,root,root,-) +%{_libdir}/infiniband/amso.a + +%changelog From swise at opengridcomputing.com Tue Jun 20 13:04:34 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:04:34 -0500 Subject: [openib-general] [PATCH v1 1/2] AMSO1100 Verbs Library. In-Reply-To: <20060620200430.20732.58792.stgit@stevo-desktop> References: <20060620200430.20732.58792.stgit@stevo-desktop> Message-ID: <20060620200434.20732.99171.stgit@stevo-desktop> This code implements user verbs for the Ammasso 1100 device. This library doesn't do kernel bypass (but it could someday). --- libamso/src/amso-abi.h | 79 +++++++++++++ libamso/src/amso.c | 180 +++++++++++++++++++++++++++++ libamso/src/amso.h | 156 +++++++++++++++++++++++++ libamso/src/amso.map | 6 + libamso/src/cq.c | 57 +++++++++ libamso/src/qp.c | 55 +++++++++ libamso/src/verbs.c | 303 ++++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 836 insertions(+), 0 deletions(-) diff --git a/libamso/src/amso-abi.h b/libamso/src/amso-abi.h new file mode 100644 index 0000000..a3df617 --- /dev/null +++ b/libamso/src/amso-abi.h @@ -0,0 +1,79 @@ +/* + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef AMSO_ABI_H +#define AMSO_ABI_H + +#include + +struct amso_alloc_ucontext_resp { + struct ibv_get_context_resp ibv_resp; +}; + +struct amso_alloc_pd_resp { + struct ibv_alloc_pd_resp ibv_resp; +}; + +struct amso_create_cq { + struct ibv_create_cq ibv_cmd; +}; + + +struct amso_create_cq_resp { + struct ibv_create_cq_resp ibv_resp; + __u32 cqid; + __u32 entries; + __u64 physaddr; /* library mmaps this to get addressability */ + __u64 queue; +}; + +struct amso_create_qp { + struct ibv_create_qp ibv_cmd; +}; + +struct amso_create_qp_resp { + struct ibv_create_qp_resp ibv_resp; + __u32 qpid; + __u32 entries; /* actual number of entries after creation */ + __u64 physaddr; /* library mmaps this to get addressability */ + __u64 physsize; /* library mmaps this to get addressability */ + __u64 queue; +}; + + +struct t3_cqe { + __u32 header:32; + __u32 len:32; + __u32 wrid_hi_stag:32; + __u32 wrid_low_msn:32; +}; + +#endif /* AMSO_ABI_H */ diff --git a/libamso/src/amso.c b/libamso/src/amso.c new file mode 100644 index 0000000..c017281 --- /dev/null +++ b/libamso/src/amso.c @@ -0,0 +1,180 @@ +/* + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include + +#include "amso.h" +#include "amso-abi.h" + +#define PCI_VENDOR_ID_AMSO 0x18b8 +#define PCI_DEVICE_ID_AMSO_1100 0xb001 + +#define HCA(v, d, t) \ + { .vendor = PCI_VENDOR_ID_##v, \ + .device = PCI_DEVICE_ID_AMSO_##d, \ + .type = AMSO_##t } + +struct { + unsigned vendor; + unsigned device; + enum amso_hca_type type; +} hca_table[] = { + HCA(AMSO, 1100, 1100), +}; + +static struct ibv_context_ops amso_ctx_ops = { + .query_device = amso_query_device, + .query_port = amso_query_port, + .alloc_pd = amso_alloc_pd, + .dealloc_pd = amso_free_pd, + .reg_mr = amso_reg_mr, + .dereg_mr = amso_dereg_mr, + .create_cq = amso_create_cq, + .resize_cq = amso_resize_cq, + .poll_cq = amso_poll_cq, + .destroy_cq = amso_destroy_cq, + .create_srq = amso_create_srq, + .modify_srq = amso_modify_srq, + .destroy_srq = amso_destroy_srq, + .create_qp = amso_create_qp, + .modify_qp = amso_modify_qp, + .destroy_qp = amso_destroy_qp, + .create_ah = amso_create_ah, + .destroy_ah = amso_destroy_ah, + .attach_mcast = amso_attach_mcast, + .detach_mcast = amso_detach_mcast +}; + +static struct ibv_context *amso_alloc_context(struct ibv_device *ibdev, + int 
cmd_fd) +{ + struct amso_context *context; + struct ibv_get_context cmd; + struct amso_alloc_ucontext_resp resp; + + context = malloc(sizeof *context); + if (!context) + return NULL; + + context->ibv_ctx.cmd_fd = cmd_fd; + + if (ibv_cmd_get_context(&context->ibv_ctx, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp)) + goto err_free; + + context->ibv_ctx.device = ibdev; + context->ibv_ctx.ops = amso_ctx_ops; + context->ibv_ctx.ops.req_notify_cq = amso_arm_cq; + context->ibv_ctx.ops.cq_event = NULL; + context->ibv_ctx.ops.post_send = amso_post_send; + context->ibv_ctx.ops.post_recv = amso_post_recv; + context->ibv_ctx.ops.post_srq_recv = amso_post_srq_recv; + + return &context->ibv_ctx; +err_free: + free(context); + return NULL; +} + +static void amso_free_context(struct ibv_context *ibctx) +{ + struct amso_context *context = to_amso_ctx(ibctx); + + free(context); +} + +static struct ibv_device_ops amso_dev_ops = { + .alloc_context = amso_alloc_context, + .free_context = amso_free_context +}; + +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) +{ + char value[8]; + struct amso_device *dev; + unsigned vendor, device; + int i; + + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) + return NULL; + sscanf(value, "%i", &vendor); + + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) + return NULL; + sscanf(value, "%i", &device); + + + for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) + if (vendor == hca_table[i].vendor && + device == hca_table[i].device) + goto found; + + return NULL; + +found: + dev = malloc(sizeof *dev); + if (!dev) { + return NULL; + } + + dev->ibv_dev.ops = amso_dev_ops; + dev->hca_type = hca_table[i].type; + dev->page_size = sysconf(_SC_PAGESIZE); + + return &dev->ibv_dev; +} + +#ifdef HAVE_SYSFS_LIBSYSFS_H +struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +{ + int abi_ver = 0; + char value[8]; + + if 
(ibv_read_sysfs_file(sysdev->path, "abi_version", + value, sizeof value) > 0) + abi_ver = strtol(value, NULL, 10); + + return ibv_driver_init(sysdev->path, abi_ver); +} +#endif /* HAVE_SYSFS_LIBSYSFS_H */ diff --git a/libamso/src/amso.h b/libamso/src/amso.h new file mode 100644 index 0000000..eea4319 --- /dev/null +++ b/libamso/src/amso.h @@ -0,0 +1,156 @@ +/* + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef AMSO_H +#define AMSO_H + +#include +#include + +#define HIDDEN __attribute__((visibility ("hidden"))) + +#define PFX "amso: " + +enum amso_hca_type { + AMSO_1100 +}; + +struct amso_device { + struct ibv_device ibv_dev; + enum amso_hca_type hca_type; + int page_size; +}; + +struct amso_context { + struct ibv_context ibv_ctx; +}; + +struct amso_pd { + struct ibv_pd ibv_pd; +}; + +struct amso_cq { + struct ibv_cq ibv_cq; + __u32 cqid; + __u32 entries; + __u64 physaddr; + __u64 queue; +}; + +struct amso_qp { + struct ibv_qp ibv_qp; + __u32 qpid; + __u32 entries; + __u64 physaddr; + __u64 physsize; + __u64 queue; +}; + +#define to_amso_xxx(xxx, type) \ + ((struct amso_##type *) \ + ((void *) ib##xxx - offsetof(struct amso_##type, ibv_##xxx))) + +static inline struct amso_device *to_amso_dev(struct ibv_device *ibdev) +{ + return to_amso_xxx(dev, device); +} + +static inline struct amso_context *to_amso_ctx(struct ibv_context *ibctx) +{ + return to_amso_xxx(ctx, context); +} + +static inline struct amso_pd *to_amso_pd(struct ibv_pd *ibpd) +{ + return to_amso_xxx(pd, pd); +} + +static inline struct amso_cq *to_amso_cq(struct ibv_cq *ibcq) +{ + return to_amso_xxx(cq, cq); +} + +static inline struct amso_qp *to_amso_qp(struct ibv_qp *ibqp) +{ + return to_amso_xxx(qp, qp); +} + + +extern int amso_query_device(struct ibv_context *context, + struct ibv_device_attr *attr); +extern int amso_query_port(struct ibv_context *context, uint8_t port, + struct ibv_port_attr *attr); + +extern struct ibv_pd *amso_alloc_pd(struct ibv_context *context); +extern int amso_free_pd(struct ibv_pd *pd); + +extern struct ibv_mr *amso_reg_mr(struct ibv_pd *pd, void *addr, + size_t length, enum ibv_access_flags access); +extern int amso_dereg_mr(struct ibv_mr *mr); + +struct ibv_cq *amso_create_cq(struct ibv_context *context, int cqe, + struct ibv_comp_channel *channel, + int comp_vector); +extern int amso_resize_cq(struct ibv_cq *cq, int cqe); +extern int amso_destroy_cq(struct ibv_cq 
*cq); +extern int amso_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); +extern int amso_arm_cq(struct ibv_cq *cq, int solicited); +extern void amso_cq_event(struct ibv_cq *cq); +extern void amso_init_cq_buf(struct amso_cq *cq, int nent); + +extern struct ibv_srq *amso_create_srq(struct ibv_pd *pd, + struct ibv_srq_init_attr *attr); +extern int amso_modify_srq(struct ibv_srq *srq, + struct ibv_srq_attr *attr, + enum ibv_srq_attr_mask mask); +extern int amso_destroy_srq(struct ibv_srq *srq); +extern int amso_post_srq_recv(struct ibv_srq *ibsrq, + struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); + +extern struct ibv_qp *amso_create_qp(struct ibv_pd *pd, + struct ibv_qp_init_attr *attr); +extern int amso_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask); +extern int amso_destroy_qp(struct ibv_qp *qp); +extern int amso_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, + struct ibv_send_wr **bad_wr); +extern int amso_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); +extern struct ibv_ah *amso_create_ah(struct ibv_pd *pd, + struct ibv_ah_attr *ah_attr); +extern int amso_destroy_ah(struct ibv_ah *ah); +extern int amso_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, + uint16_t lid); +extern int amso_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, + uint16_t lid); + +#endif /* AMSO_H */ diff --git a/libamso/src/amso.map b/libamso/src/amso.map new file mode 100644 index 0000000..59a8bae --- /dev/null +++ b/libamso/src/amso.map @@ -0,0 +1,6 @@ +{ + global: + ibv_driver_init; + openib_driver_init; + local: *; +}; diff --git a/libamso/src/cq.c b/libamso/src/cq.c new file mode 100644 index 0000000..65360ce --- /dev/null +++ b/libamso/src/cq.c @@ -0,0 +1,57 @@ +/* + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include + +#include "amso.h" +#include "amso-abi.h" + + +int amso_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +{ + return ibv_cmd_poll_cq(ibcq, ne, wc); +} + + +int amso_arm_cq(struct ibv_cq *cq, int solicited) +{ + return ibv_cmd_req_notify_cq(cq, solicited); +} + + diff --git a/libamso/src/qp.c b/libamso/src/qp.c new file mode 100644 index 0000000..e0d99bb --- /dev/null +++ b/libamso/src/qp.c @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include "amso.h" +#include + +int amso_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, + struct ibv_send_wr **bad_wr) +{ + return ibv_cmd_post_send(ibqp, wr, bad_wr); +} + +int amso_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + return ibv_cmd_post_recv(ibqp, wr, bad_wr); +} + diff --git a/libamso/src/verbs.c b/libamso/src/verbs.c new file mode 100644 index 0000000..1cd79d8 --- /dev/null +++ b/libamso/src/verbs.c @@ -0,0 +1,303 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include + +#include "amso.h" +#include "amso-abi.h" + + +int amso_query_device(struct ibv_context *context, struct ibv_device_attr *attr) +{ + struct ibv_query_device cmd; + uint64_t raw_fw_ver; + unsigned major, minor, sub_minor; + int ret; + + ret = + ibv_cmd_query_device(context, attr, &raw_fw_ver, &cmd, sizeof cmd); + if (ret) + return ret; + + major = (raw_fw_ver >> 32) & 0xffff; + minor = (raw_fw_ver >> 16) & 0xffff; + sub_minor = raw_fw_ver & 0xffff; + + snprintf(attr->fw_ver, sizeof attr->fw_ver, + "%d.%d.%d", major, minor, sub_minor); + + return 0; +} + +int amso_query_port(struct ibv_context *context, uint8_t port, + struct ibv_port_attr *attr) +{ + struct ibv_query_port cmd; + + return ibv_cmd_query_port(context, port, attr, &cmd, sizeof cmd); +} + +struct ibv_pd *amso_alloc_pd(struct ibv_context *context) +{ + struct ibv_alloc_pd cmd; + struct amso_alloc_pd_resp resp; + struct amso_pd *pd; + + pd = malloc(sizeof *pd); + if (!pd) + return NULL; + + if (ibv_cmd_alloc_pd(context, &pd->ibv_pd, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp)) { + free(pd); + return NULL; + } + + return &pd->ibv_pd; +} + +int amso_free_pd(struct ibv_pd *pd) +{ + int ret; + + ret = ibv_cmd_dealloc_pd(pd); + if (ret) + return ret; + + free(pd); + return 0; +} + +static struct ibv_mr *__amso_reg_mr(struct ibv_pd *pd, void *addr, + size_t length, uint64_t hca_va, + enum ibv_access_flags access) +{ + struct ibv_mr *mr; + struct ibv_reg_mr cmd; + + mr = malloc(sizeof *mr); + if (!mr) + return NULL; + + if (ibv_cmd_reg_mr(pd, addr, length, hca_va, + access, mr, &cmd, sizeof cmd)) { + free(mr); + return NULL; + } + + return mr; +} + +struct ibv_mr *amso_reg_mr(struct ibv_pd *pd, void *addr, + size_t length, enum ibv_access_flags access) +{ + return __amso_reg_mr(pd, addr, length, (uintptr_t) addr, access); +} + +int amso_dereg_mr(struct ibv_mr *mr) +{ + int 
ret; + + ret = ibv_cmd_dereg_mr(mr); + if (ret) + return ret; + + free(mr); + return 0; +} + +struct ibv_cq *amso_create_cq(struct ibv_context *context, int cqe, + struct ibv_comp_channel *channel, int comp_vector) +{ + struct amso_create_cq cmd; + struct amso_create_cq_resp resp; + struct amso_cq *cq; + int ret; + + cq = malloc(sizeof *cq); + if (!cq) { + goto err; + } + + ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, + &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) + goto err; + +#if 0 /* A reminder for bypass functionality */ + cq->physaddr = resp.physaddr; + cq->queue = + (unsigned long) mmap(NULL, cqe * sizeof(struct t3_cqe), PROT_WRITE, + MAP_SHARED, context->cmd_fd, cq->physaddr); +#endif + + return &cq->ibv_cq; + + +err: + free(cq); + + return NULL; +} + +int amso_resize_cq(struct ibv_cq *cq, int cqe) +{ + int ret; + struct ibv_resize_cq cmd; + + ret = ibv_cmd_resize_cq(cq, cqe, &cmd, sizeof cmd); + if (ret) + return ret; + /* We will need to unmap and remap when we implement user mode */ + + return 0; +} + +int amso_destroy_cq(struct ibv_cq *cq) +{ + int ret; + + ret = ibv_cmd_destroy_cq(cq); + if (ret) + return ret; + + return 0; +} + +struct ibv_srq *amso_create_srq(struct ibv_pd *pd, + struct ibv_srq_init_attr *attr) +{ + return (void *) -ENOSYS; +} + +int amso_modify_srq(struct ibv_srq *srq, + struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask) +{ + return -ENOSYS; +} + +int amso_destroy_srq(struct ibv_srq *srq) +{ + return -ENOSYS; +} + +int amso_post_srq_recv(struct ibv_srq *ibsrq, + struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr) +{ + return -ENOSYS; +} + +struct ibv_qp *amso_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) +{ + struct amso_create_qp cmd; + struct amso_create_qp_resp resp; + struct amso_qp *qp; + int ret; + + /* Sanity check QP size before proceeding */ + if (attr->cap.max_send_wr > 65536 || + attr->cap.max_recv_wr > 65536 || + attr->cap.max_send_sge > 4 || + 
attr->cap.max_recv_sge > 4 || attr->cap.max_inline_data > 1024) + return NULL; + + qp = malloc(sizeof *qp); + if (!qp) + return NULL; + + ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) { + free(qp); + return NULL; + } + +#if 0 /* A reminder for bypass functionality */ + qp->physaddr = resp.physaddr; +#endif + + return &qp->ibv_qp; +} + +int amso_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask) +{ + struct ibv_modify_qp cmd; + + return ibv_cmd_modify_qp(qp, attr, attr_mask, &cmd, sizeof cmd); +} + +int amso_destroy_qp(struct ibv_qp *qp) +{ + int ret; + + ret = ibv_cmd_destroy_qp(qp); + if (ret) + return ret; + + free(qp); + + return 0; +} + +struct ibv_ah *amso_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) +{ + return (void *) -ENOSYS; +} + +int amso_destroy_ah(struct ibv_ah *ah) +{ + return -ENOSYS; +} + +int amso_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +{ + return -ENOSYS; +} + +int amso_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +{ + return -ENOSYS; +} + From swise at opengridcomputing.com Tue Jun 20 13:24:42 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:24:42 -0500 Subject: [openib-general] [PATCH v3 0/2][RFC] iWARP Core Support Message-ID: <20060620202442.28922.27402.stgit@stevo-desktop> This patchset defines the modifications to the Linux infiniband subsystem to support iWARP devices. We're submitting it for review now with the goal of inclusion in the 2.6.19 kernel. This code has gone through several reviews on the openib-general list. Now we are submitting it for external review by the Linux community. This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.19 branch. The patchset consists of 2 patches: 1 - New iWARP CM implementation. 2 - Core changes to support iWARP. I believe I've addressed all the round 1 and 2 review comments.
Details of the changes are tracked in each patch comment. Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Tue Jun 20 13:24:47 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:24:47 -0500 Subject: [openib-general] [PATCH v3 1/2] iWARP Connection Manager. In-Reply-To: <20060620202442.28922.27402.stgit@stevo-desktop> References: <20060620202442.28922.27402.stgit@stevo-desktop> Message-ID: <20060620202447.28922.42550.stgit@stevo-desktop> This patch provides the new files implementing the iWARP Connection Manager. This module is a logical instance of the xx_cm where xx is the transport type (ib or iw). The symbols exported are used by the transport-independent rdma_cm module, and are also available to transport-dependent ULPs. V2 Review Changes: - BUG_ON(1) -> BUG() - Don't typecast when assigning between something* and void* - pre-allocate iwcm_work objects to avoid allocating them in the interrupt context. - copy private data on connect request and connect reply events. - #if !defined() -> #ifndef V1 Review Changes: - sizeof -> sizeof() - removed printks - removed TT debug code - cleaned up lock/unlock around switch statements. - waitqueue -> completion for destroy path. --- drivers/infiniband/core/iwcm.c | 1008 ++++++++++++++++++++++++++++++++++++++++ include/rdma/iw_cm.h | 255 ++++++++++ include/rdma/iw_cm_private.h | 63 +++ 3 files changed, 1326 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c new file mode 100644 index 0000000..fe43c00 --- /dev/null +++ b/drivers/infiniband/core/iwcm.c @@ -0,0 +1,1008 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc.
All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("iWARP CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static struct workqueue_struct *iwcm_wq; +struct iwcm_work { + struct work_struct work; + struct iwcm_id_private *cm_id; + struct list_head list; + struct iw_cm_event event; + struct list_head free_list; +}; + +/* + * The following services provide a mechanism for pre-allocating iwcm_work + * elements. 
The design pre-allocates them based on the cm_id type: + * LISTENING IDS: Get enough elements preallocated to handle the + * listen backlog. + * ACTIVE IDS: 4: CONNECT_REPLY, ESTABLISHED, DISCONNECT, CLOSE + * PASSIVE IDS: 3: ESTABLISHED, DISCONNECT, CLOSE + * + * Allocating them in connect and listen avoids having to deal + * with allocation failures on the event upcall from the provider (which + * is called in the interrupt context). + * + * One exception is when creating the cm_id for incoming connection requests. + * There are two cases: + * 1) in the event upcall, cm_event_handler(), for a listening cm_id. If + * the backlog is exceeded, then no more connection request events will + * be processed. cm_event_handler() returns -ENOMEM in this case. It's up + * to the provider to reject the connection request. + * 2) in the connection request workqueue handler, cm_conn_req_handler(). + * If work elements cannot be allocated for the new connect request cm_id, + * then IWCM will call the provider reject method. This is ok since + * cm_conn_req_handler() runs in the workqueue thread context.
+ */ + +static struct iwcm_work *get_work(struct iwcm_id_private *cm_id_priv) +{ + struct iwcm_work *work; + + if (list_empty(&cm_id_priv->work_free_list)) + return NULL; + work = list_entry(cm_id_priv->work_free_list.next, struct iwcm_work, + free_list); + list_del_init(&work->free_list); + return work; +} + +static void put_work(struct iwcm_work *work) +{ + list_add(&work->free_list, &work->cm_id->work_free_list); +} + +static void dealloc_work_entries(struct iwcm_id_private *cm_id_priv) +{ + struct list_head *e, *tmp; + + list_for_each_safe(e, tmp, &cm_id_priv->work_free_list) + kfree(list_entry(e, struct iwcm_work, free_list)); +} + +static int alloc_work_entries(struct iwcm_id_private *cm_id_priv, int count) +{ + struct iwcm_work *work; + + BUG_ON(!list_empty(&cm_id_priv->work_free_list)); + while (count--) { + work = kmalloc(sizeof(struct iwcm_work), GFP_KERNEL); + if (!work) { + dealloc_work_entries(cm_id_priv); + return -ENOMEM; + } + work->cm_id = cm_id_priv; + INIT_LIST_HEAD(&work->list); + put_work(work); + } + return 0; +} + +/* + * Save private data from incoming connection requests in the + * cm_id_priv so the low level driver doesn't have to. Adjust + * the event ptr to point to the local copy. + */ +static int copy_private_data(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *event) +{ + void *p; + + p = kmalloc(event->private_data_len, GFP_ATOMIC); + if (!p) + return -ENOMEM; + memcpy(p, event->private_data, event->private_data_len); + event->private_data = p; + return 0; +} + +/* + * Release a reference on cm_id. If the last reference is being removed + * and iw_destroy_cm_id is waiting, wake up the waiting thread. 
+ */ +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + int ret = 0; + + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (atomic_dec_and_test(&cm_id_priv->refcount)) { + BUG_ON(!list_empty(&cm_id_priv->work_list)); + if (waitqueue_active(&cm_id_priv->destroy_comp.wait)) { + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, + &cm_id_priv->flags)); + ret = 1; + } + complete(&cm_id_priv->destroy_comp); + } + + return ret; +} + +static void add_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + atomic_inc(&cm_id_priv->refcount); +} + +static void rem_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + iwcm_deref_id(cm_id_priv); +} + +static int cm_event_handler(struct iw_cm_id *cm_id, struct iw_cm_event *event); + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = kzalloc(sizeof(*cm_id_priv), GFP_KERNEL); + if (!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->state = IW_CM_STATE_IDLE; + cm_id_priv->id.device = device; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + cm_id_priv->id.event_handler = cm_event_handler; + cm_id_priv->id.add_ref = add_ref; + cm_id_priv->id.rem_ref = rem_ref; + spin_lock_init(&cm_id_priv->lock); + atomic_set(&cm_id_priv->refcount, 1); + init_waitqueue_head(&cm_id_priv->connect_wait); + init_completion(&cm_id_priv->destroy_comp); + INIT_LIST_HEAD(&cm_id_priv->work_list); + INIT_LIST_HEAD(&cm_id_priv->work_free_list); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(iw_create_cm_id); + + +static int iwcm_modify_qp_err(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + if (!qp) + return -EINVAL; + + qp_attr.qp_state = IB_QPS_ERR; + return ib_modify_qp(qp, &qp_attr, 
IB_QP_STATE); +} + +/* + * This is really the RDMAC CLOSING state. It is most similar to the + * IB SQD QP state. + */ +static int iwcm_modify_qp_sqd(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + BUG_ON(qp == NULL); + qp_attr.qp_state = IB_QPS_SQD; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * CM_ID <-- CLOSING + * + * Block if a passive or active connection is currently being processed. Then + * process the event as follows: + * - If we are ESTABLISHED, move to CLOSING and modify the QP state + * based on the abrupt flag + * - If the connection is already in the CLOSING or IDLE state, the peer is + * disconnecting concurrently with us and we've already seen the + * DISCONNECT event -- ignore the request and return 0 + * - Disconnect on a listening endpoint returns -EINVAL + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + struct ib_qp *qp = NULL; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_CLOSING; + + /* QP could be for user-mode client */ + if (cm_id_priv->qp) + qp = cm_id_priv->qp; + else + ret = -EINVAL; + break; + case IW_CM_STATE_LISTEN: + ret = -EINVAL; + break; + case IW_CM_STATE_CLOSING: + /* remote peer closed first */ + case IW_CM_STATE_IDLE: + /* accept or connect returned !0 */ + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called disconnect before/without calling accept after + * connect_request event delivered.
+ */ + break; + case IW_CM_STATE_CONN_SENT: + /* Can only get here if wait above fails */ + default: + BUG(); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + if (qp) { + if (abrupt) + ret = iwcm_modify_qp_err(qp); + else + ret = iwcm_modify_qp_sqd(qp); + + /* + * If both sides are disconnecting the QP could + * already be in ERR or SQD states + */ + ret = 0; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); + +/* + * CM_ID <-- DESTROYING + * + * Clean up all resources associated with the connection and release + * the initial reference taken by iw_create_cm_id. + */ +static void destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall. A + * listening endpoint should never block here. */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_LISTEN: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* destroy the listening endpoint */ + ret = cm_id->device->iwcm->destroy_listen(cm_id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* Abrupt close of the connection */ + (void)iwcm_modify_qp_err(cm_id_priv->qp); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called destroy before/without calling accept after + * receiving connection request event notification. 
+ */ + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_DESTROYING: + default: + BUG(); + break; + } + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + (void)iwcm_deref_id(cm_id_priv); +} + +/* + * This function is only called by the application thread and cannot + * be called by the event thread. The function will wait for all + * references to be released on the cm_id and then kfree the cm_id + * object. + */ +void iw_destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); + + destroy_cm_id(cm_id); + + wait_for_completion(&cm_id_priv->destroy_comp); + + dealloc_work_entries(cm_id_priv); + + kfree(cm_id_priv); +} +EXPORT_SYMBOL(iw_destroy_cm_id); + +/* + * CM_ID <-- LISTEN + * + * Start listening for connect requests. Generates one CONNECT_REQUEST + * event for each inbound connect request. + */ +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + ret = alloc_work_entries(cm_id_priv, backlog); + if (ret) + return ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + cm_id_priv->state = IW_CM_STATE_LISTEN; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); + if (ret) + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + default: + ret = -EINVAL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} +EXPORT_SYMBOL(iw_cm_listen); + +/* + * CM_ID <-- IDLE + * + * Rejects an inbound connection request. 
No events are generated. + */ +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->reject(cm_id, private_data, + private_data_len); + + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} +EXPORT_SYMBOL(iw_cm_reject); + +/* + * CM_ID <-- ESTABLISHED + * + * Accepts an inbound connection request and generates an ESTABLISHED + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block + * until the ESTABLISHED event is received from the provider. 
+ */ +int iw_cm_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + struct ib_qp *qp; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->accept(cm_id, iw_param); + if (ret) { + /* An error on accept precludes provider events */ + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_accept); + +/* + * Active Side: CM_ID <-- CONN_SENT + * + * If successful, results in the generation of a CONNECT_REPLY + * event. iw_cm_disconnect and iw_destroy_cm_id will block until the + * CONNECT_REPLY event is received from the provider.
+ */ +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + int ret = 0; + unsigned long flags; + struct ib_qp *qp; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + ret = alloc_work_entries(cm_id_priv, 4); + if (ret) + return ret; + + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + cm_id_priv->state = IW_CM_STATE_CONN_SENT; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->connect(cm_id, iw_param); + if (ret) { + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + cm_id_priv->state = IW_CM_STATE_IDLE; + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_connect); + +/* + * Passive Side: new CM_ID <-- CONN_RECV + * + * Handles an inbound connect request. The function creates a new + * iw_cm_id to represent the new connection and inherits the client + * callback function and other attributes from the listening parent. + * + * The work item contains a pointer to the listen_cm_id and the event. The + * listen_cm_id contains the client cm_handler, context and + * device. These are copied when the device is cloned. The event + * contains the new four tuple.
+ * + * An error on the child should not affect the parent, so this + * function does not return a value. + */ +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + struct iw_cm_id *cm_id; + struct iwcm_id_private *cm_id_priv; + int ret; + + /* The provider should never generate a connection request + * event with a bad status. + */ + BUG_ON(iw_event->status); + + /* We could be destroying the listening id. If so, ignore this + * upcall. */ + spin_lock_irqsave(&listen_id_priv->lock, flags); + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + return; + } + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + + cm_id = iw_create_cm_id(listen_id_priv->id.device, + listen_id_priv->id.cm_handler, + listen_id_priv->id.context); + /* If the cm_id could not be created, ignore the request */ + if (IS_ERR(cm_id)) + return; + + cm_id->provider_data = iw_event->provider_data; + cm_id->local_addr = iw_event->local_addr; + cm_id->remote_addr = iw_event->remote_addr; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + cm_id_priv->state = IW_CM_STATE_CONN_RECV; + + ret = alloc_work_entries(cm_id_priv, 3); + if (ret) { + iw_cm_reject(cm_id, NULL, 0); + iw_destroy_cm_id(cm_id); + return; + } + + /* Call the client CM handler */ + ret = cm_id->cm_handler(cm_id, iw_event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(cm_id); + if (atomic_read(&cm_id_priv->refcount)==0) + kfree(cm_id_priv); + } + + if (iw_event->private_data_len) + kfree(iw_event->private_data); +} + +/* + * Passive Side: CM_ID <-- ESTABLISHED + * + * The provider generated an ESTABLISHED event which means that + * the MPA negotiation has completed successfully and we are now in MPA + * FPDU mode. + * + * This event can only be received in the CONN_RECV state.
If the + * remote peer closed, the ESTABLISHED event would be received followed + * by the CLOSE event. If the app closes, it will block until we wake + * it up after processing this event. + */ +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + + /* We clear the CONNECT_WAIT bit here to allow the callback + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id + * from a callback handler is not allowed */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * Active Side: CM_ID <-- ESTABLISHED + * + * The app has called connect and is waiting for the established event to + * post its requests to the server. This event will wake up anyone + * blocked in iw_cm_disconnect or iw_destroy_cm_id.
+ */ +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + /* Clear the connect wait bit so a callback function calling + * iw_cm_disconnect will not wait and deadlock this thread */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { + cm_id_priv->id.local_addr = iw_event->local_addr; + cm_id_priv->id.remote_addr = iw_event->remote_addr; + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + } else { + /* REJECTED or RESET */ + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + cm_id_priv->state = IW_CM_STATE_IDLE; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + + if (iw_event->private_data_len) + kfree(iw_event->private_data); + + /* Wake up waiters on connect complete */ + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * CM_ID <-- CLOSING + * + * If in the ESTABLISHED state, move to CLOSING. + */ +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) + cm_id_priv->state = IW_CM_STATE_CLOSING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * CM_ID <-- IDLE + * + * If in the ESTABLISHED or CLOSING states, the QP will have been + * moved by the provider to the ERR state. Disassociate the CM_ID from + * the QP, move to IDLE, and remove the 'connected' reference. + * + * If in some other state, the cm_id was destroyed asynchronously. + * This is the last reference that will result in waking up + * the app thread blocked in iw_destroy_cm_id.
+ */ +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_DESTROYING: + break; + default: + BUG(); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} + +static int process_event(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + int ret = 0; + + switch (iw_event->event) { + case IW_CM_EVENT_CONNECT_REQUEST: + cm_conn_req_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CONNECT_REPLY: + ret = cm_conn_rep_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_ESTABLISHED: + ret = cm_conn_est_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_DISCONNECT: + cm_disconnect_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CLOSE: + ret = cm_close_handler(cm_id_priv, iw_event); + break; + default: + BUG(); + } + + return ret; +} + +/* + * Process events on the work_list for the cm_id. If the callback + * function requests that the cm_id be deleted, a flag is set in the + * cm_id flags to indicate that when the last reference is + * removed, the cm_id is to be destroyed. This is necessary to + * distinguish between an object that will be destroyed by the app + * thread asleep on the destroy_comp list vs. an object destroyed + * here synchronously when the last reference is removed. 
+ */ +static void cm_work_handler(void *arg) +{ + struct iwcm_work *work = arg, lwork; + struct iwcm_id_private *cm_id_priv = work->cm_id; + unsigned long flags; + int empty; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + empty = list_empty(&cm_id_priv->work_list); + while (!empty) { + work = list_entry(cm_id_priv->work_list.next, + struct iwcm_work, list); + list_del_init(&work->list); + empty = list_empty(&cm_id_priv->work_list); + lwork = *work; + put_work(work); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Use the local copy: 'work' was returned to the free list + * by put_work() and may be reused once the lock is dropped */ + ret = process_event(cm_id_priv, &lwork.event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(&cm_id_priv->id); + } + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (iwcm_deref_id(cm_id_priv)) + return; + + if (atomic_read(&cm_id_priv->refcount)==0 && + test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)) { + dealloc_work_entries(cm_id_priv); + kfree(cm_id_priv); + return; + } + spin_lock_irqsave(&cm_id_priv->lock, flags); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * This function is called in interrupt context. Schedule events on + * the iwcm_wq thread to allow callback functions to downcall into + * the CM and/or block. Events are queued to a per-CM_ID + * work_list. If this is the first event on the work_list, the work + * element is also queued on the iwcm_wq thread. + * + * Each event holds a reference on the cm_id. Until the last posted + * event has been delivered and processed, the cm_id cannot be + * deleted. + * + * Returns: + * 0 - the event was handled. + * -ENOMEM - the event was not handled due to lack of resources. 
+ */ +static int cm_event_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct iwcm_work *work; + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + work = get_work(cm_id_priv); + if (!work) { + ret = -ENOMEM; + goto out; + } + + INIT_WORK(&work->work, cm_work_handler, work); + work->cm_id = cm_id_priv; + work->event = *iw_event; + + if ((work->event.event == IW_CM_EVENT_CONNECT_REQUEST || + work->event.event == IW_CM_EVENT_CONNECT_REPLY) && + work->event.private_data_len) { + ret = copy_private_data(cm_id_priv, &work->event); + if (ret) { + put_work(work); + goto out; + } + } + + atomic_inc(&cm_id_priv->refcount); + if (list_empty(&cm_id_priv->work_list)) { + list_add_tail(&work->list, &cm_id_priv->work_list); + queue_work(iwcm_wq, &work->work); + } else + list_add_tail(&work->list, &cm_id_priv->work_list); +out: + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_init_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS; + qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE| + IB_ACCESS_REMOTE_READ; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_rts_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + 
case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = 0; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct iwcm_id_private *cm_id_priv; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + switch (qp_attr->qp_state) { + case IB_QPS_INIT: + case IB_QPS_RTR: + ret = iwcm_init_qp_init_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + case IB_QPS_RTS: + ret = iwcm_init_qp_rts_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + default: + ret = -EINVAL; + break; + } + return ret; +} +EXPORT_SYMBOL(iw_cm_init_qp_attr); + +static int __init iw_cm_init(void) +{ + iwcm_wq = create_singlethread_workqueue("iw_cm_wq"); + if (!iwcm_wq) + return -ENOMEM; + + return 0; +} + +static void __exit iw_cm_cleanup(void) +{ + destroy_workqueue(iwcm_wq); +} + +module_init(iw_cm_init); +module_exit(iw_cm_cleanup); diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h new file mode 100644 index 0000000..36f44aa --- /dev/null +++ b/include/rdma/iw_cm.h @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef IW_CM_H +#define IW_CM_H + +#include <linux/in.h> +#include <rdma/ib_cm.h> + +struct iw_cm_id; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, /* passive side accept successful */ + IW_CM_EVENT_DISCONNECT, /* orderly shutdown */ + IW_CM_EVENT_CLOSE /* close complete */ +}; +enum iw_cm_event_status { + IW_CM_EVENT_STATUS_OK = 0, /* request successful */ + IW_CM_EVENT_STATUS_ACCEPTED = 0, /* connect request accepted */ + IW_CM_EVENT_STATUS_REJECTED, /* connect request rejected */ + IW_CM_EVENT_STATUS_TIMEOUT, /* the operation timed out */ + IW_CM_EVENT_STATUS_RESET, /* reset from remote peer */ + IW_CM_EVENT_STATUS_EINVAL, /* asynchronous failure for bad parm */ +}; +struct iw_cm_event { + enum iw_cm_event_type event; + enum iw_cm_event_status status; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; + void *provider_data; +}; + +/** + * iw_cm_handler - Function to be called by the IW CM when delivering events + * to the client. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. 
+ */ +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +/** + * iw_event_handler - Function called by the provider when delivering provider + * events to the IW CM. Returns either 0 indicating the event was processed + * or -errno if the event could not be processed. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. + */ +typedef int (*iw_event_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* client cb context */ + struct ib_device *device; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *provider_data; /* provider private data */ + iw_event_handler event_handler; /* cb for provider + events */ + /* Used by provider to add and remove refs on IW cm_id */ + void (*add_ref)(struct iw_cm_id *); + void (*rem_ref)(struct iw_cm_id *); +}; + +struct iw_cm_conn_param { + const void *private_data; + u16 private_data_len; + u32 ord; + u32 ird; + u32 qpn; +}; + +struct iw_cm_verbs { + void (*add_ref)(struct ib_qp *qp); + + void (*rem_ref)(struct ib_qp *qp); + + struct ib_qp * (*get_qp)(struct ib_device *device, + int qpn); + + int (*connect)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*accept)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*reject)(struct iw_cm_id *cm_id, + const void *pdata, u8 pdata_len); + + int (*create_listen)(struct iw_cm_id *cm_id, + int backlog); + + int (*destroy_listen)(struct iw_cm_id *cm_id); +}; + +/** + * iw_create_cm_id - Create an IW CM identifier. + * + * @device: The IB device on which to create the IW CM identifier. + * @cm_handler: User callback invoked to report events associated with the + * returned IW CM identifier. + * @context: User specified context associated with the id. 
+ */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, void *context); + +/** + * iw_destroy_cm_id - Destroy an IW CM identifier. + * + * @cm_id: The previously created IW CM identifier to destroy. + * + * The client can assume that no events will be delivered for the CM ID after + * this function returns. + */ +void iw_destroy_cm_id(struct iw_cm_id *cm_id); + +/** + * iw_cm_unbind_qp - Unbind the specified IW CM identifier and QP + * + * @cm_id: The IW CM identifier to unbind from the QP. + * @qp: The QP + * + * This is called by the provider when destroying the QP to ensure + * that any references held by the IWCM are released. It may also + * be called by the IWCM when destroying a CM_ID so that any + * references held by the provider are released. + */ +void iw_cm_unbind_qp(struct iw_cm_id *cm_id, struct ib_qp *qp); + +/** + * iw_cm_get_qp - Return the ib_qp associated with a QPN + * + * @device: The IB device + * @qpn: The queue pair number + */ +struct ib_qp *iw_cm_get_qp(struct ib_device *device, int qpn); + +/** + * iw_cm_listen - Listen for incoming connection requests on the + * specified IW CM id. + * + * @cm_id: The IW CM identifier. + * @backlog: The maximum number of outstanding un-accepted inbound listen + * requests to queue. + * + * The source address and port number are specified in the IW CM identifier + * structure. + */ +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); + +/** + * iw_cm_accept - Called to accept an incoming connect request. + * + * @cm_id: The IW CM identifier associated with the connection request. + * @iw_param: Pointer to a structure containing connection establishment + * parameters. + * + * The specified cm_id will have been provided in the event data for a + * CONNECT_REQUEST event. Subsequent events related to this connection will be + * delivered to the specified IW CM identifier and may occur prior to + * the return of this function. 
If this function returns a non-zero value, the + * client can assume that no events will be delivered to the specified IW CM + * identifier. + */ +int iw_cm_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param); + +/** + * iw_cm_reject - Reject an incoming connection request. + * + * @cm_id: Connection identifier associated with the request. + * @private_data: Pointer to data to deliver to the remote peer as part of the + * reject message. + * @private_data_len: The number of bytes in the private_data parameter. + * + * The client can assume that no events will be delivered to the specified IW + * CM identifier following the return of this function. The private_data + * buffer is available for reuse when this function returns. + */ +int iw_cm_reject(struct iw_cm_id *cm_id, const void *private_data, + u8 private_data_len); + +/** + * iw_cm_connect - Called to request a connection to a remote peer. + * + * @cm_id: The IW CM identifier for the connection. + * @iw_param: Pointer to a structure containing connection establishment + * parameters. + * + * Events may be delivered to the specified IW CM identifier prior to the + * return of this function. If this function returns a non-zero value, the + * client can assume that no events will be delivered to the specified IW CM + * identifier. + */ +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param); + +/** + * iw_cm_disconnect - Close the specified connection. + * + * @cm_id: The IW CM identifier to close. + * @abrupt: If 0, the connection will be closed gracefully, otherwise, the + * connection will be reset. + * + * The IW CM identifier is still active until the IW_CM_EVENT_CLOSE event is + * delivered. + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt); + +/** + * iw_cm_init_qp_attr - Called to initialize the attributes of the QP + * associated with an IW CM identifier. 
+ * + * @cm_id: The IW CM identifier associated with the QP + * @qp_attr: Pointer to the QP attributes structure. + * @qp_attr_mask: Pointer to a bit vector specifying which QP attributes are + * valid. + */ +int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, struct ib_qp_attr *qp_attr, + int *qp_attr_mask); + +#endif /* IW_CM_H */ diff --git a/include/rdma/iw_cm_private.h b/include/rdma/iw_cm_private.h new file mode 100644 index 0000000..fc28e34 --- /dev/null +++ b/include/rdma/iw_cm_private.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef IW_CM_PRIVATE_H +#define IW_CM_PRIVATE_H + +#include <rdma/iw_cm.h> + +enum iw_cm_state { + IW_CM_STATE_IDLE, /* unbound, inactive */ + IW_CM_STATE_LISTEN, /* listen waiting for connect */ + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ + IW_CM_STATE_ESTABLISHED, /* established */ + IW_CM_STATE_CLOSING, /* disconnect */ + IW_CM_STATE_DESTROYING /* object being deleted */ +}; + +struct iwcm_id_private { + struct iw_cm_id id; + enum iw_cm_state state; + unsigned long flags; + struct ib_qp *qp; + struct completion destroy_comp; + wait_queue_head_t connect_wait; + struct list_head work_list; + spinlock_t lock; + atomic_t refcount; + struct list_head work_free_list; +}; +#define IWCM_F_CALLBACK_DESTROY 1 +#define IWCM_F_CONNECT_WAIT 2 + +#endif /* IW_CM_PRIVATE_H */ From swise at opengridcomputing.com Tue Jun 20 13:24:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:24:52 -0500 Subject: [openib-general] [PATCH v3 2/2] iWARP Core Changes. In-Reply-To: <20060620202442.28922.27402.stgit@stevo-desktop> References: <20060620202442.28922.27402.stgit@stevo-desktop> Message-ID: <20060620202452.28922.39114.stgit@stevo-desktop> This patch contains modifications to the existing rdma header files, core files, drivers, and ulp files to support iWARP. V2 Review updates: V1 Review updates: - copy_addr() -> rdma_copy_addr() - dst_dev_addr param in rdma_copy_addr to const. - various spacing nits with recasting - include linux/inetdevice.h to get ip_dev_find() prototype. 
- dev_put() after successful ip_dev_find() --- drivers/infiniband/core/Makefile | 4 drivers/infiniband/core/addr.c | 19 + drivers/infiniband/core/cache.c | 8 - drivers/infiniband/core/cm.c | 3 drivers/infiniband/core/cma.c | 355 +++++++++++++++++++++++--- drivers/infiniband/core/device.c | 6 drivers/infiniband/core/mad.c | 11 + drivers/infiniband/core/sa_query.c | 5 drivers/infiniband/core/smi.c | 18 + drivers/infiniband/core/sysfs.c | 18 + drivers/infiniband/core/ucm.c | 5 drivers/infiniband/core/user_mad.c | 9 - drivers/infiniband/hw/ipath/ipath_verbs.c | 2 drivers/infiniband/hw/mthca/mthca_provider.c | 2 drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 + drivers/infiniband/ulp/srp/ib_srp.c | 2 include/rdma/ib_addr.h | 15 + include/rdma/ib_verbs.h | 39 ++- 18 files changed, 437 insertions(+), 92 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 68e73ec..163d991 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -1,7 +1,7 @@ infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o $(infiniband-y) + ib_cm.o iw_cm.o $(infiniband-y) obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o @@ -14,6 +14,8 @@ ib_sa-y := sa_query.o ib_cm-y := cm.o +iw_cm-y := iwcm.o + rdma_cm-y := cma.o ib_addr-y := addr.o diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index d294bbc..83f84ef 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -32,6 +32,7 @@ #include #include #include #include +#include <linux/inetdevice.h> #include #include #include @@ -60,12 +61,15 @@ static LIST_HEAD(req_list); static DECLARE_WORK(work, process_req, NULL); static struct workqueue_struct *addr_wq; -static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, - unsigned char *dst_dev_addr) +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, 
struct net_device *dev, + const unsigned char *dst_dev_addr) { switch (dev->type) { case ARPHRD_INFINIBAND: - dev_addr->dev_type = IB_NODE_CA; + dev_addr->dev_type = RDMA_NODE_IB_CA; + break; + case ARPHRD_ETHER: + dev_addr->dev_type = RDMA_NODE_RNIC; break; default: return -EADDRNOTAVAIL; @@ -77,6 +81,7 @@ static int copy_addr(struct rdma_dev_add memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); return 0; } +EXPORT_SYMBOL(rdma_copy_addr); int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) { @@ -88,7 +93,7 @@ int rdma_translate_ip(struct sockaddr *a if (!dev) return -EADDRNOTAVAIL; - ret = copy_addr(dev_addr, dev, NULL); + ret = rdma_copy_addr(dev_addr, dev, NULL); dev_put(dev); return ret; } @@ -160,7 +165,7 @@ static int addr_resolve_remote(struct so /* If the device does ARP internally, return 'done' */ if (rt->idev->dev->flags & IFF_NOARP) { - copy_addr(addr, rt->idev->dev, NULL); + rdma_copy_addr(addr, rt->idev->dev, NULL); goto put; } @@ -180,7 +185,7 @@ static int addr_resolve_remote(struct so src_in->sin_addr.s_addr = rt->rt_src; } - ret = copy_addr(addr, neigh->dev, neigh->ha); + ret = rdma_copy_addr(addr, neigh->dev, neigh->ha); release: neigh_release(neigh); put: @@ -244,7 +249,7 @@ static int addr_resolve_local(struct soc if (ZERONET(src_ip)) { src_in->sin_family = dst_in->sin_family; src_in->sin_addr.s_addr = dst_ip; - ret = copy_addr(addr, dev, dev->dev_addr); + ret = rdma_copy_addr(addr, dev, dev->dev_addr); } else if (LOOPBACK(src_ip)) { ret = rdma_translate_ip((struct sockaddr *)dst_in, addr); if (!ret) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index e05ca2c..061858c 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -32,13 +32,12 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: cache.c 6885 2006-05-03 18:22:02Z sean.hefty $ */ #include #include #include -#include /* INIT_WORK, schedule_work(), flush_scheduled_work() */ #include @@ -62,12 +61,13 @@ struct ib_update_work { static inline int start_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : 1; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; } static inline int end_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; } int ib_get_cached_gid(struct ib_device *device, diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 450adfe..070dda9 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -3244,6 +3244,9 @@ static void cm_add_one(struct ib_device int ret; u8 i; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index a76834e..52a74f5 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -35,6 +35,7 @@ #include #include #include #include +#include #include @@ -43,6 +44,7 @@ #include #include #include #include +#include MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); @@ -124,6 +126,7 @@ struct rdma_id_private { int query_id; union { struct ib_cm_id *ib; + struct iw_cm_id *iw; } cm_id; u32 seq_num; @@ -259,13 +262,23 @@ static void cma_detach_from_dev(struct r id_priv->cma_dev = NULL; } -static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) +static int cma_acquire_dev(struct rdma_id_private *id_priv) { + enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type; struct cma_device *cma_dev; union ib_gid *gid; int ret = 
-ENODEV; - gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + switch (rdma_node_get_transport(dev_type)) { + case RDMA_TRANSPORT_IB: + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + break; + case RDMA_TRANSPORT_IWARP: + gid = iw_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + break; + default: + return -ENODEV; + } mutex_lock(&lock); list_for_each_entry(cma_dev, &dev_list, list) { @@ -280,16 +293,6 @@ static int cma_acquire_ib_dev(struct rdm return ret; } -static int cma_acquire_dev(struct rdma_id_private *id_priv) -{ - switch (id_priv->id.route.addr.dev_addr.dev_type) { - case IB_NODE_CA: - return cma_acquire_ib_dev(id_priv); - default: - return -ENODEV; - } -} - static void cma_deref_id(struct rdma_id_private *id_priv) { if (atomic_dec_and_test(&id_priv->refcount)) @@ -347,6 +350,16 @@ static int cma_init_ib_qp(struct rdma_id IB_QP_PKEY_INDEX | IB_QP_PORT); } +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); +} + int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, struct ib_qp_init_attr *qp_init_attr) { @@ -362,10 +375,13 @@ int rdma_create_qp(struct rdma_cm_id *id if (IS_ERR(qp)) return PTR_ERR(qp); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_init_ib_qp(id_priv, qp); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_init_iw_qp(id_priv, qp); + break; default: ret = -ENOSYS; break; @@ -451,13 +467,17 @@ int rdma_init_qp_attr(struct rdma_cm_id int ret; id_priv = container_of(id, struct rdma_id_private, id); - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, 
qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) qp_attr->rq_psn = id_priv->seq_num; break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr, + qp_attr_mask); + break; default: ret = -ENOSYS; break; @@ -590,8 +610,8 @@ static int cma_notify_user(struct rdma_i static void cma_cancel_route(struct rdma_id_private *id_priv) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->query) ib_sa_cancel_query(id_priv->query_id, id_priv->query); break; @@ -611,11 +631,15 @@ static void cma_destroy_listen(struct rd cma_exch(id_priv, CMA_DESTROYING); if (id_priv->cma_dev) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; + case RDMA_TRANSPORT_IWARP: + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) + iw_destroy_cm_id(id_priv->cm_id.iw); + break; default: break; } @@ -690,11 +714,15 @@ void rdma_destroy_id(struct rdma_cm_id * cma_cancel_operation(id_priv, state); if (id_priv->cma_dev) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; + case RDMA_TRANSPORT_IWARP: + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) + iw_destroy_cm_id(id_priv->cm_id.iw); + break; default: break; } @@ -868,7 +896,7 @@ static struct rdma_id_private *cma_new_i ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - rt->addr.dev_addr.dev_type = IB_NODE_CA; + rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA; id_priv = 
container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; @@ -897,7 +925,7 @@ static int cma_req_handler(struct ib_cm_ } atomic_inc(&conn_id->dev_remove); - ret = cma_acquire_ib_dev(conn_id); + ret = cma_acquire_dev(conn_id); if (ret) { ret = -ENODEV; cma_release_remove(conn_id); @@ -981,6 +1009,125 @@ static void cma_set_compare_data(enum rd } } +static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) +{ + struct rdma_id_private *id_priv = iw_id->context; + enum rdma_cm_event_type event = 0; + struct sockaddr_in *sin; + int ret = 0; + + atomic_inc(&id_priv->dev_remove); + + switch (iw_event->event) { + case IW_CM_EVENT_CLOSE: + event = RDMA_CM_EVENT_DISCONNECTED; + break; + case IW_CM_EVENT_CONNECT_REPLY: + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; + *sin = iw_event->remote_addr; + if (iw_event->status) + event = RDMA_CM_EVENT_REJECTED; + else + event = RDMA_CM_EVENT_ESTABLISHED; + break; + case IW_CM_EVENT_ESTABLISHED: + event = RDMA_CM_EVENT_ESTABLISHED; + break; + default: + BUG_ON(1); + } + + ret = cma_notify_user(id_priv, event, iw_event->status, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + id_priv->cm_id.iw = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } + + cma_release_remove(id_priv); + return ret; +} + +static int iw_conn_req_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct rdma_cm_id *new_cm_id; + struct rdma_id_private *listen_id, *conn_id; + struct sockaddr_in *sin; + struct net_device *dev = NULL; + int ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + /* Create a new RDMA id for the new IW CM ID */ + new_cm_id = rdma_create_id(listen_id->id.event_handler, + listen_id->id.context, + RDMA_PS_TCP); + if (!new_cm_id) { + ret = -ENOMEM; + goto out; + } + conn_id = container_of(new_cm_id, struct rdma_id_private, id); + atomic_inc(&conn_id->dev_remove); + conn_id->state = CMA_CONNECT; + + dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); + if (!dev) { + ret = -EADDRNOTAVAIL; + rdma_destroy_id(new_cm_id); + goto out; + } + ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); + if (ret) { + rdma_destroy_id(new_cm_id); + goto out; + } + + ret = cma_acquire_dev(conn_id); + if (ret) { + rdma_destroy_id(new_cm_id); + goto out; + } + + conn_id->cm_id.iw = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_iw_handler; + + sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr; + *sin = iw_event->remote_addr; + + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* User wants to destroy the CM ID */ + conn_id->cm_id.iw = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } + +out: + if (dev) + dev_put(dev); + cma_release_remove(listen_id); + return ret; +} + +static int 
cma_ib_listen(struct rdma_id_private *id_priv) { struct ib_cm_compare_data compare_data; @@ -1010,6 +1157,30 @@ static int cma_ib_listen(struct rdma_id_ return ret; } +static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) +{ + int ret; + struct sockaddr_in *sin; + + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, + iw_conn_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id.iw)) + return PTR_ERR(id_priv->cm_id.iw); + + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + id_priv->cm_id.iw->local_addr = *sin; + + ret = iw_cm_listen(id_priv->cm_id.iw, backlog); + + if (ret) { + iw_destroy_cm_id(id_priv->cm_id.iw); + id_priv->cm_id.iw = NULL; + } + + return ret; +} + static int cma_listen_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) { @@ -1086,12 +1257,17 @@ int rdma_listen(struct rdma_cm_id *id, i id_priv->backlog = backlog; if (id->device) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_ib_listen(id_priv); if (ret) goto err; break; + case RDMA_TRANSPORT_IWARP: + ret = cma_iw_listen(id_priv, backlog); + if (ret) + goto err; + break; default: ret = -ENOSYS; goto err; @@ -1230,6 +1406,23 @@ err: } EXPORT_SYMBOL(rdma_set_ib_paths); +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct cma_work *work; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler, work); + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ROUTE_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; + queue_work(cma_wq, &work->work); + return 0; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1240,10 +1433,13 @@ int rdma_resolve_route(struct rdma_cm_id return -EINVAL; atomic_inc(&id_priv->refcount); - switch (id->device->node_type) { - case 
IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_resolve_iw_route(id_priv, timeout_ms); + break; default: ret = -ENOSYS; break; @@ -1355,8 +1551,8 @@ static int cma_resolve_loopback(struct r ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr)); if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { - src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; - dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr; + src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; src_in->sin_family = dst_in->sin_family; src_in->sin_addr.s_addr = dst_in->sin_addr.s_addr; } @@ -1647,6 +1843,47 @@ out: return ret; } +static int cma_connect_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_id *cm_id; + struct sockaddr_in* sin; + int ret; + struct iw_cm_conn_param iw_param; + + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); + if (IS_ERR(cm_id)) { + ret = PTR_ERR(cm_id); + goto out; + } + + id_priv->cm_id.iw = cm_id; + + sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr; + cm_id->local_addr = *sin; + + sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr; + cm_id->remote_addr = *sin; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) { + iw_destroy_cm_id(cm_id); + return ret; + } + + iw_param.ord = conn_param->initiator_depth; + iw_param.ird = conn_param->responder_resources; + iw_param.private_data = conn_param->private_data; + iw_param.private_data_len = conn_param->private_data_len; + if (id_priv->id.qp) + iw_param.qpn = id_priv->qp_num; + else + iw_param.qpn = conn_param->qp_num; + ret = iw_cm_connect(cm_id, &iw_param); +out: + return ret; +} + int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ 
-1662,10 +1899,13 @@ int rdma_connect(struct rdma_cm_id *id, id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_connect_ib(id_priv, conn_param); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_connect_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1706,6 +1946,28 @@ static int cma_accept_ib(struct rdma_id_ return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } +static int cma_accept_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_conn_param iw_param; + int ret; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) + return ret; + + iw_param.ord = conn_param->initiator_depth; + iw_param.ird = conn_param->responder_resources; + iw_param.private_data = conn_param->private_data; + iw_param.private_data_len = conn_param->private_data_len; + if (id_priv->id.qp) { + iw_param.qpn = id_priv->qp_num; + } else + iw_param.qpn = conn_param->qp_num; + + return iw_cm_accept(id_priv->cm_id.iw, &iw_param); +} + int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1721,13 +1983,16 @@ int rdma_accept(struct rdma_cm_id *id, s id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (conn_param) ret = cma_accept_ib(id_priv, conn_param); else ret = cma_rep_recv(id_priv); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_accept_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1754,12 +2019,16 @@ int rdma_reject(struct rdma_cm_id *id, c if (!cma_comp(id_priv, CMA_CONNECT)) return -EINVAL; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, 
private_data, private_data_len); break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_reject(id_priv->cm_id.iw, + private_data, private_data_len); + break; default: ret = -ENOSYS; break; @@ -1778,16 +2047,18 @@ int rdma_disconnect(struct rdma_cm_id *i !cma_comp(id_priv, CMA_DISCONNECT)) return -EINVAL; - ret = cma_modify_qp_err(id); - if (ret) - goto out; - - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: + ret = cma_modify_qp_err(id); + if (ret) + goto out; /* Initiate or respond to a disconnect. */ if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_disconnect(id_priv->cm_id.iw, 0); + break; default: break; } diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index b2f3cb9..7318fba 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: device.c 5943 2006-03-22 00:58:04Z roland $ */ #include @@ -505,7 +505,7 @@ int ib_query_port(struct ib_device *devi u8 port_num, struct ib_port_attr *port_attr) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) @@ -580,7 +580,7 @@ int ib_modify_port(struct ib_device *dev u8 port_num, int port_modify_mask, struct ib_port_modify *port_modify) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index b38e02a..a928ecf 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. * @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: mad.c 7294 2006-05-17 18:12:30Z roland $ */ #include #include @@ -2877,7 +2877,10 @@ static void ib_mad_init_device(struct ib { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) { start = 0; end = 0; } else { @@ -2924,7 +2927,7 @@ static void ib_mad_remove_device(struct { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { num_ports = 1; cur_port = 0; } else { diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index e911c99..12a9425 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -918,7 +918,10 @@ static void ib_sa_add_one(struct ib_devi struct ib_sa_device *sa_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c index 35852e7..b81b2b9 100644 --- a/drivers/infiniband/core/smi.c +++ b/drivers/infiniband/core/smi.c @@ -34,7 +34,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ + * $Id: smi.c 5258 2006-02-01 20:32:40Z sean.hefty $ */ #include @@ -64,7 +64,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-9:2 */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->return_path set when received */ @@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == hop_cnt) { /* smp->return_path set when received */ smp->hop_ptr++; - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->hop_ptr--; @@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == 1) { smp->hop_ptr--; /* C14-13:3 -- SMPs destined for SM shouldn't be here */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_slid == IB_LID_PERMISSIVE); } @@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->return_path[hop_ptr] = port_num; @@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp smp->return_path[hop_ptr] = port_num; /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -175,7 +175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->hop_ptr updated when sending */ @@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp return 1; } /* smp->hop_ptr updated when sending */ - return 
(node_type == IB_NODE_SWITCH); + return (node_type == RDMA_NODE_IB_SWITCH); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 21f9282..cfd2c06 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: sysfs.c 6940 2006-05-04 17:04:55Z roland $ */ #include "core_priv.h" @@ -589,10 +589,16 @@ static ssize_t show_node_type(struct cla return -ENODEV; switch (dev->node_type) { - case IB_NODE_CA: return sprintf(buf, "%d: CA\n", dev->node_type); - case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type); - case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type); - default: return sprintf(buf, "%d: \n", dev->node_type); + case RDMA_NODE_IB_CA: + return sprintf(buf, "%d: CA\n", dev->node_type); + case RDMA_NODE_RNIC: + return sprintf(buf, "%d: RNIC\n", dev->node_type); + case RDMA_NODE_IB_SWITCH: + return sprintf(buf, "%d: switch\n", dev->node_type); + case RDMA_NODE_IB_ROUTER: + return sprintf(buf, "%d: router\n", dev->node_type); + default: + return sprintf(buf, "%d: \n", dev->node_type); } } @@ -708,7 +714,7 @@ int ib_device_register_sysfs(struct ib_d if (ret) goto err_put; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { ret = add_port(device, 0); if (ret) goto err_put; diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index c1c6fda..936afc8 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ + * $Id: ucm.c 7119 2006-05-11 16:40:38Z sean.hefty $ */ #include @@ -1247,7 +1247,8 @@ static void ib_ucm_add_one(struct ib_dev { struct ib_ucm_device *ucm_dev; - if (!device->alloc_ucontext) + if (!device->alloc_ucontext || + rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index afe70a5..0cbd692 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: user_mad.c 6041 2006-03-27 21:06:00Z halr $ */ #include @@ -967,7 +967,10 @@ static void ib_umad_add_one(struct ib_de struct ib_umad_device *umad_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 28fdbda..e4b45d7 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -984,7 +984,7 @@ static void *ipath_register_ib_device(in (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); - dev->node_type = IB_NODE_CA; + dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; dev->dma_device = ipath_layer_get_device(dd); dev->class_dev.dev = dev->dma_device; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 230ae21..2103ee8 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1292,7 +1292,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); - dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 1c6ea1c..262427f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1084,13 +1084,16 @@ static void ipoib_add_one(struct ib_devi struct ipoib_dev_priv *priv; int s, e, p; + if 
(rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { @@ -1114,6 +1117,9 @@ static void ipoib_remove_one(struct ib_d struct ipoib_dev_priv *priv, *tmp; struct list_head *dev_list; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 4e22afe..37ea240 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1879,7 +1879,7 @@ static void srp_add_one(struct ib_device if (IS_ERR(srp_dev->fmr_pool)) srp_dev->fmr_pool = NULL; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index fcb5ba8..d95d3eb 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -40,7 +40,7 @@ struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; - enum ib_node_type dev_type; + enum rdma_node_type dev_type; }; /** @@ -72,6 +72,9 @@ int rdma_resolve_ip(struct sockaddr *src void rdma_addr_cancel(struct rdma_dev_addr *addr); +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + const unsigned char *dst_dev_addr); + static inline int ip_addr_size(struct sockaddr *addr) { return addr->sa_family == AF_INET6 ? 
@@ -111,4 +114,14 @@ static inline void ib_addr_set_dgid(stru memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); } +static inline union ib_gid* iw_addr_get_sgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->src_dev_addr; +} + +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->dst_dev_addr; +} + #endif /* IB_ADDR_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index ee1f3a3..4b4c30a 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + * $Id: ib_verbs.h 6885 2006-05-03 18:22:02Z sean.hefty $ */ #if !defined(IB_VERBS_H) @@ -56,12 +56,35 @@ union ib_gid { } global; }; -enum ib_node_type { - IB_NODE_CA = 1, - IB_NODE_SWITCH, - IB_NODE_ROUTER +enum rdma_node_type { + /* IB values map to NodeInfo:NodeType. */ + RDMA_NODE_IB_CA = 1, + RDMA_NODE_IB_SWITCH, + RDMA_NODE_IB_ROUTER, + RDMA_NODE_RNIC }; +enum rdma_transport_type { + RDMA_TRANSPORT_IB, + RDMA_TRANSPORT_IWARP +}; + +static inline enum rdma_transport_type +rdma_node_get_transport(enum rdma_node_type node_type) +{ + switch (node_type) { + case RDMA_NODE_IB_CA: + case RDMA_NODE_IB_SWITCH: + case RDMA_NODE_IB_ROUTER: + return RDMA_TRANSPORT_IB; + case RDMA_NODE_RNIC: + return RDMA_TRANSPORT_IWARP; + default: + BUG(); + return 0; + } +} + enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), @@ -78,6 +101,9 @@ enum ib_device_cap_flags { IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_ZERO_STAG = (1<<15), + IB_DEVICE_SEND_W_INV = (1<<16), + IB_DEVICE_MEM_WINDOW = (1<<17) }; enum ib_atomic_cap { @@ -835,6 +861,7 @@ struct ib_cache { u8 *lmc_cache; }; +struct iw_cm_verbs; struct ib_device { struct device *dma_device; @@ -851,6 +878,8 @@ struct ib_device { u32 
flags; + struct iw_cm_verbs *iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device, From swise at opengridcomputing.com Tue Jun 20 13:30:50 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:30:50 -0500 Subject: [openib-general] [PATCH v3 0/7][RFC] Ammasso 1100 iWARP Driver Message-ID: <20060620203050.31536.5341.stgit@stevo-desktop> This patchset implements the iWARP provider driver for the Ammasso 1100 RNIC. It is dependent on the "iWARP Core Support" patch set. We're submitting it for review with the goal for inclusion in the 2.6.19 kernel. This code has gone through several reviews in the openib-general list. Now we are submitting it for external review by the linux community. This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.19 branch. The patchset consists of 7 patches: 1 - Low-level device interface and native stack support 2 - Work request definitions 3 - Provider interface 4 - Memory management 5 - User mode message queue implementation 6 - Verbs queue implementation 7 - Kconfig and Makefile I believe I've addressed all the round 1 and 2 review comments. Details of the changes are tracked in each patch comment. Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Tue Jun 20 13:31:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:00 -0500 Subject: [openib-general] [PATCH v3 2/7] AMSO1100 WR / Event Definitions. 
In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203100.31536.50860.stgit@stevo-desktop> Review Changes: - C2_DEBUG -> DEBUG - removed useless comments --- drivers/infiniband/hw/amso1100/c2_ae.h | 108 ++ drivers/infiniband/hw/amso1100/c2_status.h | 158 +++ drivers/infiniband/hw/amso1100/c2_wr.h | 1520 ++++++++++++++++++++++++++++ 3 files changed, 1786 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_ae.h b/drivers/infiniband/hw/amso1100/c2_ae.h new file mode 100644 index 0000000..3a065c3 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.h @@ -0,0 +1,108 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_AE_H_ +#define _C2_AE_H_ + +/* + * WARNING: If you change this file, also bump C2_IVN_BASE + * in common/include/clustercore/c2_ivn.h. + */ + +/* + * Asynchronous Event Identifiers + * + * These start at 0x80 only so it's obvious from inspection that + * they are not work-request statuses. This isn't critical. + * + * NOTE: these event id's must fit in eight bits. + */ +enum c2_event_id { + CCAE_REMOTE_SHUTDOWN = 0x80, + CCAE_ACTIVE_CONNECT_RESULTS, + CCAE_CONNECTION_REQUEST, + CCAE_LLP_CLOSE_COMPLETE, + CCAE_TERMINATE_MESSAGE_RECEIVED, + CCAE_LLP_CONNECTION_RESET, + CCAE_LLP_CONNECTION_LOST, + CCAE_LLP_SEGMENT_SIZE_INVALID, + CCAE_LLP_INVALID_CRC, + CCAE_LLP_BAD_FPDU, + CCAE_INVALID_DDP_VERSION, + CCAE_INVALID_RDMA_VERSION, + CCAE_UNEXPECTED_OPCODE, + CCAE_INVALID_DDP_QUEUE_NUMBER, + CCAE_RDMA_READ_NOT_ENABLED, + CCAE_RDMA_WRITE_NOT_ENABLED, + CCAE_RDMA_READ_TOO_SMALL, + CCAE_NO_L_BIT, + CCAE_TAGGED_INVALID_STAG, + CCAE_TAGGED_BASE_BOUNDS_VIOLATION, + CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION, + CCAE_TAGGED_INVALID_PD, + CCAE_WRAP_ERROR, + CCAE_BAD_CLOSE, + CCAE_BAD_LLP_CLOSE, + CCAE_INVALID_MSN_RANGE, + CCAE_INVALID_MSN_GAP, + CCAE_IRRQ_OVERFLOW, + CCAE_IRRQ_MSN_GAP, + CCAE_IRRQ_MSN_RANGE, + CCAE_IRRQ_INVALID_STAG, + CCAE_IRRQ_BASE_BOUNDS_VIOLATION, + CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION, + CCAE_IRRQ_INVALID_PD, + CCAE_IRRQ_WRAP_ERROR, + CCAE_CQ_SQ_COMPLETION_OVERFLOW, + CCAE_CQ_RQ_COMPLETION_ERROR, + CCAE_QP_SRQ_WQE_ERROR, + CCAE_QP_LOCAL_CATASTROPHIC_ERROR, + CCAE_CQ_OVERFLOW, + CCAE_CQ_OPERATION_ERROR, + CCAE_SRQ_LIMIT_REACHED, + CCAE_QP_RQ_LIMIT_REACHED, + CCAE_SRQ_CATASTROPHIC_ERROR, + CCAE_RNIC_CATASTROPHIC_ERROR +/* WARNING If you add more id's, make sure their values fit in 
eight bits. */ +}; + +/* + * Resource Indicators and Identifiers + */ +enum c2_resource_indicator { + C2_RES_IND_QP = 1, + C2_RES_IND_EP, + C2_RES_IND_CQ, + C2_RES_IND_SRQ, +}; + +#endif /* _C2_AE_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_status.h b/drivers/infiniband/hw/amso1100/c2_status.h new file mode 100644 index 0000000..6ee4aa9 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_status.h @@ -0,0 +1,158 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef _C2_STATUS_H_ +#define _C2_STATUS_H_ + +/* + * Verbs Status Codes + */ +enum c2_status { + C2_OK = 0, /* This must be zero */ + CCERR_INSUFFICIENT_RESOURCES = 1, + CCERR_INVALID_MODIFIER = 2, + CCERR_INVALID_MODE = 3, + CCERR_IN_USE = 4, + CCERR_INVALID_RNIC = 5, + CCERR_INTERRUPTED_OPERATION = 6, + CCERR_INVALID_EH = 7, + CCERR_INVALID_CQ = 8, + CCERR_CQ_EMPTY = 9, + CCERR_NOT_IMPLEMENTED = 10, + CCERR_CQ_DEPTH_TOO_SMALL = 11, + CCERR_PD_IN_USE = 12, + CCERR_INVALID_PD = 13, + CCERR_INVALID_SRQ = 14, + CCERR_INVALID_ADDRESS = 15, + CCERR_INVALID_NETMASK = 16, + CCERR_INVALID_QP = 17, + CCERR_INVALID_QP_STATE = 18, + CCERR_TOO_MANY_WRS_POSTED = 19, + CCERR_INVALID_WR_TYPE = 20, + CCERR_INVALID_SGL_LENGTH = 21, + CCERR_INVALID_SQ_DEPTH = 22, + CCERR_INVALID_RQ_DEPTH = 23, + CCERR_INVALID_ORD = 24, + CCERR_INVALID_IRD = 25, + CCERR_QP_ATTR_CANNOT_CHANGE = 26, + CCERR_INVALID_STAG = 27, + CCERR_QP_IN_USE = 28, + CCERR_OUTSTANDING_WRS = 29, + CCERR_STAG_IN_USE = 30, + CCERR_INVALID_STAG_INDEX = 31, + CCERR_INVALID_SGL_FORMAT = 32, + CCERR_ADAPTER_TIMEOUT = 33, + CCERR_INVALID_CQ_DEPTH = 34, + CCERR_INVALID_PRIVATE_DATA_LENGTH = 35, + CCERR_INVALID_EP = 36, + CCERR_MR_IN_USE = CCERR_STAG_IN_USE, + CCERR_FLUSHED = 38, + CCERR_INVALID_WQE = 39, + CCERR_LOCAL_QP_CATASTROPHIC_ERROR = 40, + CCERR_REMOTE_TERMINATION_ERROR = 41, + CCERR_BASE_AND_BOUNDS_VIOLATION = 42, + CCERR_ACCESS_VIOLATION = 43, + CCERR_INVALID_PD_ID = 44, + CCERR_WRAP_ERROR = 45, + CCERR_INV_STAG_ACCESS_ERROR = 46, + CCERR_ZERO_RDMA_READ_RESOURCES = 47, + CCERR_QP_NOT_PRIVILEGED = 48, + CCERR_STAG_STATE_NOT_INVALID = 49, + CCERR_INVALID_PAGE_SIZE = 50, + CCERR_INVALID_BUFFER_SIZE = 51, + CCERR_INVALID_PBE = 52, + CCERR_INVALID_FBO = 53, + CCERR_INVALID_LENGTH = 54, + CCERR_INVALID_ACCESS_RIGHTS = 55, + CCERR_PBL_TOO_BIG = 56, + CCERR_INVALID_VA = 57, + CCERR_INVALID_REGION = 58, + CCERR_INVALID_WINDOW = 59, + CCERR_TOTAL_LENGTH_TOO_BIG = 60, + CCERR_INVALID_QP_ID = 61, + CCERR_ADDR_IN_USE = 
62, + CCERR_ADDR_NOT_AVAIL = 63, + CCERR_NET_DOWN = 64, + CCERR_NET_UNREACHABLE = 65, + CCERR_CONN_ABORTED = 66, + CCERR_CONN_RESET = 67, + CCERR_NO_BUFS = 68, + CCERR_CONN_TIMEDOUT = 69, + CCERR_CONN_REFUSED = 70, + CCERR_HOST_UNREACHABLE = 71, + CCERR_INVALID_SEND_SGL_DEPTH = 72, + CCERR_INVALID_RECV_SGL_DEPTH = 73, + CCERR_INVALID_RDMA_WRITE_SGL_DEPTH = 74, + CCERR_INSUFFICIENT_PRIVILEGES = 75, + CCERR_STACK_ERROR = 76, + CCERR_INVALID_VERSION = 77, + CCERR_INVALID_MTU = 78, + CCERR_INVALID_IMAGE = 79, + CCERR_PENDING = 98, /* not an error; used internally by adapter */ + CCERR_DEFER = 99, /* not an error; used internally by adapter */ + CCERR_FAILED_WRITE = 100, + CCERR_FAILED_ERASE = 101, + CCERR_FAILED_VERIFICATION = 102, + CCERR_NOT_FOUND = 103, + +}; + +/* + * CCAE_ACTIVE_CONNECT_RESULTS status result codes. + */ +enum c2_connect_status { + C2_CONN_STATUS_SUCCESS = C2_OK, + C2_CONN_STATUS_NO_MEM = CCERR_INSUFFICIENT_RESOURCES, + C2_CONN_STATUS_TIMEDOUT = CCERR_CONN_TIMEDOUT, + C2_CONN_STATUS_REFUSED = CCERR_CONN_REFUSED, + C2_CONN_STATUS_NETUNREACH = CCERR_NET_UNREACHABLE, + C2_CONN_STATUS_HOSTUNREACH = CCERR_HOST_UNREACHABLE, + C2_CONN_STATUS_INVALID_RNIC = CCERR_INVALID_RNIC, + C2_CONN_STATUS_INVALID_QP = CCERR_INVALID_QP, + C2_CONN_STATUS_INVALID_QP_STATE = CCERR_INVALID_QP_STATE, + C2_CONN_STATUS_REJECTED = CCERR_CONN_RESET, + C2_CONN_STATUS_ADDR_NOT_AVAIL = CCERR_ADDR_NOT_AVAIL, +}; + +/* + * Flash programming status codes. + */ +enum c2_flash_status { + C2_FLASH_STATUS_SUCCESS = 0x0000, + C2_FLASH_STATUS_VERIFY_ERR = 0x0002, + C2_FLASH_STATUS_IMAGE_ERR = 0x0004, + C2_FLASH_STATUS_ECLBS = 0x0400, + C2_FLASH_STATUS_PSLBS = 0x0800, + C2_FLASH_STATUS_VPENS = 0x1000, +}; + +#endif /* _C2_STATUS_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_wr.h b/drivers/infiniband/hw/amso1100/c2_wr.h new file mode 100644 index 0000000..bd9905b --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_wr.h @@ -0,0 +1,1520 @@ +/* + * Copyright (c) 2005 Ammasso, Inc.
All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_WR_H_ +#define _C2_WR_H_ + +#ifdef CCDEBUG +#define CCWR_MAGIC 0xb07700b0 +#endif + +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +/* Maximum allowed size in bytes of private_data exchange + * on connect. + */ +#define C2_MAX_PRIVATE_DATA_SIZE 200 + +/* + * These types are shared among the adapter, host, and CCIL consumer. 
+ */ +enum c2_cq_notification_type { + C2_CQ_NOTIFICATION_TYPE_NONE = 1, + C2_CQ_NOTIFICATION_TYPE_NEXT, + C2_CQ_NOTIFICATION_TYPE_NEXT_SE +}; + +enum c2_setconfig_cmd { + C2_CFG_ADD_ADDR = 1, + C2_CFG_DEL_ADDR = 2, + C2_CFG_ADD_ROUTE = 3, + C2_CFG_DEL_ROUTE = 4 +}; + +enum c2_getconfig_cmd { + C2_GETCONFIG_ROUTES = 1, + C2_GETCONFIG_ADDRS +}; + +/* + * CCIL Work Request Identifiers + */ +enum c2wr_ids { + CCWR_RNIC_OPEN = 1, + CCWR_RNIC_QUERY, + CCWR_RNIC_SETCONFIG, + CCWR_RNIC_GETCONFIG, + CCWR_RNIC_CLOSE, + CCWR_CQ_CREATE, + CCWR_CQ_QUERY, + CCWR_CQ_MODIFY, + CCWR_CQ_DESTROY, + CCWR_QP_CONNECT, + CCWR_PD_ALLOC, + CCWR_PD_DEALLOC, + CCWR_SRQ_CREATE, + CCWR_SRQ_QUERY, + CCWR_SRQ_MODIFY, + CCWR_SRQ_DESTROY, + CCWR_QP_CREATE, + CCWR_QP_QUERY, + CCWR_QP_MODIFY, + CCWR_QP_DESTROY, + CCWR_NSMR_STAG_ALLOC, + CCWR_NSMR_REGISTER, + CCWR_NSMR_PBL, + CCWR_STAG_DEALLOC, + CCWR_NSMR_REREGISTER, + CCWR_SMR_REGISTER, + CCWR_MR_QUERY, + CCWR_MW_ALLOC, + CCWR_MW_QUERY, + CCWR_EP_CREATE, + CCWR_EP_GETOPT, + CCWR_EP_SETOPT, + CCWR_EP_DESTROY, + CCWR_EP_BIND, + CCWR_EP_CONNECT, + CCWR_EP_LISTEN, + CCWR_EP_SHUTDOWN, + CCWR_EP_LISTEN_CREATE, + CCWR_EP_LISTEN_DESTROY, + CCWR_EP_QUERY, + CCWR_CR_ACCEPT, + CCWR_CR_REJECT, + CCWR_CONSOLE, + CCWR_TERM, + CCWR_FLASH_INIT, + CCWR_FLASH, + CCWR_BUF_ALLOC, + CCWR_BUF_FREE, + CCWR_FLASH_WRITE, + CCWR_INIT, /* WARNING: Don't move this ever again! */ + + + + /* Add new IDs here */ + + + + /* + * WARNING: CCWR_LAST must always be the last verbs id defined! + * All the preceding IDs are fixed, and must not change. + * You can add new IDs, but must not remove or reorder + * any IDs. If you do, YOU will ruin any hope of + * compatibility between versions. + */ + CCWR_LAST, + + /* + * Start over at 1 so that arrays indexed by user wr id's + * begin at 1. This is OK since the verbs and user wr id's + * are always used on disjoint sets of queues.
+ */ + /* + * The order of the CCWR_SEND_XX verbs must + * match the order of the RDMA_OPs + */ + CCWR_SEND = 1, + CCWR_SEND_INV, + CCWR_SEND_SE, + CCWR_SEND_SE_INV, + CCWR_RDMA_WRITE, + CCWR_RDMA_READ, + CCWR_RDMA_READ_INV, + CCWR_MW_BIND, + CCWR_NSMR_FASTREG, + CCWR_STAG_INVALIDATE, + CCWR_RECV, + CCWR_NOP, + CCWR_UNIMPL, +/* WARNING: This must always be the last user wr id defined! */ +}; +#define RDMA_SEND_OPCODE_FROM_WR_ID(x) (x+2) + +/* + * SQ/RQ Work Request Types + */ +enum c2_wr_type { + C2_WR_TYPE_SEND = CCWR_SEND, + C2_WR_TYPE_SEND_SE = CCWR_SEND_SE, + C2_WR_TYPE_SEND_INV = CCWR_SEND_INV, + C2_WR_TYPE_SEND_SE_INV = CCWR_SEND_SE_INV, + C2_WR_TYPE_RDMA_WRITE = CCWR_RDMA_WRITE, + C2_WR_TYPE_RDMA_READ = CCWR_RDMA_READ, + C2_WR_TYPE_RDMA_READ_INV_STAG = CCWR_RDMA_READ_INV, + C2_WR_TYPE_BIND_MW = CCWR_MW_BIND, + C2_WR_TYPE_FASTREG_NSMR = CCWR_NSMR_FASTREG, + C2_WR_TYPE_INV_STAG = CCWR_STAG_INVALIDATE, + C2_WR_TYPE_RECV = CCWR_RECV, + C2_WR_TYPE_NOP = CCWR_NOP, +}; + +struct c2_netaddr { + u32 ip_addr; + u32 netmask; + u32 mtu; +}; + +struct c2_route { + u32 ip_addr; /* 0 indicates the default route */ + u32 netmask; /* netmask associated with dst */ + u32 flags; + union { + u32 ipaddr; /* address of the nexthop interface */ + u8 enaddr[6]; + } nexthop; +}; + +/* + * A Scatter Gather Entry. + */ +struct c2_data_addr { + u32 stag; + u32 length; + u64 to; +}; + +/* + * MR and MW flags used by the consumer, RI, and RNIC. + */ +enum c2_mm_flags { + MEM_REMOTE = 0x0001, /* allow mw binds with remote access. 
*/ + MEM_VA_BASED = 0x0002, /* Not Zero-based */ + MEM_PBL_COMPLETE = 0x0004, /* PBL array is complete in this msg */ + MEM_LOCAL_READ = 0x0008, /* allow local reads */ + MEM_LOCAL_WRITE = 0x0010, /* allow local writes */ + MEM_REMOTE_READ = 0x0020, /* allow remote reads */ + MEM_REMOTE_WRITE = 0x0040, /* allow remote writes */ + MEM_WINDOW_BIND = 0x0080, /* binds allowed */ + MEM_SHARED = 0x0100, /* set if MR is shared */ + MEM_STAG_VALID = 0x0200 /* set if STAG is in valid state */ +}; + +/* + * CCIL API ACF flags defined in terms of the low level mem flags. + * This minimizes translation needed in the user API + */ +enum c2_acf { + C2_ACF_LOCAL_READ = MEM_LOCAL_READ, + C2_ACF_LOCAL_WRITE = MEM_LOCAL_WRITE, + C2_ACF_REMOTE_READ = MEM_REMOTE_READ, + C2_ACF_REMOTE_WRITE = MEM_REMOTE_WRITE, + C2_ACF_WINDOW_BIND = MEM_WINDOW_BIND +}; + +/* + * Image types of objects written to flash + */ +#define C2_FLASH_IMG_BITFILE 1 +#define C2_FLASH_IMG_OPTION_ROM 2 +#define C2_FLASH_IMG_VPD 3 + +/* + * to fix bug 1815 we define the max size allowable of the + * terminate message (per the IETF spec). Refer to the IETF + * protocol specification, section 12.1.6, page 64. + * The message is prefixed by 20 bytes of DDP info. + * + * Then the message has 6 bytes for the terminate control + * and DDP segment length info plus a DDP header (either + * 14 or 18 bytes) plus 28 bytes for the RDMA header. + * Thus the max size is: + * 20 + (6 + 18 + 28) = 72 + */ +#define C2_MAX_TERMINATE_MESSAGE_SIZE (72) + +/* + * Build String Length. It must be the same as C2_BUILD_STR_LEN in ccil_api.h + */ +#define WR_BUILD_STR_LEN 64 + +/* + * WARNING: All of these structs need to align any 64bit types on + * 64 bit boundaries! 64bit types include u64. + */ + +/* + * Clustercore Work Request Header. Be sensitive to field layout + * and alignment. + */ +struct c2wr_hdr { + /* wqe_count is part of the cqe.
It is put here so the + * adapter can write to it while the wr is pending without + * clobbering part of the wr. This word need not be dma'd + * from the host to adapter by libccil, but we copy it anyway + * to make the memcpy to the adapter better aligned. + */ + u32 wqe_count; + + /* Put these fields next so that later 32- and 64-bit + * quantities are naturally aligned. + */ + u8 id; + u8 result; /* adapter -> host */ + u8 sge_count; /* host -> adapter */ + u8 flags; /* host -> adapter */ + + u64 context; +#ifdef CCMSGMAGIC + u32 magic; + u32 pad; +#endif +} __attribute__((packed)); + +/* + *------------------------ RNIC ------------------------ + */ + +/* + * WR_RNIC_OPEN + */ + +/* + * Flags for the RNIC WRs + */ +enum c2_rnic_flags { + RNIC_IRD_STATIC = 0x0001, + RNIC_ORD_STATIC = 0x0002, + RNIC_QP_STATIC = 0x0004, + RNIC_SRQ_SUPPORTED = 0x0008, + RNIC_PBL_BLOCK_MODE = 0x0010, + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, + RNIC_CQ_OVF_DETECTED = 0x0040, + RNIC_PRIV_MODE = 0x0080 +}; + +struct c2wr_rnic_open_req { + struct c2wr_hdr hdr; + u64 user_context; + u16 flags; /* See enum c2_rnic_flags */ + u16 port_num; +} __attribute__((packed)); + +struct c2wr_rnic_open_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +union c2wr_rnic_open { + struct c2wr_rnic_open_req req; + struct c2wr_rnic_open_rep rep; +} __attribute__((packed)); + +struct c2wr_rnic_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +/* + * WR_RNIC_QUERY + */ +struct c2wr_rnic_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 vendor_id; + u32 part_number; + u32 hw_version; + u32 fw_ver_major; + u32 fw_ver_minor; + u32 fw_ver_patch; + char fw_ver_build_str[WR_BUILD_STR_LEN]; + u32 max_qps; + u32 max_qp_depth; + u32 max_srq_depth; + u32 max_send_sgl_depth; + u32 max_rdma_sgl_depth; + u32 max_cqs; + u32 max_cq_depth; + u32 max_cq_event_handlers; + u32 max_mrs; + u32 max_pbl_depth; + u32 max_pds; + u32 max_global_ird; + u32 
max_global_ord; + u32 max_qp_ird; + u32 max_qp_ord; + u32 flags; + u32 max_mws; + u32 pbe_range_low; + u32 pbe_range_high; + u32 max_srqs; + u32 page_size; +} __attribute__((packed)); + +union c2wr_rnic_query { + struct c2wr_rnic_query_req req; + struct c2wr_rnic_query_rep rep; +} __attribute__((packed)); + +/* + * WR_RNIC_GETCONFIG + */ + +struct c2wr_rnic_getconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* see c2_getconfig_cmd_t */ + u64 reply_buf; + u32 reply_buf_len; +} __attribute__((packed)) ; + +struct c2wr_rnic_getconfig_rep { + struct c2wr_hdr hdr; + u32 option; /* see c2_getconfig_cmd_t */ + u32 count_len; /* length of the number of addresses configured */ +} __attribute__((packed)) ; + +union c2wr_rnic_getconfig { + struct c2wr_rnic_getconfig_req req; + struct c2wr_rnic_getconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_SETCONFIG + */ +struct c2wr_rnic_setconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* See c2_setconfig_cmd_t */ + /* variable data and pad. 
See c2_netaddr and c2_route */ + u8 data[0]; +} __attribute__((packed)) ; + +struct c2wr_rnic_setconfig_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_setconfig { + struct c2wr_rnic_setconfig_req req; + struct c2wr_rnic_setconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_CLOSE + */ +struct c2wr_rnic_close_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)) ; + +struct c2wr_rnic_close_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_close { + struct c2wr_rnic_close_req req; + struct c2wr_rnic_close_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ CQ ------------------------ + */ +struct c2wr_cq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u64 msg_pool; + u32 rnic_handle; + u32 msg_size; + u32 depth; +} __attribute__((packed)) ; + +struct c2wr_cq_create_rep { + struct c2wr_hdr hdr; + u32 mq_index; + u32 adapter_shared; + u32 cq_handle; +} __attribute__((packed)) ; + +union c2wr_cq_create { + struct c2wr_cq_create_req req; + struct c2wr_cq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; + u32 new_depth; + u64 new_msg_pool; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_modify { + struct c2wr_cq_modify_req req; + struct c2wr_cq_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_destroy { + struct c2wr_cq_destroy_req req; + struct c2wr_cq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ PD ------------------------ + */ +struct c2wr_pd_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} 
__attribute__((packed)) ; + +struct c2wr_pd_alloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_alloc { + struct c2wr_pd_alloc_req req; + struct c2wr_pd_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_dealloc { + struct c2wr_pd_dealloc_req req; + struct c2wr_pd_dealloc_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ SRQ ------------------------ + */ +struct c2wr_srq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u32 rnic_handle; + u32 srq_depth; + u32 srq_limit; + u32 sgl_depth; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_srq_create_rep { + struct c2wr_hdr hdr; + u32 srq_depth; + u32 sgl_depth; + u32 msg_size; + u32 mq_index; + u32 mq_start; + u32 srq_handle; +} __attribute__((packed)) ; + +union c2wr_srq_create { + struct c2wr_srq_create_req req; + struct c2wr_srq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 srq_handle; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_srq_destroy { + struct c2wr_srq_destroy_req req; + struct c2wr_srq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ QP ------------------------ + */ +enum c2wr_qp_flags { + QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */ + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ + QP_MW_BIND = 0x00000004, /* MWs enabled */ + QP_ZERO_STAG = 0x00000008, /* enabled? */ + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */ + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ + /* enabled? 
*/ +}; + +struct c2wr_qp_create_req { + struct c2wr_hdr hdr; + u64 shared_sq_ht; + u64 shared_rq_ht; + u64 user_context; + u32 rnic_handle; + u32 sq_cq_handle; + u32 rq_cq_handle; + u32 sq_depth; + u32 rq_depth; + u32 srq_handle; + u32 srq_limit; + u32 flags; /* see enum c2wr_qp_flags */ + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_qp_create_rep { + struct c2wr_hdr hdr; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; + u32 qp_handle; +} __attribute__((packed)) ; + +union c2wr_qp_create { + struct c2wr_qp_create_req req; + struct c2wr_qp_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 rnic_handle; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 rdma_write_sgl_depth; + u32 recv_sgl_depth; + u32 ord; + u32 ird; + u16 qp_state; + u16 flags; /* see c2wr_qp_flags_t */ + u32 qp_id; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; + u32 terminate_msg_length; /* 0 if not present */ + u8 data[0]; + /* Terminate Message in-line here. 
*/ +} __attribute__((packed)) ; + +union c2wr_qp_query { + struct c2wr_qp_query_req req; + struct c2wr_qp_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_req { + struct c2wr_hdr hdr; + u64 stream_msg; + u32 stream_msg_length; + u32 rnic_handle; + u32 qp_handle; + u32 next_qp_state; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 llp_ep_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_rep { + struct c2wr_hdr hdr; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; +} __attribute__((packed)) ; + +union c2wr_qp_modify { + struct c2wr_qp_modify_req req; + struct c2wr_qp_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_qp_destroy { + struct c2wr_qp_destroy_req req; + struct c2wr_qp_destroy_rep rep; +} __attribute__((packed)) ; + +/* + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can + * only be posted when a QP is in IDLE state. After the connect request is + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state. + * No synchronous reply from adapter to this WR. The results of + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS + * See c2wr_ae_active_connect_results_t + */ +struct c2wr_qp_connect_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; + u32 remote_addr; + u16 remote_port; + u16 pad; + u32 private_data_length; + u8 private_data[0]; /* Private data in-line. */ +} __attribute__((packed)) ; + +struct c2wr_qp_connect { + struct c2wr_qp_connect_req req; + /* no synchronous reply. 
*/ +} __attribute__((packed)) ; + + +/* + *------------------------ MM ------------------------ + */ + +struct c2wr_nsmr_stag_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pbl_depth; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +struct c2wr_nsmr_stag_alloc_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_stag_alloc { + struct c2wr_nsmr_stag_alloc_req req; + struct c2wr_nsmr_stag_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_register { + struct c2wr_nsmr_register_req req; + struct c2wr_nsmr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 flags; + u32 stag_index; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_nsmr_pbl { + struct c2wr_nsmr_pbl_req req; + struct c2wr_nsmr_pbl_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mr_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_mr_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; + u32 pbl_depth; +} __attribute__((packed)) ; + +union c2wr_mr_query { + struct c2wr_mr_query_req req; + struct c2wr_mr_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 
stag_index; +} __attribute__((packed)) ; + +struct c2wr_mw_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +union c2wr_mw_query { + struct c2wr_mw_query_req req; + struct c2wr_mw_query_rep rep; +} __attribute__((packed)) ; + + +struct c2wr_stag_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_stag_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_stag_dealloc { + struct c2wr_stag_dealloc_req req; + struct c2wr_stag_dealloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + u32 pad1; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_reregister { + struct c2wr_nsmr_reregister_req req; + struct c2wr_nsmr_reregister_rep rep; +} __attribute__((packed)) ; + +struct c2wr_smr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_smr_register_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_smr_register { + struct c2wr_smr_register_req req; + struct c2wr_smr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_mw_alloc { + struct c2wr_mw_alloc_req req; + struct c2wr_mw_alloc_rep rep; +} 
__attribute__((packed)) ; + +/* + *------------------------ WRs ----------------------- + */ + +struct c2wr_user_hdr { + struct c2wr_hdr hdr; /* Has status and WR Type */ +} __attribute__((packed)) ; + +enum c2_qp_state { + C2_QP_STATE_IDLE = 0x01, + C2_QP_STATE_CONNECTING = 0x02, + C2_QP_STATE_RTS = 0x04, + C2_QP_STATE_CLOSING = 0x08, + C2_QP_STATE_TERMINATE = 0x10, + C2_QP_STATE_ERROR = 0x20, +}; + +/* Completion queue entry. */ +struct c2wr_ce { + struct c2wr_hdr hdr; /* Has status and WR Type */ + u64 qp_user_context; /* c2_user_qp_t * */ + u32 qp_state; /* Current QP State */ + u32 handle; /* QPID or EP Handle */ + u32 bytes_rcvd; /* valid for RECV WCs */ + u32 stag; +} __attribute__((packed)) ; + + +/* + * Flags used for all post-sq WRs. These must fit in the flags + * field of the struct c2wr_hdr (eight bits). + */ +enum { + SQ_SIGNALED = 0x01, + SQ_READ_FENCE = 0x02, + SQ_FENCE = 0x04, +}; + +/* + * Common fields for all post-sq WRs. Namely the standard header and a + * secondary header with fields common to all post-sq WRs. + */ +struct c2_sq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * Same as above but for post-rq WRs. + */ +struct c2_rq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * use the same struct for all sends. 
+ */ +struct c2wr_send_req { + struct c2_sq_hdr sq_hdr; + u32 sge_len; + u32 remote_stag; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); + +union c2wr_send { + struct c2wr_send_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_write_req { + struct c2_sq_hdr sq_hdr; + u64 remote_to; + u32 remote_stag; + u32 sge_len; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); + +union c2wr_rdma_write { + struct c2wr_rdma_write_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_read_req { + struct c2_sq_hdr sq_hdr; + u64 local_to; + u64 remote_to; + u32 local_stag; + u32 remote_stag; + u32 length; +} __attribute__((packed)); + +union c2wr_rdma_read { + struct c2wr_rdma_read_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_mw_bind_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 mw_stag_index; + u32 mr_stag_index; + u32 length; + u32 flags; +} __attribute__((packed)); + +union c2wr_mw_bind { + struct c2wr_mw_bind_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_nsmr_fastreg_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 stag_index; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)); + +union c2wr_nsmr_fastreg { + struct c2wr_nsmr_fastreg_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_stag_invalidate_req { + struct c2_sq_hdr sq_hdr; + u8 stag_key; + u8 pad[3]; + u32 stag_index; +} __attribute__((packed)); + +union c2wr_stag_invalidate { + struct c2wr_stag_invalidate_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +union c2wr_sqwr { + struct c2_sq_hdr sq_hdr; + struct c2wr_send_req send; + struct c2wr_send_req send_se; + struct c2wr_send_req send_inv; + struct c2wr_send_req send_se_inv; + struct c2wr_rdma_write_req rdma_write; + struct 
c2wr_rdma_read_req rdma_read; + struct c2wr_mw_bind_req mw_bind; + struct c2wr_nsmr_fastreg_req nsmr_fastreg; + struct c2wr_stag_invalidate_req stag_inv; +} __attribute__((packed)); + + +/* + * RQ WRs + */ +struct c2wr_rqwr { + struct c2_rq_hdr rq_hdr; + u8 data[0]; /* array of SGEs */ +} __attribute__((packed)); + +union c2wr_recv { + struct c2wr_rqwr req; + struct c2wr_ce rep; +} __attribute__((packed)); + +/* + * All AEs start with this header. Most AEs only need to convey the + * information in the header. Some, like LLP connection events, need + * more info. The union typedef c2wr_ae_t has all the possible AEs. + * + * hdr.context is the user_context from the rnic_open WR. NULL if this + * is not affiliated with an rnic. + * + * hdr.id is the AE identifier (e.g. CCAE_REMOTE_SHUTDOWN, + * CCAE_LLP_CLOSE_COMPLETE) + * + * resource_type is one of: C2_RES_IND_QP, C2_RES_IND_CQ, C2_RES_IND_SRQ + * + * user_context is the context passed down when the host created the resource. + */ +struct c2wr_ae_hdr { + struct c2wr_hdr hdr; + u64 user_context; /* user context for this res. */ + u32 resource_type; /* see enum c2_resource_indicator */ + u32 resource; /* handle for resource */ + u32 qp_state; /* current QP State */ +} __attribute__((packed)); + +/* + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, + * the adapter moves the QP into RTS state + */ +struct c2wr_ae_active_connect_results { + struct c2wr_ae_hdr ae_hdr; + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +/* + * When connections are established by the stack (and the private data + * MPA frame is received), the adapter will generate an event to the host.
+ * The details of the connection, any private data, and the new connection + * request handle are passed up via the CCAE_CONNECTION_REQUEST msg on the + * AE queue: + */ +struct c2wr_ae_connection_request { + struct c2wr_ae_hdr ae_hdr; + u32 cr_handle; /* connreq handle (sock ptr) */ + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +union c2wr_ae { + struct c2wr_ae_hdr ae_generic; + struct c2wr_ae_active_connect_results ae_active_connect_results; + struct c2wr_ae_connection_request ae_connection_request; +} __attribute__((packed)); + +struct c2wr_init_req { + struct c2wr_hdr hdr; + u64 hint_count; + u64 q0_host_shared; + u64 q1_host_shared; + u64 q1_host_msg_pool; + u64 q2_host_shared; + u64 q2_host_msg_pool; +} __attribute__((packed)); + +struct c2wr_init_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_init { + struct c2wr_init_req req; + struct c2wr_init_rep rep; +} __attribute__((packed)); + +/* + * For upgrading flash.
+ */ + +struct c2wr_flash_init_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +struct c2wr_flash_init_rep { + struct c2wr_hdr hdr; + u32 adapter_flash_buf_offset; + u32 adapter_flash_len; +} __attribute__((packed)); + +union c2wr_flash_init { + struct c2wr_flash_init_req req; + struct c2wr_flash_init_rep rep; +} __attribute__((packed)); + +struct c2wr_flash_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 len; +} __attribute__((packed)); + +struct c2wr_flash_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash { + struct c2wr_flash_req req; + struct c2wr_flash_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 size; +} __attribute__((packed)); + +struct c2wr_buf_alloc_rep { + struct c2wr_hdr hdr; + u32 offset; /* 0 if mem not available */ + u32 size; /* 0 if mem not available */ +} __attribute__((packed)); + +union c2wr_buf_alloc { + struct c2wr_buf_alloc_req req; + struct c2wr_buf_alloc_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_free_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; /* Must match value from alloc */ + u32 size; /* Must match value from alloc */ +} __attribute__((packed)); + +struct c2wr_buf_free_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_buf_free { + struct c2wr_buf_free_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_flash_write_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; + u32 size; + u32 type; + u32 flags; +} __attribute__((packed)); + +struct c2wr_flash_write_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash_write { + struct c2wr_flash_write_req req; + struct c2wr_flash_write_rep rep; +} __attribute__((packed)); + +/* + * Messages for LLP connection setup. + */ + +/* + * Listen Request. This allocates a listening endpoint to allow passive + * connection setup. 
Newly established LLP connections are passed up + * via an AE. See c2wr_ae_connection_request_t + */ +struct c2wr_ep_listen_create_req { + struct c2wr_hdr hdr; + u64 user_context; /* returned in AEs. */ + u32 rnic_handle; + u32 local_addr; /* local addr, or 0 */ + u16 local_port; /* 0 means "pick one" */ + u16 pad; + u32 backlog; /* traditional TCP listen backlog */ +} __attribute__((packed)); + +struct c2wr_ep_listen_create_rep { + struct c2wr_hdr hdr; + u32 ep_handle; /* handle to new listening ep */ + u16 local_port; /* resulting port... */ + u16 pad; +} __attribute__((packed)); + +union c2wr_ep_listen_create { + struct c2wr_ep_listen_create_req req; + struct c2wr_ep_listen_create_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_ep_listen_destroy { + struct c2wr_ep_listen_destroy_req req; + struct c2wr_ep_listen_destroy_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_query_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +} __attribute__((packed)); + +union c2wr_ep_query { + struct c2wr_ep_query_req req; + struct c2wr_ep_query_rep rep; +} __attribute__((packed)); + + +/* + * The host passes this down to indicate acceptance of a pending iWARP + * connection. The cr_handle was obtained from the CONNECTION_REQUEST + * AE passed up by the adapter. See c2wr_ae_connection_request_t. + */ +struct c2wr_cr_accept_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; /* QP to bind to this LLP conn */ + u32 ep_handle; /* LLP handle to accept */ + u32 private_data_length; + u8 private_data[0]; /* data in-line in msg.
*/ +} __attribute__((packed)); + +/* + * adapter sends reply when private data is successfully submitted to + * the LLP. + */ +struct c2wr_cr_accept_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_accept { + struct c2wr_cr_accept_req req; + struct c2wr_cr_accept_rep rep; +} __attribute__((packed)); + +/* + * The host sends this down if a given iWARP connection request was + * rejected by the consumer. The cr_handle was obtained from a + * previous c2wr_ae_connection_request_t AE sent by the adapter. + */ +struct c2wr_cr_reject_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; /* LLP handle to reject */ +} __attribute__((packed)); + +/* + * Dunno if this is needed, but we'll add it for now. The adapter will + * send the reject_reply after the LLP endpoint has been destroyed. + */ +struct c2wr_cr_reject_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_reject { + struct c2wr_cr_reject_req req; + struct c2wr_cr_reject_rep rep; +} __attribute__((packed)); + +/* + * console command. Used to implement a debug console over the verbs + * request and reply queues. + */ + +/* + * Console request message. It contains: + * - message hdr with id = CCWR_CONSOLE + * - the physaddr/len of host memory to be used for the reply. + * - the command string. eg: "netstat -s" or "zoneinfo" + */ +struct c2wr_console_req { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u64 reply_buf; /* pinned host buf for reply */ + u32 reply_buf_len; /* length of reply buffer */ + u8 command[0]; /* NUL terminated ascii string */ + /* containing the command req */ +} __attribute__((packed)); + +/* + * flags used in the console reply. + */ +enum c2_console_flags { + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ +} __attribute__((packed)); + +/* + * Console reply message. + * hdr.result contains the c2_status_t error if the reply was _not_ generated, + * or C2_OK if the reply was generated. 
+ */ +struct c2wr_console_rep { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u32 flags; +} __attribute__((packed)); + +union c2wr_console { + struct c2wr_console_req req; + struct c2wr_console_rep rep; +} __attribute__((packed)); + + +/* + * Giant union with all WRs. Makes life easier... + */ +union c2wr { + struct c2wr_hdr hdr; + struct c2wr_user_hdr user_hdr; + union c2wr_rnic_open rnic_open; + union c2wr_rnic_query rnic_query; + union c2wr_rnic_getconfig rnic_getconfig; + union c2wr_rnic_setconfig rnic_setconfig; + union c2wr_rnic_close rnic_close; + union c2wr_cq_create cq_create; + union c2wr_cq_modify cq_modify; + union c2wr_cq_destroy cq_destroy; + union c2wr_pd_alloc pd_alloc; + union c2wr_pd_dealloc pd_dealloc; + union c2wr_srq_create srq_create; + union c2wr_srq_destroy srq_destroy; + union c2wr_qp_create qp_create; + union c2wr_qp_query qp_query; + union c2wr_qp_modify qp_modify; + union c2wr_qp_destroy qp_destroy; + struct c2wr_qp_connect qp_connect; + union c2wr_nsmr_stag_alloc nsmr_stag_alloc; + union c2wr_nsmr_register nsmr_register; + union c2wr_nsmr_pbl nsmr_pbl; + union c2wr_mr_query mr_query; + union c2wr_mw_query mw_query; + union c2wr_stag_dealloc stag_dealloc; + union c2wr_sqwr sqwr; + struct c2wr_rqwr rqwr; + struct c2wr_ce ce; + union c2wr_ae ae; + union c2wr_init init; + union c2wr_ep_listen_create ep_listen_create; + union c2wr_ep_listen_destroy ep_listen_destroy; + union c2wr_cr_accept cr_accept; + union c2wr_cr_reject cr_reject; + union c2wr_console console; + union c2wr_flash_init flash_init; + union c2wr_flash flash; + union c2wr_buf_alloc buf_alloc; + union c2wr_buf_free buf_free; + union c2wr_flash_write flash_write; +} __attribute__((packed)); + + +/* + * Accessors for the wr fields that are packed together tightly to + * reduce the wr message size. The wr arguments are void* so that + * either a struct c2wr*, a struct c2wr_hdr*, or a pointer to any of the types + * in the struct c2wr union can be passed in. 
+ */ +static __inline__ u8 c2_wr_get_id(void *wr) +{ + return ((struct c2wr_hdr *) wr)->id; +} +static __inline__ void c2_wr_set_id(void *wr, u8 id) +{ + ((struct c2wr_hdr *) wr)->id = id; +} +static __inline__ u8 c2_wr_get_result(void *wr) +{ + return ((struct c2wr_hdr *) wr)->result; +} +static __inline__ void c2_wr_set_result(void *wr, u8 result) +{ + ((struct c2wr_hdr *) wr)->result = result; +} +static __inline__ u8 c2_wr_get_flags(void *wr) +{ + return ((struct c2wr_hdr *) wr)->flags; +} +static __inline__ void c2_wr_set_flags(void *wr, u8 flags) +{ + ((struct c2wr_hdr *) wr)->flags = flags; +} +static __inline__ u8 c2_wr_get_sge_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->sge_count; +} +static __inline__ void c2_wr_set_sge_count(void *wr, u8 sge_count) +{ + ((struct c2wr_hdr *) wr)->sge_count = sge_count; +} +static __inline__ u32 c2_wr_get_wqe_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->wqe_count; +} +static __inline__ void c2_wr_set_wqe_count(void *wr, u32 wqe_count) +{ + ((struct c2wr_hdr *) wr)->wqe_count = wqe_count; +} + +#endif /* _C2_WR_H_ */ From swise at opengridcomputing.com Tue Jun 20 13:30:55 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:30:55 -0500 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203055.31536.15131.stgit@stevo-desktop> This is the core of the driver and includes the hardware probe, low-level device interfaces and native Ethernet support. V2 Review Changes: - fixed private data memory leak on incoming connect requests. No longer need to copy the private data. The IWCM will. - correctly map host memory for DMA (don't use __pa()).
V1 Review Changes - sizeof -> sizeof() - dprintk() -> pr_debug() - removed useless asserts - assert() -> BUG_ON() - C2_DEBUG -> DEBUG - removed debug netevent code - removed arp request squelch code from intr handler, replacing it with setting arp_ignore when the c2 netdev is brought up. - removed c2_set_mac_addr(). --- drivers/infiniband/hw/amso1100/c2.c | 1255 ++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2.h | 552 +++++++++++++ drivers/infiniband/hw/amso1100/c2_ae.c | 321 ++++++++ drivers/infiniband/hw/amso1100/c2_intr.c | 209 +++++ drivers/infiniband/hw/amso1100/c2_rnic.c | 664 ++++++++++++++++ 5 files changed, 3001 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c new file mode 100644 index 0000000..4fdbd80 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.c @@ -0,0 +1,1255 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include "c2.h" +#include "c2_provider.h" + +MODULE_AUTHOR("Tom Tucker "); +MODULE_DESCRIPTION("Ammasso AMSO1100 Low-level iWARP Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; + +static int debug = -1; /* defaults above */ +module_param(debug, int, 0); +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); + +static int c2_up(struct net_device *netdev); +static int c2_down(struct net_device *netdev); +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev); +static void c2_tx_interrupt(struct net_device *netdev); +static void c2_rx_interrupt(struct net_device *netdev); +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs); +static void c2_tx_timeout(struct net_device *netdev); +static int c2_change_mtu(struct net_device *netdev, int new_mtu); +static void c2_reset(struct c2_port *c2_port); +static struct net_device_stats *c2_get_stats(struct net_device *netdev); + +static struct pci_device_id c2_pci_table[] = { + {0x18b8, 0xb001, PCI_ANY_ID, PCI_ANY_ID}, + {0} +}; + +MODULE_DEVICE_TABLE(pci, c2_pci_table); + +static void c2_print_macaddr(struct net_device *netdev) +{ + 
pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X, " + "IRQ %u\n", netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], + netdev->irq); +} + +static void c2_set_rxbufsize(struct c2_port *c2_port) +{ + struct net_device *netdev = c2_port->netdev; + + if (netdev->mtu > RX_BUF_SIZE) + c2_port->rx_buf_size = + netdev->mtu + ETH_HLEN + sizeof(struct c2_rxp_hdr) + + NET_IP_ALIGN; + else + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; +} + +/* + * Allocate TX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. + */ +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_txp_ring) +{ + struct c2_tx_desc *tx_desc; + struct c2_txp_desc __iomem *txp_desc; + struct c2_element *elem; + int i; + + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL); + if (!tx_ring->start) + return -ENOMEM; + + elem = tx_ring->start; + tx_desc = vaddr; + txp_desc = mmio_txp_ring; + for (i = 0; i < tx_ring->count; i++, elem++, tx_desc++, txp_desc++) { + tx_desc->len = 0; + tx_desc->status = 0; + + /* Set TXP_HTXD_UNINIT */ + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + (void __iomem *) txp_desc + C2_TXP_ADDR); + __raw_writew(0, (void __iomem *) txp_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + (void __iomem *) txp_desc + C2_TXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = tx_desc; + elem->hw_desc = txp_desc; + + if (i == tx_ring->count - 1) { + elem->next = tx_ring->start; + tx_desc->next_offset = base; + } else { + elem->next = elem + 1; + tx_desc->next_offset = + base + (i + 1) * sizeof(*tx_desc); + } + } + + tx_ring->to_use = tx_ring->to_clean = tx_ring->start; + + return 0; +} + +/* + * Allocate RX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. 
+ */ +static int c2_rx_ring_alloc(struct c2_ring *rx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_rxp_ring) +{ + struct c2_rx_desc *rx_desc; + struct c2_rxp_desc __iomem *rxp_desc; + struct c2_element *elem; + int i; + + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, GFP_KERNEL); + if (!rx_ring->start) + return -ENOMEM; + + elem = rx_ring->start; + rx_desc = vaddr; + rxp_desc = mmio_rxp_ring; + for (i = 0; i < rx_ring->count; i++, elem++, rx_desc++, rxp_desc++) { + rx_desc->len = 0; + rx_desc->status = 0; + + /* Set RXP_HRXD_UNINIT */ + __raw_writew(cpu_to_be16(RXP_HRXD_OK), + (void __iomem *) rxp_desc + C2_RXP_STATUS); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_COUNT); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + (void __iomem *) rxp_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + (void __iomem *) rxp_desc + C2_RXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = rx_desc; + elem->hw_desc = rxp_desc; + + if (i == rx_ring->count - 1) { + elem->next = rx_ring->start; + rx_desc->next_offset = base; + } else { + elem->next = elem + 1; + rx_desc->next_offset = + base + (i + 1) * sizeof(*rx_desc); + } + } + + rx_ring->to_use = rx_ring->to_clean = rx_ring->start; + + return 0; +} + +/* Setup buffer for receiving */ +static inline int c2_rx_alloc(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; + struct c2_rxp_hdr *rxp_hdr; + + skb = dev_alloc_skb(c2_port->rx_buf_size); + if (unlikely(!skb)) { + pr_debug("%s: out of memory for receive\n", + c2_port->netdev->name); + return -ENOMEM; + } + + /* Zero out the rxp hdr in the sk_buff */ + memset(skb->data, 0, sizeof(*rxp_hdr)); + + skb->dev = c2_port->netdev; + + maplen = c2_port->rx_buf_size; + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, + 
PCI_DMA_FROMDEVICE); + + /* Set the sk_buff RXP_header to RXP_HRXD_READY */ + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + rxp_hdr->flags = RXP_HRXD_READY; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(cpu_to_be16((u16) maplen - sizeof(*rxp_hdr)), + elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + rx_desc->len = maplen; + + return 0; +} + +/* + * Allocate buffers for the Rx ring + * For receive: rx_ring.to_clean is next received frame + */ +static int c2_rx_fill(struct c2_port *c2_port) +{ + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + int ret = 0; + + elem = rx_ring->start; + do { + if (c2_rx_alloc(c2_port, elem)) { + ret = 1; + break; + } + } while ((elem = elem->next) != rx_ring->start); + + rx_ring->to_clean = rx_ring->start; + return ret; +} + +/* Free all buffers in RX ring, assumes receiver stopped */ +static void c2_rx_clean(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + + elem = rx_ring->start; + do { + rx_desc = elem->ht_desc; + rx_desc->len = 0; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + elem->hw_desc + C2_RXP_FLAGS); + + if (elem->skb) { + pci_unmap_single(c2dev->pcidev, elem->mapaddr, + elem->maplen, PCI_DMA_FROMDEVICE); + dev_kfree_skb(elem->skb); + elem->skb = NULL; + } + } while ((elem = elem->next) != rx_ring->start); +} + +static inline int c2_tx_free(struct c2_dev *c2dev, struct c2_element *elem) +{ + struct c2_tx_desc *tx_desc = elem->ht_desc; + + 
tx_desc->len = 0; + + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, + PCI_DMA_TODEVICE); + + if (elem->skb) { + dev_kfree_skb_any(elem->skb); + elem->skb = NULL; + } + + return 0; +} + +/* Free all buffers in TX ring, assumes transmitter stopped */ +static void c2_tx_clean(struct c2_port *c2_port) +{ + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + int retry; + unsigned long flags; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + elem = tx_ring->start; + + do { + retry = 0; + do { + txp_htxd.flags = + readw(elem->hw_desc + C2_TXP_FLAGS); + + if (txp_htxd.flags == TXP_HTXD_READY) { + retry = 1; + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(0, + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_DONE), + elem->hw_desc + C2_TXP_FLAGS); + c2_port->netstats.tx_dropped++; + break; + } else { + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + elem->hw_desc + C2_TXP_FLAGS); + } + + c2_tx_free(c2_port->c2dev, elem); + + } while ((elem = elem->next) != tx_ring->start); + } while (retry); + + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->c2dev->cur_tx = tx_ring->to_use - tx_ring->start; + + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(c2_port->netdev); + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); +} + +/* + * Process transmit descriptors marked 'DONE' by the firmware, + * freeing up their unneeded sk_buffs. 
+ */ +static void c2_tx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + + spin_lock(&c2_port->tx_lock); + + for (elem = tx_ring->to_clean; elem != tx_ring->to_use; + elem = elem->next) { + txp_htxd.flags = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_FLAGS)); + + if (txp_htxd.flags != TXP_HTXD_DONE) + break; + + if (netif_msg_tx_done(c2_port)) { + /* PCI reads are expensive in fast path */ + txp_htxd.len = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_LEN)); + pr_debug("%s: tx done slot %3Zu status 0x%x len " + "%5u bytes\n", + netdev->name, elem - tx_ring->start, + txp_htxd.flags, txp_htxd.len); + } + + c2_tx_free(c2dev, elem); + ++(c2_port->tx_avail); + } + + tx_ring->to_clean = elem; + + if (netif_queue_stopped(netdev) + && c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(netdev); + + spin_unlock(&c2_port->tx_lock); +} + +static void c2_rx_error(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct c2_rxp_hdr *rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + + if (rxp_hdr->status != RXP_HRXD_OK || + rxp_hdr->len > (rx_desc->len - sizeof(*rxp_hdr))) { + pr_debug("BAD RXP_HRXD\n"); + pr_debug(" rx_desc : %p\n", rx_desc); + pr_debug(" index : %Zu\n", + elem - c2_port->rx_ring.start); + pr_debug(" len : %u\n", rx_desc->len); + pr_debug(" rxp_hdr : %p [PA %p]\n", rxp_hdr, + (void *) __pa((unsigned long) rxp_hdr)); + pr_debug(" flags : 0x%x\n", rxp_hdr->flags); + pr_debug(" status: 0x%x\n", rxp_hdr->status); + pr_debug(" len : %u\n", rxp_hdr->len); + pr_debug(" rsvd : 0x%x\n", rxp_hdr->rsvd); + } + + /* Setup the skb for reuse since we're dropping this pkt */ + elem->skb->tail = elem->skb->data = elem->skb->head; + + /* Zero out the rxp hdr in the sk_buff */ + memset(elem->skb->data, 0, sizeof(*rxp_hdr)); + + /* Write the 
descriptor to the adapter's rx ring */ + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(cpu_to_be16((u16) elem->maplen - sizeof(*rxp_hdr)), + elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(elem->mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + pr_debug("packet dropped\n"); + c2_port->netstats.rx_dropped++; +} + +static void c2_rx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + struct c2_rxp_hdr *rxp_hdr; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen, buflen; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + + /* Begin where we left off */ + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; + + for (elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; + elem = elem->next) { + rx_desc = elem->ht_desc; + mapaddr = elem->mapaddr; + maplen = elem->maplen; + skb = elem->skb; + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + + if (rxp_hdr->flags != RXP_HRXD_DONE) + break; + buflen = rxp_hdr->len; + + /* Sanity check the RXP header */ + if (rxp_hdr->status != RXP_HRXD_OK || + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { + c2_rx_error(c2_port, elem); + continue; + } + + /* + * Allocate and map a new skb for replenishing the host + * RX desc + */ + if (c2_rx_alloc(c2_port, elem)) { + c2_rx_error(c2_port, elem); + continue; + } + + /* Unmap the old skb */ + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, + PCI_DMA_FROMDEVICE); + + prefetch(skb->data); + + /* + * Skip past the leading 8 bytes comprising of the + * "struct c2_rxp_hdr", prepended by the adapter + * to the usual Ethernet header ("struct ethhdr"), + * to the start of the raw Ethernet packet. 
+ * + * Fix up the various fields in the sk_buff before + * passing it up to netif_rx(). The transfer size + * (in bytes) specified by the adapter len field of + * the "struct rxp_hdr_t" does NOT include the + * "sizeof(struct c2_rxp_hdr)". + */ + skb->data += sizeof(*rxp_hdr); + skb->tail = skb->data + buflen; + skb->len = buflen; + skb->dev = netdev; + skb->protocol = eth_type_trans(skb, netdev); + + netif_rx(skb); + + netdev->last_rx = jiffies; + c2_port->netstats.rx_packets++; + c2_port->netstats.rx_bytes += buflen; + } + + /* Save where we left off */ + rx_ring->to_clean = elem; + c2dev->cur_rx = elem - rx_ring->start; + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + spin_unlock_irqrestore(&c2dev->lock, flags); +} + +/* + * Handle netisr0 TX & RX interrupts. + */ +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + unsigned int netisr0, dmaisr; + int handled = 0; + struct c2_dev *c2dev = (struct c2_dev *) dev_id; + + /* Process CCILNET interrupts */ + netisr0 = readl(c2dev->regs + C2_NISR0); + if (netisr0) { + + /* + * There is an issue with the firmware that always + * provides the status of RX for both TX & RX + * interrupts. So process both queues here. 
+ */ + c2_rx_interrupt(c2dev->netdev); + c2_tx_interrupt(c2dev->netdev); + + /* Clear the interrupt */ + writel(netisr0, c2dev->regs + C2_NISR0); + handled++; + } + + /* Process RNIC interrupts */ + dmaisr = readl(c2dev->regs + C2_DISR); + if (dmaisr) { + writel(dmaisr, c2dev->regs + C2_DISR); + c2_rnic_interrupt(c2dev); + handled++; + } + + if (handled) { + return IRQ_HANDLED; + } else { + return IRQ_NONE; + } +} + +static int c2_up(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_element *elem; + struct c2_rxp_hdr *rxp_hdr; + struct in_device *in_dev; + size_t rx_size, tx_size; + int ret, i; + unsigned int netimr0; + + if (netif_msg_ifup(c2_port)) + pr_debug("%s: enabling interface\n", netdev->name); + + /* Set the Rx buffer size based on MTU */ + c2_set_rxbufsize(c2_port); + + /* Allocate DMA'able memory for Tx/Rx host descriptor rings */ + rx_size = c2_port->rx_ring.count * sizeof(struct c2_rx_desc); + tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc); + + c2_port->mem_size = tx_size + rx_size; + c2_port->mem = pci_alloc_consistent(c2dev->pcidev, c2_port->mem_size, + &c2_port->dma); + if (c2_port->mem == NULL) { + pr_debug("Unable to allocate memory for " + "host descriptor rings\n"); + return -ENOMEM; + } + + memset(c2_port->mem, 0, c2_port->mem_size); + + /* Create the Rx host descriptor ring */ + if ((ret = + c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, c2_port->dma, + c2dev->mmio_rxp_ring))) { + pr_debug("Unable to create RX ring\n"); + goto bail0; + } + + /* Allocate Rx buffers for the host descriptor ring */ + if (c2_rx_fill(c2_port)) { + pr_debug("Unable to fill RX ring\n"); + goto bail1; + } + + /* Create the Tx host descriptor ring */ + if ((ret = c2_tx_ring_alloc(&c2_port->tx_ring, c2_port->mem + rx_size, + c2_port->dma + rx_size, + c2dev->mmio_txp_ring))) { + pr_debug("Unable to create TX ring\n"); + goto bail1; + } + + /* Set the TX pointer to where we 
left off */ + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->tx_ring.to_use = c2_port->tx_ring.to_clean = + c2_port->tx_ring.start + c2dev->cur_tx; + + /* missing: Initialize MAC */ + + BUG_ON(c2_port->tx_ring.to_use != c2_port->tx_ring.to_clean); + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* Reset the READY bit in the sk_buff RXP headers & adapter HRXDQ */ + for (i = 0, elem = c2_port->rx_ring.start; i < c2_port->rx_ring.count; + i++, elem++) { + rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + rxp_hdr->flags = 0; + __raw_writew(cpu_to_be16(RXP_HRXD_READY), + elem->hw_desc + C2_RXP_FLAGS); + } + + /* Enable network packets */ + netif_start_queue(netdev); + + /* Enable IRQ */ + writel(0, c2dev->regs + C2_IDIS); + netimr0 = readl(c2dev->regs + C2_NIMR0); + netimr0 &= ~(C2_PCI_HTX_INT | C2_PCI_HRX_INT); + writel(netimr0, c2dev->regs + C2_NIMR0); + + /* Tell the stack to ignore arp requests for ipaddrs bound to + * other interfaces. This is needed to prevent the host stack + * from responding to arp requests to the ipaddr bound on the + * rdma interface. 
+ */ + in_dev = in_dev_get(netdev); + in_dev->cnf.arp_ignore = 1; + in_dev_put(in_dev); + + return 0; + + bail1: + c2_rx_clean(c2_port); + kfree(c2_port->rx_ring.start); + + bail0: + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return ret; +} + +static int c2_down(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + + if (netif_msg_ifdown(c2_port)) + pr_debug("%s: disabling interface\n", + netdev->name); + + /* Wait for all the queued packets to get sent */ + c2_tx_interrupt(netdev); + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Disable IRQs by clearing the interrupt mask */ + writel(1, c2dev->regs + C2_IDIS); + writel(0, c2dev->regs + C2_NIMR0); + + /* missing: Stop transmitter */ + + /* missing: Stop receiver */ + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* missing: Turn off LEDs here */ + + /* Free all buffers in the host descriptor rings */ + c2_tx_clean(c2_port); + c2_rx_clean(c2_port); + + /* Free the host descriptor rings */ + kfree(c2_port->rx_ring.start); + kfree(c2_port->tx_ring.start); + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return 0; +} + +static void c2_reset(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + unsigned int cur_rx = c2dev->cur_rx; + + /* Tell the hardware to quiesce */ + C2_SET_CUR_RX(c2dev, cur_rx | C2_PCI_HRX_QUI); + + /* + * The hardware will reset the C2_PCI_HRX_QUI bit once + * the RXP is quiesced. Wait 2 seconds for this. 
+ */ + ssleep(2); + + cur_rx = C2_GET_CUR_RX(c2dev); + + if (cur_rx & C2_PCI_HRX_QUI) + pr_debug("c2_reset: failed to quiesce the hardware!\n"); + + cur_rx &= ~C2_PCI_HRX_QUI; + + c2dev->cur_rx = cur_rx; + + pr_debug("Current RX: %u\n", c2dev->cur_rx); +} + +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + dma_addr_t mapaddr; + u32 maplen; + unsigned long flags; + unsigned int i; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + if (unlikely(c2_port->tx_avail < (skb_shinfo(skb)->nr_frags + 1))) { + netif_stop_queue(netdev); + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + pr_debug("%s: Tx ring full when queue awake!\n", + netdev->name); + return NETDEV_TX_BUSY; + } + + maplen = skb_headlen(skb); + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_TODEVICE); + + elem = tx_ring->to_use; + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + + /* Loop thru additional data fragments and queue them */ + if (skb_shinfo(skb)->nr_frags) { + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + maplen = frag->size; + mapaddr = + pci_map_page(c2dev->pcidev, frag->page, + frag->page_offset, maplen, + PCI_DMA_TODEVICE); + + elem = elem->next; + elem->skb = NULL; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), + elem->hw_desc + C2_TXP_LEN); + 
__raw_writew(cpu_to_be16(TXP_HTXD_READY), + elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + } + } + + tx_ring->to_use = elem->next; + c2_port->tx_avail -= (skb_shinfo(skb)->nr_frags + 1); + + if (c2_port->tx_avail <= MAX_SKB_FRAGS + 1) { + netif_stop_queue(netdev); + if (netif_msg_tx_queued(c2_port)) + pr_debug("%s: transmit queue full\n", + netdev->name); + } + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + netdev->trans_start = jiffies; + + return NETDEV_TX_OK; +} + +static struct net_device_stats *c2_get_stats(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + return &c2_port->netstats; +} + +static void c2_tx_timeout(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + if (netif_msg_timer(c2_port)) + pr_debug("%s: tx timeout\n", netdev->name); + + c2_tx_clean(c2_port); +} + +static int c2_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + if (netif_running(netdev)) { + c2_down(netdev); + + c2_up(netdev); + } + + return ret; +} + +/* Initialize network device */ +static struct net_device *c2_devinit(struct c2_dev *c2dev, + void __iomem * mmio_addr) +{ + struct c2_port *c2_port = NULL; + struct net_device *netdev = alloc_etherdev(sizeof(*c2_port)); + + if (!netdev) { + pr_debug("c2_port etherdev alloc failed"); + return NULL; + } + + SET_MODULE_OWNER(netdev); + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); + + netdev->open = c2_up; + netdev->stop = c2_down; + netdev->hard_start_xmit = c2_xmit_frame; + netdev->get_stats = c2_get_stats; + netdev->tx_timeout = c2_tx_timeout; + netdev->change_mtu = c2_change_mtu; + netdev->watchdog_timeo = C2_TX_TIMEOUT; + netdev->irq = c2dev->pcidev->irq; + + c2_port = netdev_priv(netdev); + c2_port->netdev = netdev; + c2_port->c2dev = c2dev; + c2_port->msg_enable = 
netif_msg_init(debug, default_msg); + c2_port->tx_ring.count = C2_NUM_TX_DESC; + c2_port->rx_ring.count = C2_NUM_RX_DESC; + + spin_lock_init(&c2_port->tx_lock); + + /* Copy our 48-bit ethernet hardware address */ + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); + + /* Validate the MAC address */ + if (!is_valid_ether_addr(netdev->dev_addr)) { + pr_debug("Invalid MAC Address\n"); + c2_print_macaddr(netdev); + free_netdev(netdev); + return NULL; + } + + c2dev->netdev = netdev; + + return netdev; +} + +static int __devinit c2_probe(struct pci_dev *pcidev, + const struct pci_device_id *ent) +{ + int ret = 0, i; + unsigned long reg0_start, reg0_flags, reg0_len; + unsigned long reg2_start, reg2_flags, reg2_len; + unsigned long reg4_start, reg4_flags, reg4_len; + unsigned kva_map_size; + struct net_device *netdev = NULL; + struct c2_dev *c2dev = NULL; + void __iomem *mmio_regs = NULL; + + printk(KERN_INFO PFX "AMSO1100 Gigabit Ethernet driver v%s loaded\n", + DRV_VERSION); + + /* Enable PCI device */ + ret = pci_enable_device(pcidev); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to enable PCI device\n", + pci_name(pcidev)); + goto bail0; + } + + reg0_start = pci_resource_start(pcidev, BAR_0); + reg0_len = pci_resource_len(pcidev, BAR_0); + reg0_flags = pci_resource_flags(pcidev, BAR_0); + + reg2_start = pci_resource_start(pcidev, BAR_2); + reg2_len = pci_resource_len(pcidev, BAR_2); + reg2_flags = pci_resource_flags(pcidev, BAR_2); + + reg4_start = pci_resource_start(pcidev, BAR_4); + reg4_len = pci_resource_len(pcidev, BAR_4); + reg4_flags = pci_resource_flags(pcidev, BAR_4); + + pr_debug("BAR0 size = 0x%lX bytes\n", reg0_len); + pr_debug("BAR2 size = 0x%lX bytes\n", reg2_len); + pr_debug("BAR4 size = 0x%lX bytes\n", reg4_len); + + /* Make sure PCI base addr are MMIO */ + if (!(reg0_flags & IORESOURCE_MEM) || + !(reg2_flags & IORESOURCE_MEM) || !(reg4_flags & IORESOURCE_MEM)) { + printk(KERN_ERR PFX "PCI regions not an MMIO resource\n"); + ret = 
-ENODEV; + goto bail1; + } + + /* Check for weird/broken PCI region reporting */ + if ((reg0_len < C2_REG0_SIZE) || + (reg2_len < C2_REG2_SIZE) || (reg4_len < C2_REG4_SIZE)) { + printk(KERN_ERR PFX "Invalid PCI region sizes\n"); + ret = -ENODEV; + goto bail1; + } + + /* Reserve PCI I/O and memory resources */ + ret = pci_request_regions(pcidev, DRV_NAME); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to request regions\n", + pci_name(pcidev)); + goto bail1; + } + + if ((sizeof(dma_addr_t) > 4)) { + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "64b DMA configuration failed\n"); + goto bail2; + } + } else { + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "32b DMA configuration failed\n"); + goto bail2; + } + } + + /* Enables bus-mastering on the device */ + pci_set_master(pcidev); + + /* Remap the adapter PCI registers in BAR4 */ + mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + sizeof(struct c2_adapter_pci_regs)); + if (mmio_regs == 0UL) { + printk(KERN_ERR PFX + "Unable to remap adapter PCI registers in BAR4\n"); + ret = -EIO; + goto bail2; + } + + /* Validate PCI regs magic */ + for (i = 0; i < sizeof(c2_magic); i++) { + if (c2_magic[i] != readb(mmio_regs + C2_REGS_MAGIC + i)) { + printk(KERN_ERR PFX "Downlevel Firmware boot loader " + "[%d/%Zd: got 0x%x, exp 0x%x]. 
Use the cc_flash " + "utility to update your boot loader\n", + i + 1, sizeof(c2_magic), + readb(mmio_regs + C2_REGS_MAGIC + i), + c2_magic[i]); + printk(KERN_ERR PFX "Adapter not claimed\n"); + iounmap(mmio_regs); + ret = -EIO; + goto bail2; + } + } + + /* Validate the adapter version */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)) != C2_VERSION) { + printk(KERN_ERR PFX "Version mismatch " + "[fw=%u, c2=%u], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)), + C2_VERSION); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Validate the adapter IVN */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)) != C2_IVN) { + printk(KERN_ERR PFX "Downlevel Firmware level. You should be using " + "the OpenIB device support kit. " + "[fw=0x%x, c2=0x%x], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)), + C2_IVN); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Allocate hardware structure */ + c2dev = (struct c2_dev *) ib_alloc_device(sizeof(*c2dev)); + if (!c2dev) { + printk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", + pci_name(pcidev)); + ret = -ENOMEM; + iounmap(mmio_regs); + goto bail2; + } + + memset(c2dev, 0, sizeof(*c2dev)); + spin_lock_init(&c2dev->lock); + c2dev->pcidev = pcidev; + c2dev->cur_tx = 0; + + /* Get the last RX index */ + c2dev->cur_rx = + (be32_to_cpu(readl(mmio_regs + C2_REGS_HRX_CUR)) - + 0xffffc000) / sizeof(struct c2_rxp_desc); + + /* Request an interrupt line for the driver */ + ret = request_irq(pcidev->irq, c2_interrupt, SA_SHIRQ, DRV_NAME, c2dev); + if (ret) { + printk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", + pci_name(pcidev), pcidev->irq); + iounmap(mmio_regs); + goto bail3; + } + + /* Set driver specific data */ + pci_set_drvdata(pcidev, c2dev); + + /* Initialize network device */ + if ((netdev = c2_devinit(c2dev, mmio_regs)) == NULL) { + iounmap(mmio_regs); + goto bail4; + } + + /* Save off the actual size prior to unmapping mmio_regs */ + kva_map_size 
= be32_to_cpu(readl(mmio_regs + C2_REGS_PCI_WINSIZE)); + + /* Unmap the adapter PCI registers in BAR4 */ + iounmap(mmio_regs); + + /* Register network device */ + ret = register_netdev(netdev); + if (ret) { + printk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", + ret); + goto bail5; + } + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Remap the adapter HRXDQ PA space to kernel VA space */ + c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + C2_RXP_HRXDQ_OFFSET, + C2_RXP_HRXDQ_SIZE); + if (c2dev->mmio_rxp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); + ret = -EIO; + goto bail6; + } + + /* Remap the adapter HTXDQ PA space to kernel VA space */ + c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + C2_TXP_HTXDQ_OFFSET, + C2_TXP_HTXDQ_SIZE); + if (c2dev->mmio_txp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); + ret = -EIO; + goto bail7; + } + + /* Save off the current RX index in the last 4 bytes of the TXP Ring */ + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ + c2dev->regs = ioremap_nocache(reg0_start, reg0_len); + if (c2dev->regs == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR0\n"); + ret = -EIO; + goto bail8; + } + + /* Remap the PCI registers in adapter BAR4 to kernel VA space */ + c2dev->pa = reg4_start + C2_PCI_REGS_OFFSET; + c2dev->kva = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + kva_map_size); + if (c2dev->kva == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR4\n"); + ret = -EIO; + goto bail9; + } + + /* Print out the MAC address */ + c2_print_macaddr(netdev); + + ret = c2_rnic_init(c2dev); + if (ret) { + printk(KERN_ERR PFX "c2_rnic_init failed: %d\n", ret); + goto bail10; + } + + c2_register_device(c2dev); + + return 0; + + bail10: + iounmap(c2dev->kva); + + bail9: + iounmap(c2dev->regs); + + bail8: + iounmap(c2dev->mmio_txp_ring); + + bail7: + iounmap(c2dev->mmio_rxp_ring); + + bail6: + 
unregister_netdev(netdev); + + bail5: + free_netdev(netdev); + + bail4: + free_irq(pcidev->irq, c2dev); + + bail3: + ib_dealloc_device(&c2dev->ibdev); + + bail2: + pci_release_regions(pcidev); + + bail1: + pci_disable_device(pcidev); + + bail0: + return ret; +} + +static void __devexit c2_remove(struct pci_dev *pcidev) +{ + struct c2_dev *c2dev = pci_get_drvdata(pcidev); + struct net_device *netdev = c2dev->netdev; + + /* Unregister with OpenIB */ + c2_unregister_device(c2dev); + + /* Clean up the RNIC resources */ + c2_rnic_term(c2dev); + + /* Remove network device from the kernel */ + unregister_netdev(netdev); + + /* Free network device */ + free_netdev(netdev); + + /* Free the interrupt line */ + free_irq(pcidev->irq, c2dev); + + /* missing: Turn LEDs off here */ + + /* Unmap adapter PA space */ + iounmap(c2dev->kva); + iounmap(c2dev->regs); + iounmap(c2dev->mmio_txp_ring); + iounmap(c2dev->mmio_rxp_ring); + + /* Free the hardware structure */ + ib_dealloc_device(&c2dev->ibdev); + + /* Release reserved PCI I/O and memory resources */ + pci_release_regions(pcidev); + + /* Disable PCI device */ + pci_disable_device(pcidev); + + /* Clear driver specific data */ + pci_set_drvdata(pcidev, NULL); +} + +static struct pci_driver c2_pci_driver = { + .name = DRV_NAME, + .id_table = c2_pci_table, + .probe = c2_probe, + .remove = __devexit_p(c2_remove), +}; + +static int __init c2_init_module(void) +{ + return pci_module_init(&c2_pci_driver); +} + +static void __exit c2_exit_module(void) +{ + pci_unregister_driver(&c2_pci_driver); +} + +module_init(c2_init_module); +module_exit(c2_exit_module); diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h new file mode 100644 index 0000000..3b17530 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -0,0 +1,552 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef __C2_H +#define __C2_H + +#include <linux/netdevice.h> +#include <linux/spinlock.h> +#include <linux/kernel.h> +#include <linux/pci.h> +#include <linux/dma-mapping.h> +#include <linux/idr.h> +#include <asm/semaphore.h> + +#include "c2_provider.h" +#include "c2_mq.h" +#include "c2_status.h" + +#define DRV_NAME "c2" +#define DRV_VERSION "1.1" +#define PFX DRV_NAME ": " + +#define BAR_0 0 +#define BAR_2 2 +#define BAR_4 4 + +#define RX_BUF_SIZE (1536 + 8) +#define ETH_JUMBO_MTU 9000 +#define C2_MAGIC "CEPHEUS" +#define C2_VERSION 4 +#define C2_IVN (18 & 0x7fffffff) + +#define C2_REG0_SIZE (16 * 1024) +#define C2_REG2_SIZE (2 * 1024 * 1024) +#define C2_REG4_SIZE (256 * 1024 * 1024) +#define C2_NUM_TX_DESC 341 +#define C2_NUM_RX_DESC 256 +#define C2_PCI_REGS_OFFSET (0x10000) +#define C2_RXP_HRXDQ_OFFSET (((C2_REG4_SIZE)/2)) +#define C2_RXP_HRXDQ_SIZE (4096) +#define C2_TXP_HTXDQ_OFFSET (((C2_REG4_SIZE)/2) + C2_RXP_HRXDQ_SIZE) +#define C2_TXP_HTXDQ_SIZE (4096) +#define C2_TX_TIMEOUT (6*HZ) + +/* CEPHEUS */ +static const u8 c2_magic[] = { + 0x43, 0x45, 0x50, 0x48, 0x45, 0x55, 0x53 +}; + +enum adapter_pci_regs { + C2_REGS_MAGIC = 0x0000, + C2_REGS_VERS = 0x0008, + C2_REGS_IVN = 0x000C, + C2_REGS_PCI_WINSIZE = 0x0010, + C2_REGS_Q0_QSIZE = 0x0014, + C2_REGS_Q0_MSGSIZE = 0x0018, + C2_REGS_Q0_POOLSTART = 0x001C, + C2_REGS_Q0_SHARED = 0x0020, + C2_REGS_Q1_QSIZE = 0x0024, + C2_REGS_Q1_MSGSIZE = 0x0028, + C2_REGS_Q1_SHARED = 0x0030, + C2_REGS_Q2_QSIZE = 0x0034, + C2_REGS_Q2_MSGSIZE = 0x0038, + C2_REGS_Q2_SHARED = 0x0040, + C2_REGS_ENADDR = 0x004C, + C2_REGS_RDMA_ENADDR = 0x0054, + C2_REGS_HRX_CUR = 0x006C, +}; + +struct c2_adapter_pci_regs { + char reg_magic[8]; + u32 version; + u32 ivn; + u32 pci_window_size; + u32 q0_q_size; + u32 q0_msg_size; + u32 q0_pool_start; + u32 q0_shared; + u32 q1_q_size; + u32 q1_msg_size; + u32 q1_pool_start; + u32 q1_shared; + u32 q2_q_size; + u32 q2_msg_size; + u32 q2_pool_start; + u32 q2_shared; + u32 log_start; + u32 log_size; + u8 host_enaddr[8]; + u8 rdma_enaddr[8]; + u32 crash_entry; + u32 crash_ready[2]; + u32 fw_txd_cur; + u32 fw_hrxd_cur; + u32 
fw_rxd_cur; +}; + +enum pci_regs { + C2_HISR = 0x0000, + C2_DISR = 0x0004, + C2_HIMR = 0x0008, + C2_DIMR = 0x000C, + C2_NISR0 = 0x0010, + C2_NISR1 = 0x0014, + C2_NIMR0 = 0x0018, + C2_NIMR1 = 0x001C, + C2_IDIS = 0x0020, +}; + +enum { + C2_PCI_HRX_INT = 1 << 8, + C2_PCI_HTX_INT = 1 << 17, + C2_PCI_HRX_QUI = 1 << 31, +}; + +/* + * Cepheus registers in BAR0. + */ +struct c2_pci_regs { + u32 hostisr; + u32 dmaisr; + u32 hostimr; + u32 dmaimr; + u32 netisr0; + u32 netisr1; + u32 netimr0; + u32 netimr1; + u32 int_disable; +}; + +/* TXP flags */ +enum c2_txp_flags { + TXP_HTXD_DONE = 0, + TXP_HTXD_READY = 1 << 0, + TXP_HTXD_UNINIT = 1 << 1, +}; + +/* RXP flags */ +enum c2_rxp_flags { + RXP_HRXD_UNINIT = 0, + RXP_HRXD_READY = 1 << 0, + RXP_HRXD_DONE = 1 << 1, +}; + +/* RXP status */ +enum c2_rxp_status { + RXP_HRXD_ZERO = 0, + RXP_HRXD_OK = 1 << 0, + RXP_HRXD_BUF_OV = 1 << 1, +}; + +/* TXP descriptor fields */ +enum txp_desc { + C2_TXP_FLAGS = 0x0000, + C2_TXP_LEN = 0x0002, + C2_TXP_ADDR = 0x0004, +}; + +/* RXP descriptor fields */ +enum rxp_desc { + C2_RXP_FLAGS = 0x0000, + C2_RXP_STATUS = 0x0002, + C2_RXP_COUNT = 0x0004, + C2_RXP_LEN = 0x0006, + C2_RXP_ADDR = 0x0008, +}; + +struct c2_txp_desc { + u16 flags; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_desc { + u16 flags; + u16 status; + u16 count; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_hdr { + u16 flags; + u16 status; + u16 len; + u16 rsvd; +} __attribute__ ((packed)); + +struct c2_tx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_rx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_alloc { + u32 last; + u32 max; + spinlock_t lock; + unsigned long *table; +}; + +struct c2_array { + struct { + void **page; + int used; + } *page_list; +}; + +/* + * The MQ shared pointer pool is organized as a linked list of + * chunks. 
Each chunk contains a linked list of free shared pointers + * that can be allocated to a given user mode client. + * + */ +struct sp_chunk { + struct sp_chunk *next; + dma_addr_t dma_addr; + DECLARE_PCI_UNMAP_ADDR(mapping); + u16 head; + u16 shared_ptr[0]; +}; + +struct c2_pd_table { + u32 last; + u32 max; + spinlock_t lock; + unsigned long *table; +}; + +struct c2_qp_table { + struct idr idr; + spinlock_t lock; + int last; +}; + +struct c2_element { + struct c2_element *next; + void *ht_desc; /* host descriptor */ + void __iomem *hw_desc; /* hardware descriptor */ + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; +}; + +struct c2_ring { + struct c2_element *to_clean; + struct c2_element *to_use; + struct c2_element *start; + unsigned long count; +}; + +struct c2_dev { + struct ib_device ibdev; + void __iomem *regs; + void __iomem *mmio_txp_ring; /* remapped adapter memory for hw rings */ + void __iomem *mmio_rxp_ring; + spinlock_t lock; + struct pci_dev *pcidev; + struct net_device *netdev; + struct net_device *pseudo_netdev; + unsigned int cur_tx; + unsigned int cur_rx; + u32 adapter_handle; + int device_cap_flags; + void __iomem *kva; /* KVA device memory */ + unsigned long pa; /* PA device memory */ + void **qptr_array; + + kmem_cache_t *host_msg_cache; + + struct list_head cca_link; /* adapter list */ + struct list_head eh_wakeup_list; /* event wakeup list */ + wait_queue_head_t req_vq_wo; + + /* Cached RNIC properties */ + struct ib_device_attr props; + + struct c2_pd_table pd_table; + struct c2_qp_table qp_table; + int ports; /* num of GigE ports */ + int devnum; + spinlock_t vqlock; /* sync vbs req MQ */ + + /* Verbs Queues */ + struct c2_mq req_vq; /* Verbs Request MQ */ + struct c2_mq rep_vq; /* Verbs Reply MQ */ + struct c2_mq aeq; /* Async Events MQ */ + + /* Kernel client MQs */ + struct sp_chunk *kern_mqsp_pool; + + /* Device updates these values when posting messages to a host + * target queue */ + u16 req_vq_shared; + u16 rep_vq_shared; + 
u16 aeq_shared; + u16 irq_claimed; + + /* + * Shared host target pages for user-accessible MQs. + */ + int hthead; /* index of first free entry */ + void *htpages; /* kernel vaddr */ + int htlen; /* length of htpages memory */ + void *htuva; /* user mapped vaddr */ + spinlock_t htlock; /* serialize allocation */ + + u64 adapter_hint_uva; /* access to the activity FIFO */ + + // spinlock_t aeq_lock; + // spinlock_t rnic_lock; + + u16 *hint_count; + dma_addr_t hint_count_dma; + u16 hints_read; + + int init; /* TRUE if it's ready */ + char ae_cache_name[16]; + char vq_cache_name[16]; +}; + +struct c2_port { + u32 msg_enable; + struct c2_dev *c2dev; + struct net_device *netdev; + + spinlock_t tx_lock; + u32 tx_avail; + struct c2_ring tx_ring; + struct c2_ring rx_ring; + + void *mem; /* PCI memory for host rings */ + dma_addr_t dma; + unsigned long mem_size; + + u32 rx_buf_size; + + struct net_device_stats netstats; +}; + +/* + * Activity FIFO registers in BAR0. + */ +#define PCI_BAR0_HOST_HINT 0x100 +#define PCI_BAR0_ADAPTER_HINT 0x2000 + +/* + * Ammasso PCI vendor id and Cepheus PCI device id. + */ +#define CQ_ARMED 0x01 +#define CQ_WAIT_FOR_DMA 0x80 + +/* + * The format of a hint is as follows: + * Lower 16 bits are the count of hints for the queue. + * Next 15 bits are the qp_index + * Upper most bit depends on who reads it: + * If read by producer, then it means Full (1) or Not-Full (0) + * If read by consumer, then it means Empty (1) or Not-Empty (0) + */ +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + + +/* + * The following defines the offset in SDRAM for the c2_adapter_pci_regs_t + * struct. 
+ */ +#define C2_ADAPTER_PCI_REGS_OFFSET 0x10000 + +#ifndef readq +static inline u64 readq(const void __iomem * addr) +{ + u64 ret = readl(addr + 4); + ret <<= 32; + ret |= readl(addr); + + return ret; +} +#endif + +#ifndef __raw_writeq +static inline void __raw_writeq(u64 val, void __iomem * addr) +{ + __raw_writel((u32) (val), addr); + __raw_writel((u32) (val >> 32), (addr + 4)); +} +#endif + +#define C2_SET_CUR_RX(c2dev, cur_rx) \ + __raw_writel(cpu_to_be32(cur_rx), c2dev->mmio_txp_ring + 4092) + +#define C2_GET_CUR_RX(c2dev) \ + be32_to_cpu(readl(c2dev->mmio_txp_ring + 4092)) + +static inline struct c2_dev *to_c2dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct c2_dev, ibdev); +} + +static inline int c2_errno(void *reply) +{ + switch (c2_wr_get_result(reply)) { + case C2_OK: + return 0; + case CCERR_NO_BUFS: + case CCERR_INSUFFICIENT_RESOURCES: + case CCERR_ZERO_RDMA_READ_RESOURCES: + return -ENOMEM; + case CCERR_MR_IN_USE: + case CCERR_QP_IN_USE: + return -EBUSY; + case CCERR_ADDR_IN_USE: + return -EADDRINUSE; + case CCERR_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + case CCERR_CONN_RESET: + return -ECONNRESET; + case CCERR_NOT_IMPLEMENTED: + case CCERR_INVALID_WQE: + return -ENOSYS; + case CCERR_QP_NOT_PRIVILEGED: + return -EPERM; + case CCERR_STACK_ERROR: + return -EPROTO; + case CCERR_ACCESS_VIOLATION: + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return -EFAULT; + case CCERR_STAG_STATE_NOT_INVALID: + case CCERR_INVALID_ADDRESS: + case CCERR_INVALID_CQ: + case CCERR_INVALID_EP: + case CCERR_INVALID_MODIFIER: + case CCERR_INVALID_MTU: + case CCERR_INVALID_PD_ID: + case CCERR_INVALID_QP: + case CCERR_INVALID_RNIC: + case CCERR_INVALID_STAG: + return -EINVAL; + default: + return -EAGAIN; + } +} + +/* Device */ +extern int c2_register_device(struct c2_dev *c2dev); +extern void c2_unregister_device(struct c2_dev *c2dev); +extern int c2_rnic_init(struct c2_dev *c2dev); +extern void c2_rnic_term(struct c2_dev *c2dev); +extern void 
c2_rnic_interrupt(struct c2_dev *c2dev); +extern int c2_rnic_query(struct c2_dev *c2dev, struct ib_device_attr *props); +extern int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); +extern int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); + +/* QPs */ +extern int c2_alloc_qp(struct c2_dev *c2dev, struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp); +extern void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp); +extern struct ib_qp *c2_get_qp(struct ib_device *device, int qpn); +extern int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, int attr_mask); +extern int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp, + int ord, int ird); +extern int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr); +extern int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr); +extern void __devinit c2_init_qp_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev); +extern void c2_set_qp_state(struct c2_qp *, int); +extern struct c2_qp *c2_find_qpn(struct c2_dev *c2dev, int qpn); + +/* PDs */ +extern int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd); +extern void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd); +extern int __devinit c2_init_pd_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev); + +/* CQs */ +extern int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq); +extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); +extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); +extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); +extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); + +/* CM */ +extern int c2_llp_connect(struct iw_cm_id 
*cm_id, + struct iw_cm_conn_param *iw_param); +extern int c2_llp_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param); +extern int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, + u8 pdata_len); +extern int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog); +extern int c2_llp_service_destroy(struct iw_cm_id *cm_id); + +/* MM */ +extern int c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 off, u64 *va, enum c2_acf acf, + struct c2_mr *mr); +extern int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index); + +/* AE */ +extern void c2_ae_event(struct c2_dev *c2dev, u32 mq_index); + +/* MQSP Allocator */ +extern int c2_init_mqsp_pool(struct c2_dev *c2dev, gfp_t gfp_mask, + struct sp_chunk **root); +extern void c2_free_mqsp_pool(struct c2_dev *c2dev, struct sp_chunk *root); +extern u16 *c2_alloc_mqsp(struct c2_dev *c2dev, struct sp_chunk *head, + dma_addr_t *dma_addr, gfp_t gfp_mask); +extern void c2_free_mqsp(u16 * mqsp); +#endif diff --git a/drivers/infiniband/hw/amso1100/c2_ae.c b/drivers/infiniband/hw/amso1100/c2_ae.c new file mode 100644 index 0000000..495e614 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.c @@ -0,0 +1,321 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include <rdma/iw_cm.h> +#include "c2_status.h" +#include "c2_ae.h" + +static int c2_convert_cm_status(u32 c2_status) +{ + switch (c2_status) { + case C2_CONN_STATUS_SUCCESS: + return 0; + case C2_CONN_STATUS_REJECTED: + return -ENETRESET; + case C2_CONN_STATUS_REFUSED: + return -ECONNREFUSED; + case C2_CONN_STATUS_TIMEDOUT: + return -ETIMEDOUT; + case C2_CONN_STATUS_NETUNREACH: + return -ENETUNREACH; + case C2_CONN_STATUS_HOSTUNREACH: + return -EHOSTUNREACH; + case C2_CONN_STATUS_INVALID_RNIC: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP_STATE: + return -EINVAL; + case C2_CONN_STATUS_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + default: + printk(KERN_ERR PFX + "%s - Unable to convert CM status: %d\n", + __FUNCTION__, c2_status); + return -EIO; + } +} + +#ifdef DEBUG +static const char* to_event_str(int event) +{ + static const char* event_str[] = { + "CCAE_REMOTE_SHUTDOWN", + "CCAE_ACTIVE_CONNECT_RESULTS", + "CCAE_CONNECTION_REQUEST", + "CCAE_LLP_CLOSE_COMPLETE", + "CCAE_TERMINATE_MESSAGE_RECEIVED", + "CCAE_LLP_CONNECTION_RESET", + "CCAE_LLP_CONNECTION_LOST", + "CCAE_LLP_SEGMENT_SIZE_INVALID", + "CCAE_LLP_INVALID_CRC", + "CCAE_LLP_BAD_FPDU", + "CCAE_INVALID_DDP_VERSION", + 
"CCAE_INVALID_RDMA_VERSION", + "CCAE_UNEXPECTED_OPCODE", + "CCAE_INVALID_DDP_QUEUE_NUMBER", + "CCAE_RDMA_READ_NOT_ENABLED", + "CCAE_RDMA_WRITE_NOT_ENABLED", + "CCAE_RDMA_READ_TOO_SMALL", + "CCAE_NO_L_BIT", + "CCAE_TAGGED_INVALID_STAG", + "CCAE_TAGGED_BASE_BOUNDS_VIOLATION", + "CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION", + "CCAE_TAGGED_INVALID_PD", + "CCAE_WRAP_ERROR", + "CCAE_BAD_CLOSE", + "CCAE_BAD_LLP_CLOSE", + "CCAE_INVALID_MSN_RANGE", + "CCAE_INVALID_MSN_GAP", + "CCAE_IRRQ_OVERFLOW", + "CCAE_IRRQ_MSN_GAP", + "CCAE_IRRQ_MSN_RANGE", + "CCAE_IRRQ_INVALID_STAG", + "CCAE_IRRQ_BASE_BOUNDS_VIOLATION", + "CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION", + "CCAE_IRRQ_INVALID_PD", + "CCAE_IRRQ_WRAP_ERROR", + "CCAE_CQ_SQ_COMPLETION_OVERFLOW", + "CCAE_CQ_RQ_COMPLETION_ERROR", + "CCAE_QP_SRQ_WQE_ERROR", + "CCAE_QP_LOCAL_CATASTROPHIC_ERROR", + "CCAE_CQ_OVERFLOW", + "CCAE_CQ_OPERATION_ERROR", + "CCAE_SRQ_LIMIT_REACHED", + "CCAE_QP_RQ_LIMIT_REACHED", + "CCAE_SRQ_CATASTROPHIC_ERROR", + "CCAE_RNIC_CATASTROPHIC_ERROR" + }; + + if (event < CCAE_REMOTE_SHUTDOWN || + event > CCAE_RNIC_CATASTROPHIC_ERROR) + return "<invalid event>"; + + event -= CCAE_REMOTE_SHUTDOWN; + return event_str[event]; +} + +const char *to_qp_state_str(int state) +{ + switch (state) { + case C2_QP_STATE_IDLE: + return "C2_QP_STATE_IDLE"; + case C2_QP_STATE_CONNECTING: + return "C2_QP_STATE_CONNECTING"; + case C2_QP_STATE_RTS: + return "C2_QP_STATE_RTS"; + case C2_QP_STATE_CLOSING: + return "C2_QP_STATE_CLOSING"; + case C2_QP_STATE_TERMINATE: + return "C2_QP_STATE_TERMINATE"; + case C2_QP_STATE_ERROR: + return "C2_QP_STATE_ERROR"; + default: + return "<invalid QP state>"; + } +} +#endif + +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_mq *mq = c2dev->qptr_array[mq_index]; + union c2wr *wr; + void *resource_user_context; + struct iw_cm_event cm_event; + struct ib_event ib_event; + enum c2_resource_indicator resource_indicator; + enum c2_event_id event_id; + unsigned long flags; + int status; + + /* + * retrieve the message + */ + wr = 
c2_mq_consume(mq); + if (!wr) + return; + + memset(&ib_event, 0, sizeof(ib_event)); + memset(&cm_event, 0, sizeof(cm_event)); + + event_id = c2_wr_get_id(wr); + resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); + resource_user_context = + (void *) (unsigned long) wr->ae.ae_generic.user_context; + + status = cm_event.status = c2_convert_cm_status(c2_wr_get_result(wr)); + + pr_debug("event received c2_dev=%p, event_id=%d, " + "resource_indicator=%d, user_context=%p, status = %d\n", + c2dev, event_id, resource_indicator, resource_user_context, + status); + + switch (resource_indicator) { + case C2_RES_IND_QP:{ + + struct c2_qp *qp = (struct c2_qp *)resource_user_context; + struct iw_cm_id *cm_id = qp->cm_id; + struct c2wr_ae_active_connect_results *res; + + if (!cm_id) { + pr_debug("event received, but cm_id is <nil>, qp=%p!\n", + qp); + goto ignore_it; + } + pr_debug("%s: event = %s, user_context=%llx, " + "resource_type=%x, " + "resource=%x, qp_state=%s\n", + __FUNCTION__, + to_event_str(event_id), + be64_to_cpu(wr->ae.ae_generic.user_context), + be32_to_cpu(wr->ae.ae_generic.resource_type), + be32_to_cpu(wr->ae.ae_generic.resource), + to_qp_state_str(be32_to_cpu(wr->ae.ae_generic.qp_state))); + + c2_set_qp_state(qp, be32_to_cpu(wr->ae.ae_generic.qp_state)); + + switch (event_id) { + case CCAE_ACTIVE_CONNECT_RESULTS: + res = &wr->ae.ae_active_connect_results; + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.local_addr.sin_addr.s_addr = res->laddr; + cm_event.remote_addr.sin_addr.s_addr = res->raddr; + cm_event.local_addr.sin_port = res->lport; + cm_event.remote_addr.sin_port = res->rport; + if (status == 0) { + cm_event.private_data_len = + be32_to_cpu(res->private_data_length); + cm_event.private_data = res->private_data; + } else { + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.private_data_len = 0; + cm_event.private_data 
= NULL; + } + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + case CCAE_TERMINATE_MESSAGE_RECEIVED: + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: + ib_event.device = &c2dev->ibdev; + ib_event.element.qp = &qp->ibqp; + ib_event.event = IB_EVENT_QP_REQ_ERR; + + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&ib_event, + qp->ibqp. + qp_context); + break; + case CCAE_BAD_CLOSE: + case CCAE_LLP_CLOSE_COMPLETE: + case CCAE_LLP_CONNECTION_RESET: + case CCAE_LLP_CONNECTION_LOST: + BUG_ON(cm_id->event_handler==(void*)0x6b6b6b6b); + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.event = IW_CM_EVENT_CLOSE; + cm_event.status = 0; + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + default: + BUG_ON(1); + pr_debug("%s:%d Unexpected event_id=%d on QP=%p, " + "CM_ID=%p\n", + __FUNCTION__, __LINE__, + event_id, qp, cm_id); + break; + } + break; + } + + case C2_RES_IND_EP:{ + + struct c2wr_ae_connection_request *req = + &wr->ae.ae_connection_request; + struct iw_cm_id *cm_id = + (struct iw_cm_id *)resource_user_context; + + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); + if (event_id != CCAE_CONNECTION_REQUEST) { + pr_debug("%s: Invalid event_id: %d\n", + __FUNCTION__, event_id); + break; + } + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; + cm_event.provider_data = (void*)(unsigned long)req->cr_handle; + cm_event.local_addr.sin_addr.s_addr = req->laddr; + cm_event.remote_addr.sin_addr.s_addr = req->raddr; + cm_event.local_addr.sin_port = req->lport; + cm_event.remote_addr.sin_port = req->rport; + cm_event.private_data_len = + be32_to_cpu(req->private_data_length); + cm_event.private_data = req->private_data; + + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + } + + case C2_RES_IND_CQ:{ + struct c2_cq *cq = + (struct c2_cq *) resource_user_context; + + 
pr_debug("IB_EVENT_CQ_ERR\n"); + ib_event.device = &c2dev->ibdev; + ib_event.element.cq = &cq->ibcq; + ib_event.event = IB_EVENT_CQ_ERR; + + if (cq->ibcq.event_handler) + cq->ibcq.event_handler(&ib_event, + cq->ibcq.cq_context); + break; + } + default: + printk(KERN_ERR PFX "Bad resource indicator = %d\n", + resource_indicator); + break; + } + + ignore_it: + c2_mq_free(mq); +} diff --git a/drivers/infiniband/hw/amso1100/c2_intr.c b/drivers/infiniband/hw/amso1100/c2_intr.c new file mode 100644 index 0000000..454e3e0 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_intr.c @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include <rdma/iw_cm.h> +#include "c2_vq.h" + +static void handle_mq(struct c2_dev *c2dev, u32 index); +static void handle_vq(struct c2_dev *c2dev, u32 mq_index); + +/* + * Handle RNIC interrupts + */ +void c2_rnic_interrupt(struct c2_dev *c2dev) +{ + unsigned int mq_index; + + while (c2dev->hints_read != be16_to_cpu(*c2dev->hint_count)) { + mq_index = readl(c2dev->regs + PCI_BAR0_HOST_HINT); + if (mq_index & 0x80000000) { + break; + } + + c2dev->hints_read++; + handle_mq(c2dev, mq_index); + } + +} + +/* + * Top level MQ handler + */ +static void handle_mq(struct c2_dev *c2dev, u32 mq_index) +{ + if (c2dev->qptr_array[mq_index] == NULL) { + pr_debug("handle_mq: stray activity for mq_index=%d\n", + mq_index); + return; + } + + switch (mq_index) { + case (0): + /* + * An index of 0 in the activity queue + * indicates the req vq now has messages + * available... + * + * Wake up any waiters waiting on req VQ + * message availability. + */ + wake_up(&c2dev->req_vq_wo); + break; + case (1): + handle_vq(c2dev, mq_index); + break; + case (2): + /* We have to purge the VQ in case there are pending + * accept reply requests that would result in the + * generation of an ESTABLISHED event. If we don't + * generate these first, a CLOSE event could end up + * being delivered before the ESTABLISHED event. + */ + handle_vq(c2dev, 1); + + c2_ae_event(c2dev, mq_index); + break; + default: + /* There is no event synchronization between CQ events + * and AE or CM events. In fact, CQE could be + * delivered for all of the I/O up to and including the + * FLUSH for a peer disconnect prior to the ESTABLISHED + * event being delivered to the app. 
The reason for this + * is that CM events are delivered on a thread, while AE + * and CQ events are delivered in interrupt context. + */ + c2_cq_event(c2dev, mq_index); + break; + } + + return; +} + +/* + * Handles verbs WR replies. + */ +static void handle_vq(struct c2_dev *c2dev, u32 mq_index) +{ + void *adapter_msg, *reply_msg; + struct c2wr_hdr *host_msg; + struct c2wr_hdr tmp; + struct c2_mq *reply_vq; + struct c2_vq_req *req; + struct iw_cm_event cm_event; + int err; + + reply_vq = (struct c2_mq *) c2dev->qptr_array[mq_index]; + + /* + * get next msg from mq_index into adapter_msg. + * don't free it yet. + */ + adapter_msg = c2_mq_consume(reply_vq); + if (adapter_msg == NULL) { + return; + } + + host_msg = vq_repbuf_alloc(c2dev); + + /* + * If we can't get a host buffer, then we'll still + * wakeup the waiter, we just won't give him the msg. + * It is assumed the waiter will deal with this... + */ + if (!host_msg) { + pr_debug("handle_vq: no repbufs!\n"); + + /* + * just copy the WR header into a local variable. + * this allows us to still demux on the context + */ + host_msg = &tmp; + memcpy(host_msg, adapter_msg, sizeof(tmp)); + reply_msg = NULL; + } else { + memcpy(host_msg, adapter_msg, reply_vq->msg_size); + reply_msg = host_msg; + } + + /* + * consume the msg from the MQ + */ + c2_mq_free(reply_vq); + + /* + * wakeup the waiter. + */ + req = (struct c2_vq_req *) (unsigned long) host_msg->context; + if (req == NULL) { + /* + * We should never get here, as the adapter should + * never send us a reply that we're not expecting. 
+ */ + vq_repbuf_free(c2dev, host_msg); + pr_debug("handle_vq: UNEXPECTEDLY got NULL req\n"); + return; + } + + err = c2_errno(reply_msg); + if (!err) switch (req->event) { + case IW_CM_EVENT_ESTABLISHED: + c2_set_qp_state(req->qp, + C2_QP_STATE_RTS); + case IW_CM_EVENT_CLOSE: + + /* + * Move the QP to RTS if this is + * the established event + */ + cm_event.event = req->event; + cm_event.status = 0; + cm_event.local_addr = req->cm_id->local_addr; + cm_event.remote_addr = req->cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + req->cm_id->event_handler(req->cm_id, &cm_event); + break; + default: + break; + } + + req->reply_msg = (u64) (unsigned long) (reply_msg); + atomic_set(&req->reply_ready, 1); + wake_up(&req->wait_object); + + /* + * If the request was cancelled, then this put will + * free the vq_req memory...and reply_msg!!! + */ + vq_req_put(c2dev, req); +} diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c new file mode 100644 index 0000000..4d9cc57 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c @@ -0,0 +1,664 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include +#include "c2.h" +#include "c2_vq.h" + +/* Device capabilities */ +#define C2_MIN_PAGESIZE 1024 + +#define C2_MAX_MRS 32768 +#define C2_MAX_QPS 16000 +#define C2_MAX_WQE_SZ 256 +#define C2_MAX_QP_WR ((128*1024)/C2_MAX_WQE_SZ) +#define C2_MAX_SGES 4 +#define C2_MAX_SGE_RD 1 +#define C2_MAX_CQS 32768 +#define C2_MAX_CQES 4096 +#define C2_MAX_PDS 16384 + +/* + * Send the adapter INIT message to the amso1100 + */ +static int c2_adapter_init(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + int err; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_INIT); + wr.hdr.context = 0; + wr.hint_count = cpu_to_be64(c2dev->hint_count_dma); + wr.q0_host_shared = cpu_to_be64(c2dev->req_vq.shared_dma); + wr.q1_host_shared = cpu_to_be64(c2dev->rep_vq.shared_dma); + wr.q1_host_msg_pool = cpu_to_be64(c2dev->rep_vq.host_dma); + wr.q2_host_shared = cpu_to_be64(c2dev->aeq.shared_dma); + wr.q2_host_msg_pool = cpu_to_be64(c2dev->aeq.host_dma); + + /* Post the init message */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + + return err; 
+} + +/* + * Send the adapter TERM message to the amso1100 + */ +static void c2_adapter_term(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_TERM); + wr.hdr.context = 0; + + /* Post the TERM message */ + vq_send_wr(c2dev, (union c2wr *) & wr); + c2dev->init = 0; + + return; +} + +/* + * Query the adapter + */ +int c2_rnic_query(struct c2_dev *c2dev, + struct ib_device_attr *props) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_query_req wr; + struct c2wr_rnic_query_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_RNIC_QUERY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_query_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + if (err) + goto bail2; + + props->fw_ver = + ((u64)be32_to_cpu(reply->fw_ver_major) << 32) | + ((be32_to_cpu(reply->fw_ver_minor) & 0xFFFF) << 16) | + (be32_to_cpu(reply->fw_ver_patch) & 0xFFFF); + memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); + props->max_mr_size = 0xFFFFFFFF; + props->page_size_cap = ~(C2_MIN_PAGESIZE-1); + props->vendor_id = be32_to_cpu(reply->vendor_id); + props->vendor_part_id = be32_to_cpu(reply->part_number); + props->hw_ver = be32_to_cpu(reply->hw_version); + props->max_qp = be32_to_cpu(reply->max_qps); + props->max_qp_wr = be32_to_cpu(reply->max_qp_depth); + props->device_cap_flags = c2dev->device_cap_flags; + props->max_sge = C2_MAX_SGES; + props->max_sge_rd = C2_MAX_SGE_RD; + props->max_cq = be32_to_cpu(reply->max_cqs); + props->max_cqe = be32_to_cpu(reply->max_cq_depth); + props->max_mr = be32_to_cpu(reply->max_mrs); + props->max_pd =
be32_to_cpu(reply->max_pds); + props->max_qp_rd_atom = be32_to_cpu(reply->max_qp_ird); + props->max_ee_rd_atom = 0; + props->max_res_rd_atom = be32_to_cpu(reply->max_global_ird); + props->max_qp_init_rd_atom = be32_to_cpu(reply->max_qp_ord); + props->max_ee_init_rd_atom = 0; + props->atomic_cap = IB_ATOMIC_NONE; + props->max_ee = 0; + props->max_rdd = 0; + props->max_mw = be32_to_cpu(reply->max_mws); + props->max_raw_ipv6_qp = 0; + props->max_raw_ethy_qp = 0; + props->max_mcast_grp = 0; + props->max_mcast_qp_attach = 0; + props->max_total_mcast_qp_attach = 0; + props->max_ah = 0; + props->max_fmr = 0; + props->max_map_per_fmr = 0; + props->max_srq = 0; + props->max_srq_wr = 0; + props->max_srq_sge = 0; + props->max_pkeys = 0; + props->local_ca_ack_delay = 0; + + bail2: + vq_repbuf_free(c2dev, reply); + + bail1: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Add an IP address to the RNIC interface + */ +int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_ADD_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = 
c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Delete an IP address from the RNIC interface + */ +int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_DEL_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Open a single RNIC instance to use with all + * low level openib calls + */ +static int c2_rnic_open(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_open_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_OPEN); + wr.rnic_open.req.hdr.context = (unsigned long) (vq_req); + wr.rnic_open.req.flags = cpu_to_be16(RNIC_PRIV_MODE); + wr.rnic_open.req.port_num = cpu_to_be16(0); + wr.rnic_open.req.user_context = (unsigned long) c2dev; + + 
vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_open_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = reply->rnic_handle; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Close the RNIC instance + */ +static int c2_rnic_close(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_close_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_CLOSE); + wr.rnic_close.req.hdr.context = (unsigned long) vq_req; + wr.rnic_close.req.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_close_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Called by c2_probe to initialize the RNIC. This principally + * involves initializing the various limits and resource pools that + * comprise the RNIC instance.
+ */ +int c2_rnic_init(struct c2_dev *c2dev) +{ + int err; + u32 qsize, msgsize; + void *q1_pages; + void *q2_pages; + void __iomem *mmio_regs; + + /* Device capabilities */ + c2dev->device_cap_flags = + (IB_DEVICE_RESIZE_MAX_WR | + IB_DEVICE_CURR_QP_STATE_MOD | + IB_DEVICE_SYS_IMAGE_GUID | + IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + /* Allocate the qptr_array */ + c2dev->qptr_array = vmalloc(C2_MAX_CQS * sizeof(void *)); + if (!c2dev->qptr_array) { + return -ENOMEM; + } + + /* Initialize the qptr_array */ + memset(c2dev->qptr_array, 0, C2_MAX_CQS * sizeof(void *)); + c2dev->qptr_array[0] = (void *) &c2dev->req_vq; + c2dev->qptr_array[1] = (void *) &c2dev->rep_vq; + c2dev->qptr_array[2] = (void *) &c2dev->aeq; + + /* Initialize data structures */ + init_waitqueue_head(&c2dev->req_vq_wo); + spin_lock_init(&c2dev->vqlock); + spin_lock_init(&c2dev->lock); + + /* Allocate MQ shared pointer pool for kernel clients. User + * mode client pools are hung off the user context + */ + err = c2_init_mqsp_pool(c2dev, GFP_KERNEL, &c2dev->kern_mqsp_pool); + if (err) { + goto bail0; + } + + /* Allocate shared pointers for Q0, Q1, and Q2 from + * the shared pointer pool.
+ */ + + c2dev->hint_count = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->hint_count_dma, + GFP_KERNEL); + c2dev->req_vq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->req_vq.shared_dma, + GFP_KERNEL); + c2dev->rep_vq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->rep_vq.shared_dma, + GFP_KERNEL); + c2dev->aeq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->aeq.shared_dma, GFP_KERNEL); + if (!c2dev->hint_count || !c2dev->req_vq.shared || + !c2dev->rep_vq.shared || !c2dev->aeq.shared) { + err = -ENOMEM; + goto bail1; + } + + mmio_regs = c2dev->kva; + /* Initialize the Verbs Request Queue */ + c2_mq_req_init(&c2dev->req_vq, 0, + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_QSIZE)), + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_MSGSIZE)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_POOLSTART)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_SHARED)), + C2_MQ_ADAPTER_TARGET); + + /* Initialize the Verbs Reply Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_MSGSIZE)); + q1_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q1_pages) { + err = -ENOMEM; + goto bail1; + } + c2dev->rep_vq.host_dma = dma_map_single(c2dev->ibdev.dma_device, + (void *)q1_pages, qsize * msgsize, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&c2dev->rep_vq, mapping, c2dev->rep_vq.host_dma); + pr_debug("%s rep_vq va %p dma %llx\n", __FUNCTION__, q1_pages, + (u64)c2dev->rep_vq.host_dma); + c2_mq_rep_init(&c2dev->rep_vq, + 1, + qsize, + msgsize, + q1_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the Asynchronous Event Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_MSGSIZE)); + q2_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q2_pages) { + err = -ENOMEM; + goto bail2; + } + c2dev->aeq.host_dma =
dma_map_single(c2dev->ibdev.dma_device, + (void *)q2_pages, qsize * msgsize, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&c2dev->aeq, mapping, c2dev->aeq.host_dma); + pr_debug("%s aeq va %p dma %llx\n", __FUNCTION__, q2_pages, + (u64)c2dev->aeq.host_dma); + c2_mq_rep_init(&c2dev->aeq, + 2, + qsize, + msgsize, + q2_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the verbs request allocator */ + err = vq_init(c2dev); + if (err) + goto bail3; + + /* Enable interrupts on the adapter */ + writel(0, c2dev->regs + C2_IDIS); + + /* create the WR init message */ + err = c2_adapter_init(c2dev); + if (err) + goto bail4; + c2dev->init++; + + /* open an adapter instance */ + err = c2_rnic_open(c2dev); + if (err) + goto bail4; + + /* Initialize the cached adapter limits */ + err = c2_rnic_query(c2dev, &c2dev->props); + if (err) + goto bail5; + + /* Initialize the PD pool */ + err = c2_init_pd_table(c2dev); + if (err) + goto bail5; + + /* Initialize the QP pool */ + c2_init_qp_table(c2dev); + return 0; + + bail5: + c2_rnic_close(c2dev); + bail4: + vq_term(c2dev); + bail3: + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->aeq, mapping), + c2dev->aeq.q_size * c2dev->aeq.msg_size, + DMA_FROM_DEVICE); + kfree(q2_pages); + bail2: + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->rep_vq, mapping), + c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, + DMA_FROM_DEVICE); + kfree(q1_pages); + bail1: + c2_free_mqsp_pool(c2dev, c2dev->kern_mqsp_pool); + bail0: + vfree(c2dev->qptr_array); + + return err; +} + +/* + * Called by c2_remove to clean up the RNIC resources.
+ */ +void c2_rnic_term(struct c2_dev *c2dev) +{ + + /* Close the open adapter instance */ + c2_rnic_close(c2dev); + + /* Send the TERM message to the adapter */ + c2_adapter_term(c2dev); + + /* Disable interrupts on the adapter */ + writel(1, c2dev->regs + C2_IDIS); + + /* Free the QP pool */ + c2_cleanup_qp_table(c2dev); + + /* Free the PD pool */ + c2_cleanup_pd_table(c2dev); + + /* Free the verbs request allocator */ + vq_term(c2dev); + + /* Unmap and free the asynchronous event queue */ + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->aeq, mapping), + c2dev->aeq.q_size * c2dev->aeq.msg_size, + DMA_FROM_DEVICE); + kfree(c2dev->aeq.msg_pool.host); + + /* Unmap and free the verbs reply queue */ + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->rep_vq, mapping), + c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, + DMA_FROM_DEVICE); + kfree(c2dev->rep_vq.msg_pool.host); + + /* Free the MQ shared pointer pool */ + c2_free_mqsp_pool(c2dev, c2dev->kern_mqsp_pool); + + /* Free the qptr_array */ + vfree(c2dev->qptr_array); + + return; +} From swise at opengridcomputing.com Tue Jun 20 13:31:11 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:11 -0500 Subject: [openib-general] [PATCH v3 4/7] AMSO1100 Memory Management. In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203111.31536.80453.stgit@stevo-desktop> V2 Review Changes: - removed c2_array services and replaced them with the idr. - removed c2_alloc services and made them pd-specific. - don't use GFP_DMA. - correctly map host memory for DMA (don't use __pa()).
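The c2_alloc.c code in this patch hands out 16-bit shared-pointer slots from page-sized chunks by threading a free list through the slots themselves: each free slot stores the index of the next free slot, and 0xFFFF terminates the chain. The following userspace sketch models just that free-list technique; the names (`sp_pool`, `pool_alloc`, etc.) and the pool size are hypothetical, not the driver's, and the DMA mapping and chunk chaining are omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the intrusive u16 free list used by the MQ
 * shared-pointer pool: free slots chain to the next free index. */
#define POOL_SLOTS 8
#define POOL_END 0xFFFF          /* list terminator, as in c2_alloc.c */

struct sp_pool {
	uint16_t head;               /* index of first free slot */
	uint16_t slots[POOL_SLOTS];  /* each free slot holds the next index */
};

static void pool_init(struct sp_pool *p)
{
	int i;

	for (i = 0; i < POOL_SLOTS - 1; i++)
		p->slots[i] = (uint16_t)(i + 1);  /* slot i -> slot i+1 */
	p->slots[POOL_SLOTS - 1] = POOL_END;      /* terminate the list */
	p->head = 0;
}

/* Pop the head of the free list; returns POOL_END when exhausted. */
static uint16_t pool_alloc(struct sp_pool *p)
{
	uint16_t idx = p->head;

	if (idx != POOL_END)
		p->head = p->slots[idx];
	return idx;
}

/* Push a slot back: chain it to the old head, then point head at it. */
static void pool_free(struct sp_pool *p, uint16_t idx)
{
	p->slots[idx] = p->head;
	p->head = idx;
}
```

Allocation and free are both O(1) with no external bookkeeping, which is why the driver can keep the whole allocator inside the DMA-mapped page itself.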
V1 Review Changes: - sizeof -> sizeof() - cleaned up comments --- drivers/infiniband/hw/amso1100/c2_alloc.c | 144 +++++++++++ drivers/infiniband/hw/amso1100/c2_mm.c | 375 +++++++++++++++++++++++++++++ 2 files changed, 519 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_alloc.c b/drivers/infiniband/hw/amso1100/c2_alloc.c new file mode 100644 index 0000000..013b152 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_alloc.c @@ -0,0 +1,144 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "c2.h" + +static int c2_alloc_mqsp_chunk(struct c2_dev *c2dev, gfp_t gfp_mask, + struct sp_chunk **head) +{ + int i; + struct sp_chunk *new_head; + + new_head = (struct sp_chunk *) __get_free_page(gfp_mask); + if (new_head == NULL) + return -ENOMEM; + + new_head->dma_addr = dma_map_single(c2dev->ibdev.dma_device, new_head, + PAGE_SIZE, DMA_FROM_DEVICE); + pci_unmap_addr_set(new_head, mapping, new_head->dma_addr); + + new_head->next = NULL; + new_head->head = 0; + + /* build list where each index is the next free slot */ + for (i = 0; + i < (PAGE_SIZE - sizeof(struct sp_chunk) - + sizeof(u16)) / sizeof(u16) - 1; + i++) { + new_head->shared_ptr[i] = i + 1; + } + /* terminate list */ + new_head->shared_ptr[i] = 0xFFFF; + + *head = new_head; + return 0; +} + +int c2_init_mqsp_pool(struct c2_dev *c2dev, gfp_t gfp_mask, + struct sp_chunk **root) +{ + return c2_alloc_mqsp_chunk(c2dev, gfp_mask, root); +} + +void c2_free_mqsp_pool(struct c2_dev *c2dev, struct sp_chunk *root) +{ + struct sp_chunk *next; + + while (root) { + next = root->next; + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(root, mapping), PAGE_SIZE, + DMA_FROM_DEVICE); + free_page((unsigned long) root); + root = next; + } +} + +u16 *c2_alloc_mqsp(struct c2_dev *c2dev, struct sp_chunk *head, + dma_addr_t *dma_addr, gfp_t gfp_mask) +{ + u16 mqsp; + + while (head) { + mqsp = head->head; + if (mqsp != 0xFFFF) { + head->head = head->shared_ptr[mqsp]; + break; + } else if (head->next == NULL) { + if (c2_alloc_mqsp_chunk(c2dev, gfp_mask, &head->next) == + 0) { + head = head->next; + mqsp = head->head; + head->head = head->shared_ptr[mqsp]; + break; + } else + return NULL; + } else + head = head->next; + } + if (head) { + *dma_addr = head->dma_addr + + ((unsigned long) &(head->shared_ptr[mqsp]) - + (unsigned long) head); + pr_debug("%s addr %p dma_addr %llx\n", __FUNCTION__, + &(head->shared_ptr[mqsp]), (u64)*dma_addr); + return
&(head->shared_ptr[mqsp]); + } + return NULL; +} + +void c2_free_mqsp(u16 * mqsp) +{ + struct sp_chunk *head; + u16 idx; + + /* The chunk containing this ptr begins at the page boundary */ + head = (struct sp_chunk *) ((unsigned long) mqsp & PAGE_MASK); + + /* Link head to new mqsp */ + *mqsp = head->head; + + /* Compute the shared_ptr index */ + idx = ((unsigned long) mqsp & ~PAGE_MASK) >> 1; + idx -= (unsigned long) &(((struct sp_chunk *) 0)->shared_ptr[0]) >> 1; + + /* Point this index at the head */ + head->shared_ptr[idx] = head->head; + + /* Point head at this index */ + head->head = idx; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mm.c b/drivers/infiniband/hw/amso1100/c2_mm.c new file mode 100644 index 0000000..314ec07 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mm.c @@ -0,0 +1,375 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_vq.h" + +#define PBL_VIRT 1 +#define PBL_PHYS 2 + +/* + * Send all the PBL messages to convey the remainder of the PBL + * Wait for the adapter's reply on the last one. + * This is indicated by setting the MEM_PBL_COMPLETE in the flags. + * + * NOTE: vq_req is _not_ freed by this function. The VQ Host + * Reply buffer _is_ freed by this function. + */ +static int +send_pbl_messages(struct c2_dev *c2dev, u32 stag_index, + unsigned long va, u32 pbl_depth, + struct c2_vq_req *vq_req, int pbl_type) +{ + u32 pbe_count; /* amt that fits in a PBL msg */ + u32 count; /* amt in this PBL MSG. */ + struct c2wr_nsmr_pbl_req *wr; /* PBL WR ptr */ + struct c2wr_nsmr_pbl_rep *reply; /* reply ptr */ + int err, pbl_virt, pbl_index, i; + + switch (pbl_type) { + case PBL_VIRT: + pbl_virt = 1; + break; + case PBL_PHYS: + pbl_virt = 0; + break; + default: + return -EINVAL; + break; + } + + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_pbl_req)) / sizeof(u64); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + return -ENOMEM; + } + c2_wr_set_id(wr, CCWR_NSMR_PBL); + + /* + * Only the last PBL message will generate a reply from the verbs, + * so we set the context to 0 indicating there is no kernel verbs + * handler blocked awaiting this reply. 
+ */ + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->stag_index = stag_index; /* already swapped */ + wr->flags = 0; + pbl_index = 0; + while (pbl_depth) { + count = min(pbe_count, pbl_depth); + wr->addrs_length = cpu_to_be32(count); + + /* + * If this is the last message, then reference the + * vq request struct because we're going to wait for a reply. + * Also mark this PBL msg as the last one. + */ + if (count == pbl_depth) { + /* + * reference the request struct. dereferenced in the + * int handler. + */ + vq_req_get(c2dev, vq_req); + wr->flags = cpu_to_be32(MEM_PBL_COMPLETE); + + /* + * This is the last PBL message. + * Set the context to our VQ Request Object so we can + * wait for the reply. + */ + wr->hdr.context = (unsigned long) vq_req; + } + + /* + * If pbl_virt is set then va is a virtual address + * that describes a virtually contiguous memory + * allocation. The wr needs the start of each virtual page + * to be converted to the corresponding physical address + * of the page. If pbl_virt is not set then va is an array + * of physical addresses and there is no conversion to do. + * Just fill in the wr with what is in the array. + */ + for (i = 0; i < count; i++) { + if (pbl_virt) { + va += PAGE_SIZE; + } else { + wr->paddrs[i] = + cpu_to_be64(((u64 *)va)[pbl_index + i]); + } + } + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + /* + * Only drop the reference if we took one above, + * i.e. on the last message. + */ + if (count == pbl_depth) { + vq_req_put(c2dev, vq_req); + } + goto bail0; + } + pbl_depth -= count; + pbl_index += count; + } + + /* + * Now wait for the reply...
+ */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_nsmr_pbl_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + kfree(wr); + return err; +} + +#define C2_PBL_MAX_DEPTH 131072 +int +c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 offset, u64 *va, enum c2_acf acf, + struct c2_mr *mr) +{ + struct c2_vq_req *vq_req; + struct c2wr_nsmr_register_req *wr; + struct c2wr_nsmr_register_rep *reply; + u16 flags; + int i, pbe_count, count; + int err; + + if (!va || !length || !addr_list || !pbl_depth) + return -EINTR; + + /* + * Verify PBL depth is within rnic max + */ + if (pbl_depth > C2_PBL_MAX_DEPTH) { + return -EINTR; + } + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + /* + * build the WR + */ + c2_wr_set_id(wr, CCWR_NSMR_REGISTER); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + + flags = (acf | MEM_VA_BASED | MEM_REMOTE); + + /* + * compute how many pbes can fit in the message + */ + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_register_req)) / sizeof(u64); + + if (pbl_depth <= pbe_count) { + flags |= MEM_PBL_COMPLETE; + } + wr->flags = cpu_to_be16(flags); + wr->stag_key = 0; //stag_key; + wr->va = cpu_to_be64(*va); + wr->pd_id = mr->pd->pd_id; + wr->pbe_size = cpu_to_be32(page_size); + wr->length = cpu_to_be32(length); + wr->pbl_depth = cpu_to_be32(pbl_depth); + wr->fbo = cpu_to_be32(offset); + count = min(pbl_depth, pbe_count); + wr->addrs_length = cpu_to_be32(count); + + /* + * fill out the PBL for this message + */ + for (i = 0; i < count; i++) { + wr->paddrs[i] = 
cpu_to_be64(addr_list[i]); + } + + /* + * reference the request struct + */ + vq_req_get(c2dev, vq_req); + + /* + * send the WR to the adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + /* + * wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail1; + } + + /* + * process reply + */ + reply = + (struct c2wr_nsmr_register_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + if ((err = c2_errno(reply))) { + goto bail2; + } + //*p_pb_entries = be32_to_cpu(reply->pbl_depth); + mr->ibmr.lkey = mr->ibmr.rkey = be32_to_cpu(reply->stag_index); + vq_repbuf_free(c2dev, reply); + + /* + * if there are still more PBEs we need to send them to + * the adapter and wait for a reply on the final one. + * reuse vq_req for this purpose. + */ + pbl_depth -= count; + if (pbl_depth) { + + vq_req->reply_msg = (unsigned long) NULL; + atomic_set(&vq_req->reply_ready, 0); + err = send_pbl_messages(c2dev, + cpu_to_be32(mr->ibmr.lkey), + (unsigned long) &addr_list[i], + pbl_depth, vq_req, PBL_PHYS); + if (err) { + goto bail1; + } + } + + vq_req_free(c2dev, vq_req); + kfree(wr); + + return err; + + bail2: + vq_repbuf_free(c2dev, reply); + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index) +{ + struct c2_vq_req *vq_req; /* verbs request object */ + struct c2wr_stag_dealloc_req wr; /* work request */ + struct c2wr_stag_dealloc_rep *reply; /* WR reply */ + int err; + + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_STAG_DEALLOC); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.stag_index = cpu_to_be32(stag_index); + + /* + * reference the request struct. dereferenced in the int handler.
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_stag_dealloc_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} From swise at opengridcomputing.com Tue Jun 20 13:31:06 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:06 -0500 Subject: [openib-general] [PATCH v3 3/7] AMSO1100 OpenFabrics Provider. In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203105.31536.53569.stgit@stevo-desktop> V2 Review Changes: - removed useless atomic_t in c2_pd struct. - qp ids now allocated and mapped using IDR. - pd ids now allocated using private bit-array allocator. - correctly map host memory for DMA (don't use __pa()).
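The changelog above mentions allocating pd ids from a private bit-array allocator. As an illustration only, a minimal userspace sketch of that kind of allocator (all names hypothetical, not the driver's actual c2_pd.c code) could look like this: bit i set means id i is in use, and a next-fit cursor spreads allocations instead of always reusing the lowest id:

```c
#include <stdint.h>
#include <string.h>

/* Toy bit-array id allocator: one bit per id, 1 = allocated. */
#define MAX_IDS 64

struct id_table {
	uint64_t bits;       /* allocation bitmap */
	unsigned int last;   /* next-fit search start */
};

static void id_table_init(struct id_table *t)
{
	memset(t, 0, sizeof(*t));
}

/* Scan for a clear bit starting at 'last'; claim it, or -1 if full. */
static int id_alloc(struct id_table *t)
{
	unsigned int i;

	for (i = 0; i < MAX_IDS; i++) {
		unsigned int id = (t->last + i) % MAX_IDS;

		if (!(t->bits & ((uint64_t)1 << id))) {
			t->bits |= (uint64_t)1 << id;
			t->last = (id + 1) % MAX_IDS;
			return (int)id;
		}
	}
	return -1;
}

static void id_free(struct id_table *t, int id)
{
	t->bits &= ~((uint64_t)1 << id);
}
```

The bitmap costs one bit per id and makes double-free and leak checks trivial, which is the usual reason to prefer it over handing out raw counters for small fixed id spaces like PDs.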
V1 Review Changes: - sizeof -> sizeof() - dprintk() -> pr_debug() - assert() -> BUG_ON() - C2_DEBUG -> DEBUG - cleaned up comments --- drivers/infiniband/hw/amso1100/c2_cm.c | 452 ++++++++++++ drivers/infiniband/hw/amso1100/c2_cq.c | 433 ++++++++++++ drivers/infiniband/hw/amso1100/c2_pd.c | 89 ++ drivers/infiniband/hw/amso1100/c2_provider.c | 867 +++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_provider.h | 181 +++++ drivers/infiniband/hw/amso1100/c2_qp.c | 975 ++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_user.h | 82 ++ 7 files changed, 3079 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_cm.c b/drivers/infiniband/hw/amso1100/c2_cm.c new file mode 100644 index 0000000..018d11f --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cm.c @@ -0,0 +1,452 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_wr.h" +#include "c2_vq.h" +#include + +int c2_llp_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct ib_qp *ibqp; + struct c2_qp *qp; + struct c2wr_qp_connect_req *wr; /* variable size needs a malloc. */ + struct c2_vq_req *vq_req; + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Associate QP <--> CM_ID */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + /* + * only support the max private_data length + */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail0; + } + /* + * Set the rdma read limits + */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* + * Create and send a WR_QP_CONNECT... + */ + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + c2_wr_set_id(wr, CCWR_QP_CONNECT); + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->qp_handle = qp->adapter_handle; + + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; + wr->remote_port = cm_id->remote_addr.sin_port; + + /* + * Move any private data from the callers's buf into + * the WR. 
+ */ + if (iw_param->private_data) { + wr->private_data_length = + cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], iw_param->private_data, + iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* + * Send WR to adapter. NOTE: There is no synch reply from + * the adapter. + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + vq_req_free(c2dev, vq_req); + + bail1: + kfree(wr); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog) +{ + struct c2_dev *c2dev; + struct c2wr_ep_listen_create_req wr; + struct c2wr_ep_listen_create_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; + wr.local_port = cm_id->local_addr.sin_port; + wr.backlog = cpu_to_be32(backlog); + wr.user_context = (u64) (unsigned long) cm_id; + + /* + * Reference the request struct. Dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = + (struct c2wr_ep_listen_create_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + if ((err = c2_errno(reply)) != 0) + goto bail1; + + /* + * Keep the adapter handle. 
Used in subsequent destroy + */ + cm_id->provider_data = (void*)(unsigned long) reply->ep_handle; + + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + return 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + + +int c2_llp_service_destroy(struct iw_cm_id *cm_id) +{ + + struct c2_dev *c2dev; + struct c2wr_ep_listen_destroy_req wr; + struct c2wr_ep_listen_destroy_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32)(unsigned long)cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply=(struct c2wr_ep_listen_destroy_rep *)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + if ((err = c2_errno(reply)) != 0) + goto bail1; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_llp_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct c2_qp *qp; + struct ib_qp *ibqp; + struct c2wr_cr_accept_req *wr; /* variable length WR */ + struct c2_vq_req *vq_req; + struct c2wr_cr_accept_rep *reply; /* VQ Reply msg ptr. 
*/ + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Set the RDMA read limits */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* Allocate verbs request. */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + vq_req->qp = qp; + vq_req->cm_id = cm_id; + vq_req->event = IW_CM_EVENT_ESTABLISHED; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail2; + } + + /* Build the WR */ + c2_wr_set_id(wr, CCWR_CR_ACCEPT); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->ep_handle = (u32) (unsigned long) cm_id->provider_data; + wr->qp_handle = qp->adapter_handle; + + /* Replace the cr_handle with the QP after accept */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + /* Validate private_data length */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail2; + } + + if (iw_param->private_data) { + wr->private_data_length = cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], + iw_param->private_data, iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* Reference the request struct. Dereferenced in the int handler. 
*/ + vq_req_get(c2dev, vq_req); + + /* Send WR to adapter */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail2; + } + + /* Wait for reply from adapter */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + /* Check that reply is present */ + reply = (struct c2wr_cr_accept_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + if (!err) + c2_set_qp_state(qp, C2_QP_STATE_RTS); + bail2: + kfree(wr); + bail1: + vq_req_free(c2dev, vq_req); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + struct c2_dev *c2dev; + struct c2wr_cr_reject_req wr; + struct c2_vq_req *vq_req; + struct c2wr_cr_reject_rep *reply; + int err; + + c2dev = to_c2dev(cm_id->device); + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_CR_REJECT); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32) (unsigned long) cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = (struct c2wr_cr_reject_rep *) (unsigned long) + vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + err = c2_errno(reply); + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + + bail0: + vq_req_free(c2dev, vq_req); + return err; +} diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c new file mode 100644 index 0000000..d24da05 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -0,0 +1,433 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_vq.h" +#include "c2_status.h" + +#define C2_CQ_MSG_SIZE ((sizeof(struct c2wr_ce) + 32-1) & ~(32-1)) + +struct c2_cq *c2_cq_get(struct c2_dev *c2dev, int cqn) +{ + struct c2_cq *cq; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + cq = c2dev->qptr_array[cqn]; + if (!cq) { + spin_unlock_irqrestore(&c2dev->lock, flags); + return NULL; + } + atomic_inc(&cq->refcount); + spin_unlock_irqrestore(&c2dev->lock, flags); + return cq; +} + +void c2_cq_put(struct c2_cq *cq) +{ + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void c2_cq_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_cq *cq; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) { + printk("discarding events on destroyed CQN=%d\n", mq_index); + return; + } + + (*cq->ibcq.comp_handler) (&cq->ibcq, cq->ibcq.cq_context); + c2_cq_put(cq); +} + +void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index) +{ + struct c2_cq *cq; + struct c2_mq *q; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) + return; + + spin_lock_irq(&cq->lock); + q = &cq->mq; + if (q && !c2_mq_empty(q)) { + u16 priv = q->priv; + struct c2wr_ce *msg; + + while (priv != be16_to_cpu(*q->shared)) { + msg = (struct c2wr_ce *) + (q->msg_pool.host + 
priv * q->msg_size); + if (msg->qp_user_context == (u64) (unsigned long) qp) { + msg->qp_user_context = (u64) 0; + } + priv = (priv + 1) % q->q_size; + } + } + spin_unlock_irq(&cq->lock); + c2_cq_put(cq); +} + +static inline enum ib_wc_status c2_cqe_status_to_openib(u8 status) +{ + switch (status) { + case C2_OK: + return IB_WC_SUCCESS; + case CCERR_FLUSHED: + return IB_WC_WR_FLUSH_ERR; + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return IB_WC_LOC_PROT_ERR; + case CCERR_ACCESS_VIOLATION: + return IB_WC_LOC_ACCESS_ERR; + case CCERR_TOTAL_LENGTH_TOO_BIG: + return IB_WC_LOC_LEN_ERR; + case CCERR_INVALID_WINDOW: + return IB_WC_MW_BIND_ERR; + default: + return IB_WC_GENERAL_ERR; + } +} + + +static inline int c2_poll_one(struct c2_dev *c2dev, + struct c2_cq *cq, struct ib_wc *entry) +{ + struct c2wr_ce *ce; + struct c2_qp *qp; + int is_recv = 0; + + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) { + return -EAGAIN; + } + + /* + * if the qp returned is null then this qp has already + * been freed and we are unable to process the completion. 
+ * try pulling the next message + */ + while ((qp = + (struct c2_qp *) (unsigned long) ce->qp_user_context) == NULL) { + c2_mq_free(&cq->mq); + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) + return -EAGAIN; + } + + entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce)); + entry->wr_id = ce->hdr.context; + entry->qp_num = ce->handle; + entry->wc_flags = 0; + entry->slid = 0; + entry->sl = 0; + entry->src_qp = 0; + entry->dlid_path_bits = 0; + entry->pkey_index = 0; + + switch (c2_wr_get_id(ce)) { + case C2_WR_TYPE_SEND: + entry->opcode = IB_WC_SEND; + break; + case C2_WR_TYPE_RDMA_WRITE: + entry->opcode = IB_WC_RDMA_WRITE; + break; + case C2_WR_TYPE_RDMA_READ: + entry->opcode = IB_WC_RDMA_READ; + break; + case C2_WR_TYPE_BIND_MW: + entry->opcode = IB_WC_BIND_MW; + break; + case C2_WR_TYPE_RECV: + entry->byte_len = be32_to_cpu(ce->bytes_rcvd); + entry->opcode = IB_WC_RECV; + is_recv = 1; + break; + default: + break; + } + + /* consume the WQEs */ + if (is_recv) + c2_mq_lconsume(&qp->rq_mq, 1); + else + c2_mq_lconsume(&qp->sq_mq, + be32_to_cpu(c2_wr_get_wqe_count(ce)) + 1); + + /* free the message */ + c2_mq_free(&cq->mq); + + return 0; +} + +int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) +{ + struct c2_dev *c2dev = to_c2dev(ibcq->device); + struct c2_cq *cq = to_c2cq(ibcq); + unsigned long flags; + int npolled, err; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + + err = c2_poll_one(c2dev, cq, entry + npolled); + if (err) + break; + } + + spin_unlock_irqrestore(&cq->lock, flags); + + return npolled; +} + +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct c2_mq_shared __iomem *shared; + struct c2_cq *cq; + + cq = to_c2cq(ibcq); + shared = cq->mq.peer; + + if (notify == IB_CQ_NEXT_COMP) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT, &shared->notification_type); + else if (notify == IB_CQ_SOLICITED) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT_SE, 
&shared->notification_type); + else + return -EINVAL; + + writeb(CQ_WAIT_FOR_DMA | CQ_ARMED, &shared->armed); + + /* + * Now read back shared->armed to make the PCI + * write synchronous. This is necessary for + * correct cq notification semantics. + */ + readb(&shared->armed); + + return 0; +} + +static void c2_free_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq) +{ + + dma_unmap_single(c2dev->ibdev.dma_device, pci_unmap_addr(mq, mapping), + mq->q_size * mq->msg_size, DMA_FROM_DEVICE); + free_pages((unsigned long) mq->msg_pool.host, + get_order(mq->q_size * mq->msg_size)); +} + +static int c2_alloc_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq, int q_size, + int msg_size) +{ + unsigned long pool_start; + + pool_start = __get_free_pages(GFP_KERNEL, + get_order(q_size * msg_size)); + if (!pool_start) + return -ENOMEM; + + c2_mq_rep_init(mq, + 0, /* index (currently unknown) */ + q_size, + msg_size, + (u8 *) pool_start, + NULL, /* peer (currently unknown) */ + C2_MQ_HOST_TARGET); + + mq->host_dma = dma_map_single(c2dev->ibdev.dma_device, + (void *)pool_start, + q_size * msg_size, DMA_FROM_DEVICE); + pci_unmap_addr_set(mq, mapping, mq->host_dma); + + return 0; +} + +int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq) +{ + struct c2wr_cq_create_req wr; + struct c2wr_cq_create_rep *reply; + unsigned long peer_pa; + struct c2_vq_req *vq_req; + int err; + + might_sleep(); + + cq->ibcq.cqe = entries - 1; + cq->is_kernel = !ctx; + + /* Allocate a shared pointer */ + cq->mq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &cq->mq.shared_dma, GFP_KERNEL); + if (!cq->mq.shared) + return -ENOMEM; + + /* Allocate pages for the message pool */ + err = c2_alloc_cq_buf(c2dev, &cq->mq, entries + 1, C2_CQ_MSG_SIZE); + if (err) + goto bail0; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_CREATE); + wr.hdr.context = (unsigned long) vq_req; + 
wr.rnic_handle = c2dev->adapter_handle; + wr.msg_size = cpu_to_be32(cq->mq.msg_size); + wr.depth = cpu_to_be32(cq->mq.q_size); + wr.shared_ht = cpu_to_be64(cq->mq.shared_dma); + wr.msg_pool = cpu_to_be64(cq->mq.host_dma); + wr.user_context = (u64) (unsigned long) (cq); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail2; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + reply = (struct c2wr_cq_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + if ((err = c2_errno(reply)) != 0) + goto bail3; + + cq->adapter_handle = reply->cq_handle; + cq->mq.index = be32_to_cpu(reply->mq_index); + + peer_pa = c2dev->pa + be32_to_cpu(reply->adapter_shared); + cq->mq.peer = ioremap_nocache(peer_pa, PAGE_SIZE); + if (!cq->mq.peer) { + err = -ENOMEM; + goto bail3; + } + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + /* + * Use the MQ index allocated by the adapter to + * store the CQ in the qptr_array + */ + cq->cqn = cq->mq.index; + c2dev->qptr_array[cq->cqn] = cq; + + return 0; + + bail3: + vq_repbuf_free(c2dev, reply); + bail2: + vq_req_free(c2dev, vq_req); + bail1: + c2_free_cq_buf(c2dev, &cq->mq); + bail0: + c2_free_mqsp(cq->mq.shared); + + return err; +} + +void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq) +{ + int err; + struct c2_vq_req *vq_req; + struct c2wr_cq_destroy_req wr; + struct c2wr_cq_destroy_rep *reply; + + might_sleep(); + + /* Clear CQ from the qptr array */ + spin_lock_irq(&c2dev->lock); + c2dev->qptr_array[cq->mq.index] = NULL; + atomic_dec(&cq->refcount); + spin_unlock_irq(&c2dev->lock); + + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + goto bail0; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_DESTROY); + 
wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.cq_handle = cq->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = (struct c2wr_cq_destroy_rep *) (unsigned long) (vq_req->reply_msg); + + vq_repbuf_free(c2dev, reply); + bail1: + vq_req_free(c2dev, vq_req); + bail0: + if (cq->is_kernel) { + c2_free_cq_buf(c2dev, &cq->mq); + } + + return; +} diff --git a/drivers/infiniband/hw/amso1100/c2_pd.c b/drivers/infiniband/hw/amso1100/c2_pd.c new file mode 100644 index 0000000..b9a647a --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_pd.c @@ -0,0 +1,89 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include + +#include "c2.h" +#include "c2_provider.h" + +int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd) +{ + u32 obj; + int ret = 0; + + spin_lock(&c2dev->pd_table.lock); + obj = find_next_zero_bit(c2dev->pd_table.table, c2dev->pd_table.max, + c2dev->pd_table.last); + if (obj >= c2dev->pd_table.max) + obj = find_first_zero_bit(c2dev->pd_table.table, + c2dev->pd_table.max); + if (obj < c2dev->pd_table.max) { + pd->pd_id = obj; + __set_bit(obj, c2dev->pd_table.table); + c2dev->pd_table.last = obj+1; + if (c2dev->pd_table.last >= c2dev->pd_table.max) + c2dev->pd_table.last = 0; + } else + ret = -ENOMEM; + spin_unlock(&c2dev->pd_table.lock); + return ret; +} + +void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd) +{ + spin_lock(&c2dev->pd_table.lock); + __clear_bit(pd->pd_id, c2dev->pd_table.table); + spin_unlock(&c2dev->pd_table.lock); +} + +int __devinit c2_init_pd_table(struct c2_dev *c2dev) +{ + + c2dev->pd_table.last = 0; + c2dev->pd_table.max = c2dev->props.max_pd; + spin_lock_init(&c2dev->pd_table.lock); + c2dev->pd_table.table = kmalloc(BITS_TO_LONGS(c2dev->props.max_pd) * + sizeof(long), GFP_KERNEL); + if (!c2dev->pd_table.table) + return -ENOMEM; + bitmap_zero(c2dev->pd_table.table, c2dev->props.max_pd); + return 0; +} + +void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev) +{ + kfree(c2dev->pd_table.table); +} diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c new file 
mode 100644 index 0000000..a0c176e --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -0,0 +1,867 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include "c2.h" +#include "c2_provider.h" +#include "c2_user.h" + +static int c2_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct c2_dev *c2dev = to_c2dev(ibdev); + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + *props = c2dev->props; + return 0; +} + +static int c2_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 1; + props->active_speed = 1; + + return 0; +} + +static int c2_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return 0; +} + +static int c2_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + *pkey = 0; + return 0; +} + +static int c2_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct c2_dev *c2dev = to_c2dev(ibdev); + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), c2dev->pseudo_netdev->dev_addr, 6); + + return 0; +} + +/* Allocate the user context data structure. This keeps track + * of all objects associated with a particular user-mode client. 
+ */ +static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct c2_ucontext *context; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + + return &context->ibucontext; +} + +static int c2_dealloc_ucontext(struct ib_ucontext *context) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + kfree(context); + return 0; +} + +static int c2_mmap_uar(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static struct ib_pd *c2_alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct c2_pd *pd; + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + pd = kmalloc(sizeof(*pd), GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = c2_pd_alloc(to_c2dev(ibdev), !context, pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + if (context) { + if (ib_copy_to_udata(udata, &pd->pd_id, sizeof(__u32))) { + c2_pd_free(to_c2dev(ibdev), pd); + kfree(pd); + return ERR_PTR(-EFAULT); + } + } + + return &pd->ibpd; +} + +static int c2_dealloc_pd(struct ib_pd *pd) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + c2_pd_free(to_c2dev(pd->device), to_c2pd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *c2_ah_create(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return ERR_PTR(-ENOSYS); +} + +static int c2_ah_destroy(struct ib_ah *ah) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static void c2_add_ref(struct ib_qp *ibqp) +{ + struct c2_qp *qp; + BUG_ON(!ibqp); + qp = to_c2qp(ibqp); + atomic_inc(&qp->refcount); +} + +static void c2_rem_ref(struct ib_qp *ibqp) +{ + struct c2_qp *qp; + BUG_ON(!ibqp); + qp = to_c2qp(ibqp); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +struct ib_qp *c2_get_qp(struct 
ib_device *device, int qpn) +{ + struct c2_dev* c2dev = to_c2dev(device); + struct c2_qp *qp; + + qp = c2_find_qpn(c2dev, qpn); + pr_debug("%s Returning QP=%p for QPN=%d, device=%p, refcount=%d\n", + __FUNCTION__, qp, qpn, device, + (qp?atomic_read(&qp->refcount):0)); + + return (qp?&qp->ibqp:NULL); +} + +static struct ib_qp *c2_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata) +{ + struct c2_qp *qp; + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + switch (init_attr->qp_type) { + case IB_QPT_RC: + qp = kzalloc(sizeof(*qp), GFP_KERNEL); + if (!qp) { + pr_debug("%s: Unable to allocate QP\n", __FUNCTION__); + return ERR_PTR(-ENOMEM); + } + spin_lock_init(&qp->lock); + if (pd->uobject) { + /* userspace specific */ + } + + err = c2_alloc_qp(to_c2dev(pd->device), + to_c2pd(pd), init_attr, qp); + + if (err && pd->uobject) { + /* userspace specific */ + } + + break; + default: + pr_debug("%s: Invalid QP type: %d\n", __FUNCTION__, + init_attr->qp_type); + return ERR_PTR(-EINVAL); + break; + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + return &qp->ibqp; +} + +static int c2_destroy_qp(struct ib_qp *ib_qp) +{ + struct c2_qp *qp = to_c2qp(ib_qp); + + pr_debug("%s:%u qp=%p,qp->state=%d\n", + __FUNCTION__, __LINE__,ib_qp,qp->state); + c2_free_qp(to_c2dev(ib_qp->device), qp); + kfree(qp); + return 0; +} + +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct c2_cq *cq; + int err; + + cq = kmalloc(sizeof(*cq), GFP_KERNEL); + if (!cq) { + pr_debug("%s: Unable to allocate CQ\n", __FUNCTION__); + return ERR_PTR(-ENOMEM); + } + + err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq); + if (err) { + pr_debug("%s: error initializing CQ\n", __FUNCTION__); + kfree(cq); + return ERR_PTR(err); + } + + return &cq->ibcq; +} + +static int c2_destroy_cq(struct ib_cq *ib_cq) +{ + struct c2_cq *cq = to_c2cq(ib_cq); + + pr_debug("%s:%u\n", 
__FUNCTION__, __LINE__); + + c2_free_cq(to_c2dev(ib_cq->device), cq); + kfree(cq); + + return 0; +} + +static inline u32 c2_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? C2_ACF_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? C2_ACF_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? C2_ACF_LOCAL_WRITE : 0) | + C2_ACF_LOCAL_READ | C2_ACF_WINDOW_BIND; +} + +static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, int acc, u64 * iova_start) +{ + struct c2_mr *mr; + u64 *page_list; + u32 total_len; + int err, i, j, k, page_shift, pbl_depth; + + pbl_depth = 0; + total_len = 0; + + page_shift = PAGE_SHIFT; + /* + * If there is only 1 buffer we assume this could + * be a map of all phy mem...use a 32k page_shift. + */ + if (num_phys_buf == 1) + page_shift += 3; + + for (i = 0; i < num_phys_buf; i++) { + + if (buffer_list[i].addr & ~PAGE_MASK) { + pr_debug("Unaligned Memory Buffer: 0x%x\n", + (unsigned int) buffer_list[i].addr); + return ERR_PTR(-EINVAL); + } + + if (!buffer_list[i].size) { + pr_debug("Invalid Buffer Size\n"); + return ERR_PTR(-EINVAL); + } + + total_len += buffer_list[i].size; + pbl_depth += ALIGN(buffer_list[i].size, + (1 << page_shift)) >> page_shift; + } + + page_list = vmalloc(sizeof(u64) * pbl_depth); + if (!page_list) { + pr_debug("couldn't vmalloc page_list of size %zd\n", + (sizeof(u64) * pbl_depth)); + return ERR_PTR(-ENOMEM); + } + + for (i = 0, j = 0; i < num_phys_buf; i++) { + + int naddrs; + + naddrs = ALIGN(buffer_list[i].size, + (1 << page_shift)) >> page_shift; + for (k = 0; k < naddrs; k++) + page_list[j++] = (buffer_list[i].addr + + (k << page_shift)); + } + + mr = kmalloc(sizeof(*mr), GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + mr->pd = to_c2pd(ib_pd); + pr_debug("%s - page shift %d, pbl_depth %d, total_len %u, " + "*iova_start %llx, first pa %llx, last pa %llx\n", + __FUNCTION__, page_shift, pbl_depth, total_len, + *iova_start, page_list[0], 
page_list[pbl_depth-1]); + err = c2_nsmr_register_phys_kern(to_c2dev(ib_pd->device), page_list, + (1 << page_shift), pbl_depth, + total_len, 0, iova_start, + c2_convert_access(acc), mr); + vfree(page_list); + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva = 0; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + /* AMSO1100 limit */ + bl.size = 0xffffffff; + bl.addr = 0; + return c2_reg_phys_mr(pd, &bl, 1, acc, &kva); +} + +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + u64 *pages; + u64 kva = 0; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct c2_pd *c2pd = to_c2pd(pd); + struct c2_mr *c2mr; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + shift = ffs(region->page_size) - 1; + + c2mr = kmalloc(sizeof(*c2mr), GFP_KERNEL); + if (!c2mr) + return ERR_PTR(-ENOMEM); + c2mr->pd = c2pd; + + n = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) + n += chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + i = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) { + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = + sg_dma_address(&chunk->page_list[j]) + + (region->page_size * k); + } + } + } + + kva = (u64)region->virt_base; + err = c2_nsmr_register_phys_kern(to_c2dev(pd->device), + pages, + region->page_size, + i, + region->length, + region->offset, + &kva, + c2_convert_access(acc), + c2mr); + kfree(pages); + if (err) { + kfree(c2mr); + return ERR_PTR(err); + } + return &c2mr->ibmr; + +err: + kfree(c2mr); + return ERR_PTR(err); +} + +static int c2_dereg_mr(struct ib_mr *ib_mr) +{ + struct c2_mr *mr = to_c2mr(ib_mr); + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + 
err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey); + if (err) + pr_debug("c2_stag_dealloc failed: %d\n", err); + else + kfree(mr); + + return err; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "%x\n", dev->props.hw_ver); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "%x.%x.%x\n", + (int) (dev->props.fw_ver >> 32), + (int) (dev->props.fw_ver >> 16) & 0xffff, + (int) (dev->props.fw_ver & 0xffff)); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "AMSO1100\n"); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID"); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *c2_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +static int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask) +{ + int err; + + err = + c2_qp_modify(to_c2dev(ibqp->device), to_c2qp(ibqp), attr, + attr_mask); + + return err; +} + +static int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + pr_debug("%s:%u\n", 
__FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + /* Request a connection */ + return c2_llp_connect(cm_id, iw_param); +} + +static int c2_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + /* Accept the new connection */ + return c2_llp_accept(cm_id, iw_param); +} + +static int c2_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + err = c2_llp_reject(cm_id, pdata, pdata_len); + return err; +} + +static int c2_service_create(struct iw_cm_id *cm_id, int backlog) +{ + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + err = c2_llp_service_create(cm_id, backlog); + pr_debug("%s:%u err=%d\n", + __FUNCTION__, __LINE__, + err); + return err; +} + +static int c2_service_destroy(struct iw_cm_id *cm_id) +{ + int err; + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + err = c2_llp_service_destroy(cm_id); + + return err; +} + +static int c2_pseudo_up(struct net_device *netdev) +{ + struct in_device *ind; + struct c2_dev *c2dev = netdev->priv; + + ind = in_dev_get(netdev); + if (!ind) + return 0; + + pr_debug("adding...\n"); + for_ifa(ind) { +#ifdef DEBUG + u8 *ip = (u8 *) & ifa->ifa_address; + + pr_debug("%s: %d.%d.%d.%d\n", + ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]); +#endif + c2_add_addr(c2dev, ifa->ifa_address, ifa->ifa_mask); + } + endfor_ifa(ind); + in_dev_put(ind); + + return 0; +} + +static int c2_pseudo_down(struct net_device *netdev) +{ + struct in_device *ind; + struct c2_dev *c2dev = netdev->priv; + + ind = in_dev_get(netdev); 
+ if (!ind) + return 0; + + pr_debug("deleting...\n"); + for_ifa(ind) { +#ifdef DEBUG + u8 *ip = (u8 *) & ifa->ifa_address; + + pr_debug("%s: %d.%d.%d.%d\n", + ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]); +#endif + c2_del_addr(c2dev, ifa->ifa_address, ifa->ifa_mask); + } + endfor_ifa(ind); + in_dev_put(ind); + + return 0; +} + +static int c2_pseudo_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + kfree_skb(skb); + return NETDEV_TX_OK; +} + +static int c2_pseudo_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + /* TODO: Tell rnic about new rdma interface mtu */ + return ret; +} + +static void setup(struct net_device *netdev) +{ + SET_MODULE_OWNER(netdev); + netdev->open = c2_pseudo_up; + netdev->stop = c2_pseudo_down; + netdev->hard_start_xmit = c2_pseudo_xmit_frame; + netdev->get_stats = NULL; + netdev->tx_timeout = NULL; + netdev->set_mac_address = NULL; + netdev->change_mtu = c2_pseudo_change_mtu; + netdev->watchdog_timeo = 0; + netdev->type = ARPHRD_ETHER; + netdev->mtu = 1500; + netdev->hard_header_len = ETH_HLEN; + netdev->addr_len = ETH_ALEN; + netdev->tx_queue_len = 0; + netdev->flags |= IFF_NOARP; + return; +} + +static struct net_device *c2_pseudo_netdev_init(struct c2_dev *c2dev) +{ + char name[IFNAMSIZ]; + struct net_device *netdev; + + /* change ethxxx to iwxxx */ + strcpy(name, "iw"); + strcat(name, &c2dev->netdev->name[3]); + netdev = alloc_netdev(sizeof(*netdev), name, setup); + if (!netdev) { + printk(KERN_ERR PFX "%s - etherdev alloc failed", + __FUNCTION__); + return NULL; + } + + netdev->priv = c2dev; + + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); + + memcpy_fromio(netdev->dev_addr, c2dev->kva + C2_REGS_RDMA_ENADDR, 6); + + /* Print out the MAC address */ + pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X\n", + netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + 
netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5]); + + /* Disable network packets */ + netif_stop_queue(netdev); + return netdev; +} + +int c2_register_device(struct c2_dev *dev) +{ + int ret; + int i; + + /* Register pseudo network device */ + dev->pseudo_netdev = c2_pseudo_netdev_init(dev); + if (dev->pseudo_netdev) { + ret = register_netdev(dev->pseudo_netdev); + if (ret) { + printk(KERN_ERR PFX + "Unable to register netdev, ret = %d\n", ret); + free_netdev(dev->pseudo_netdev); + return ret; + } + } + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX); + dev->ibdev.owner = THIS_MODULE; + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + + dev->ibdev.node_type = RDMA_NODE_RNIC; + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6); + dev->ibdev.phys_port_cnt = 1; + dev->ibdev.dma_device = &dev->pcidev->dev; + dev->ibdev.class_dev.dev = &dev->pcidev->dev; + dev->ibdev.query_device = c2_query_device; + dev->ibdev.query_port = c2_query_port; + dev->ibdev.modify_port = c2_modify_port; + dev->ibdev.query_pkey = c2_query_pkey; + dev->ibdev.query_gid = c2_query_gid; + dev->ibdev.alloc_ucontext = c2_alloc_ucontext; + 
dev->ibdev.dealloc_ucontext = c2_dealloc_ucontext; + dev->ibdev.mmap = c2_mmap_uar; + dev->ibdev.alloc_pd = c2_alloc_pd; + dev->ibdev.dealloc_pd = c2_dealloc_pd; + dev->ibdev.create_ah = c2_ah_create; + dev->ibdev.destroy_ah = c2_ah_destroy; + dev->ibdev.create_qp = c2_create_qp; + dev->ibdev.modify_qp = c2_modify_qp; + dev->ibdev.destroy_qp = c2_destroy_qp; + dev->ibdev.create_cq = c2_create_cq; + dev->ibdev.destroy_cq = c2_destroy_cq; + dev->ibdev.poll_cq = c2_poll_cq; + dev->ibdev.get_dma_mr = c2_get_dma_mr; + dev->ibdev.reg_phys_mr = c2_reg_phys_mr; + dev->ibdev.reg_user_mr = c2_reg_user_mr; + dev->ibdev.dereg_mr = c2_dereg_mr; + + dev->ibdev.alloc_fmr = NULL; + dev->ibdev.unmap_fmr = NULL; + dev->ibdev.dealloc_fmr = NULL; + dev->ibdev.map_phys_fmr = NULL; + + dev->ibdev.attach_mcast = c2_multicast_attach; + dev->ibdev.detach_mcast = c2_multicast_detach; + dev->ibdev.process_mad = c2_process_mad; + + dev->ibdev.req_notify_cq = c2_arm_cq; + dev->ibdev.post_send = c2_post_send; + dev->ibdev.post_recv = c2_post_receive; + + dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL); + dev->ibdev.iwcm->add_ref = c2_add_ref; + dev->ibdev.iwcm->rem_ref = c2_rem_ref; + dev->ibdev.iwcm->get_qp = c2_get_qp; + dev->ibdev.iwcm->connect = c2_connect; + dev->ibdev.iwcm->accept = c2_accept; + dev->ibdev.iwcm->reject = c2_reject; + dev->ibdev.iwcm->create_listen = c2_service_create; + dev->ibdev.iwcm->destroy_listen = c2_service_destroy; + + ret = ib_register_device(&dev->ibdev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + c2_class_attributes[i]); + if (ret) { + unregister_netdev(dev->pseudo_netdev); + free_netdev(dev->pseudo_netdev); + ib_unregister_device(&dev->ibdev); + return ret; + } + } + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return 0; +} + +void c2_unregister_device(struct c2_dev *dev) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + 
unregister_netdev(dev->pseudo_netdev); + free_netdev(dev->pseudo_netdev); + ib_unregister_device(&dev->ibdev); +} diff --git a/drivers/infiniband/hw/amso1100/c2_provider.h b/drivers/infiniband/hw/amso1100/c2_provider.h new file mode 100644 index 0000000..0fb6f1c --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_provider.h @@ -0,0 +1,181 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#ifndef C2_PROVIDER_H +#define C2_PROVIDER_H +#include + +#include +#include + +#include "c2_mq.h" +#include + +#define C2_MPT_FLAG_ATOMIC (1 << 14) +#define C2_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define C2_MPT_FLAG_REMOTE_READ (1 << 12) +#define C2_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define C2_MPT_FLAG_LOCAL_READ (1 << 10) + +struct c2_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + + +/* The user context keeps track of objects allocated for a + * particular user-mode client. */ +struct c2_ucontext { + struct ib_ucontext ibucontext; +}; + +struct c2_mtt; + +/* All objects associated with a PD are kept in the + * associated user context if present. + */ +struct c2_pd { + struct ib_pd ibpd; + u32 pd_id; +}; + +struct c2_mr { + struct ib_mr ibmr; + struct c2_pd *pd; +}; + +struct c2_av; + +enum c2_ah_type { + C2_AH_ON_HCA, + C2_AH_PCI_POOL, + C2_AH_KMALLOC +}; + +struct c2_ah { + struct ib_ah ibah; +}; + +struct c2_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int is_kernel; + wait_queue_head_t wait; + + u32 adapter_handle; + struct c2_mq mq; +}; + +struct c2_wq { + spinlock_t lock; +}; +struct iw_cm_id; +struct c2_qp { + struct ib_qp ibqp; + struct iw_cm_id *cm_id; + spinlock_t lock; + atomic_t refcount; + wait_queue_head_t wait; + int qpn; + + u32 adapter_handle; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u8 state; + + struct c2_mq sq_mq; + struct c2_mq rq_mq; +}; + +struct c2_cr_query_attrs { + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +}; + +static inline struct c2_pd *to_c2pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct c2_pd, ibpd); +} + +static inline struct c2_ucontext *to_c2ucontext(struct ib_ucontext *ibucontext) +{ + return container_of(ibucontext, struct c2_ucontext, ibucontext); +} + +static inline struct c2_mr *to_c2mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct c2_mr, ibmr); +} + + +static inline struct c2_ah 
*to_c2ah(struct ib_ah *ibah) +{ + return container_of(ibah, struct c2_ah, ibah); +} + +static inline struct c2_cq *to_c2cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct c2_cq, ibcq); +} + +static inline struct c2_qp *to_c2qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct c2_qp, ibqp); +} + +static inline int is_rnic_addr(struct net_device *netdev, u32 addr) +{ + struct in_device *ind; + int ret = 0; + + ind = in_dev_get(netdev); + if (!ind) + return 0; + + for_ifa(ind) { + if (ifa->ifa_address == addr) { + ret = 1; + break; + } + } + endfor_ifa(ind); + in_dev_put(ind); + return ret; +} +#endif /* C2_PROVIDER_H */ diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c new file mode 100644 index 0000000..76a60bc --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_qp.c @@ -0,0 +1,975 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#include "c2.h" +#include "c2_vq.h" +#include "c2_status.h" + +#define C2_MAX_ORD_PER_QP 128 +#define C2_MAX_IRD_PER_QP 128 + +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + +#define NO_SUPPORT -1 +static const u8 c2_opcode[] = { + [IB_WR_SEND] = C2_WR_TYPE_SEND, + [IB_WR_SEND_WITH_IMM] = NO_SUPPORT, + [IB_WR_RDMA_WRITE] = C2_WR_TYPE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = NO_SUPPORT, + [IB_WR_RDMA_READ] = C2_WR_TYPE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = NO_SUPPORT, + [IB_WR_ATOMIC_FETCH_AND_ADD] = NO_SUPPORT, +}; + +static int to_c2_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + return C2_QP_STATE_IDLE; + case IB_QPS_RTS: + return C2_QP_STATE_RTS; + case IB_QPS_SQD: + return C2_QP_STATE_CLOSING; + case IB_QPS_SQE: + return C2_QP_STATE_CLOSING; + case IB_QPS_ERR: + return C2_QP_STATE_ERROR; + default: + return -1; + } +} + +int to_ib_state(enum c2_qp_state c2_state) +{ + switch (c2_state) { + case C2_QP_STATE_IDLE: + return IB_QPS_RESET; + case C2_QP_STATE_CONNECTING: + return IB_QPS_RTR; + case C2_QP_STATE_RTS: + return IB_QPS_RTS; + case 
C2_QP_STATE_CLOSING: + return IB_QPS_SQD; + case C2_QP_STATE_ERROR: + return IB_QPS_ERR; + case C2_QP_STATE_TERMINATE: + return IB_QPS_SQE; + default: + return -1; + } +} + +const char *to_ib_state_str(int ib_state) +{ + static const char *state_str[] = { + "IB_QPS_RESET", + "IB_QPS_INIT", + "IB_QPS_RTR", + "IB_QPS_RTS", + "IB_QPS_SQD", + "IB_QPS_SQE", + "IB_QPS_ERR" + }; + if (ib_state < IB_QPS_RESET || + ib_state > IB_QPS_ERR) + return ""; + + ib_state -= IB_QPS_RESET; + return state_str[ib_state]; +} + +void c2_set_qp_state(struct c2_qp *qp, int c2_state) +{ + int new_state = to_ib_state(c2_state); + + pr_debug("%s: qp[%p] state modify %s --> %s\n", + __FUNCTION__, + qp, + to_ib_state_str(qp->state), + to_ib_state_str(new_state)); + qp->state = new_state; +} + +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, int attr_mask) +{ + struct c2wr_qp_modify_req wr; + struct c2wr_qp_modify_rep *reply; + struct c2_vq_req *vq_req; + unsigned long flags; + u8 next_state; + int err; + + pr_debug("%s:%d qp=%p, %s --> %s\n", + __FUNCTION__, __LINE__, + qp, + to_ib_state_str(qp->state), + to_ib_state_str(attr->qp_state)); + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_QP_MODIFY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + wr.ord = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.ird = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + + if (attr_mask & IB_QP_STATE) { + /* Ensure the state is valid */ + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + + wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state)); + + if (attr->qp_state == IB_QPS_ERR) { + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id && qp->state == IB_QPS_RTS) { + pr_debug("Generating CLOSE event 
for QP-->ERR, " + "qp=%p, cm_id=%p\n",qp,qp->cm_id); + /* Generate a CLOSE event */ + vq_req->cm_id = qp->cm_id; + vq_req->event = IW_CM_EVENT_CLOSE; + } + spin_unlock_irqrestore(&qp->lock, flags); + next_state = attr->qp_state; + + } else if (attr_mask & IB_QP_CUR_STATE) { + + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + wr.next_qp_state = + cpu_to_be32(to_c2_state(attr->cur_qp_state)); + + next_state = attr->cur_qp_state; + + } else { + err = 0; + goto bail0; + } + + /* reference the request struct */ + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + reply = (struct c2wr_qp_modify_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + if (!err) + qp->state = next_state; +#ifdef DEBUG + else + pr_debug("%s: c2_errno=%d\n", __FUNCTION__, err); +#endif + /* + * If we're going to error and generating the event here, then + * we need to remove the reference because there will be no + * close event generated by the adapter + */ + spin_lock_irqsave(&qp->lock, flags); + if (vq_req->event==IW_CM_EVENT_CLOSE && qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + + pr_debug("%s:%d qp=%p, cur_state=%s\n", + __FUNCTION__, __LINE__, + qp, + to_ib_state_str(qp->state)); + return err; +} + +int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp, + int ord, int ird) +{ + struct c2wr_qp_modify_req wr; + struct c2wr_qp_modify_rep *reply; + struct c2_vq_req *vq_req; + int err; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, 
CCWR_QP_MODIFY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + wr.ord = cpu_to_be32(ord); + wr.ird = cpu_to_be32(ird); + wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.next_qp_state = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + + /* reference the request struct */ + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + reply = (struct c2wr_qp_modify_rep *) (unsigned long) + vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +static int destroy_qp(struct c2_dev *c2dev, struct c2_qp *qp) +{ + struct c2_vq_req *vq_req; + struct c2wr_qp_destroy_req wr; + struct c2wr_qp_destroy_rep *reply; + unsigned long flags; + int err; + + /* + * Allocate a verb request message + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Initialize the WR + */ + c2_wr_set_id(&wr, CCWR_QP_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + + /* + * reference the request struct. dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id && qp->state == IB_QPS_RTS) { + pr_debug("destroy_qp: generating CLOSE event for QP-->ERR, " + "qp=%p, cm_id=%p\n",qp,qp->cm_id); + /* Generate a CLOSE event */ + vq_req->qp = qp; + vq_req->cm_id = qp->cm_id; + vq_req->event = IW_CM_EVENT_CLOSE; + } + spin_unlock_irqrestore(&qp->lock, flags); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_qp_destroy_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +static int c2_alloc_qpn(struct c2_dev *c2dev, struct c2_qp *qp) +{ + int ret; + + do { + spin_lock_irq(&c2dev->qp_table.lock); + ret = idr_get_new_above(&c2dev->qp_table.idr, qp, + c2dev->qp_table.last++, &qp->qpn); + spin_unlock_irq(&c2dev->qp_table.lock); + } while ((ret == -EAGAIN) && + idr_pre_get(&c2dev->qp_table.idr, GFP_KERNEL)); + return ret; +} + +static void c2_free_qpn(struct c2_dev *c2dev, int qpn) +{ + spin_lock_irq(&c2dev->qp_table.lock); + idr_remove(&c2dev->qp_table.idr, qpn); + spin_unlock_irq(&c2dev->qp_table.lock); +} + +struct c2_qp *c2_find_qpn(struct c2_dev *c2dev, int qpn) +{ + unsigned long flags; + struct c2_qp *qp; + + spin_lock_irqsave(&c2dev->qp_table.lock, flags); + qp = idr_find(&c2dev->qp_table.idr, qpn); + spin_unlock_irqrestore(&c2dev->qp_table.lock, flags); + return qp; +} + +int c2_alloc_qp(struct c2_dev *c2dev, + struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp) +{ + 
struct c2wr_qp_create_req wr; + struct c2wr_qp_create_rep *reply; + struct c2_vq_req *vq_req; + struct c2_cq *send_cq = to_c2cq(qp_attrs->send_cq); + struct c2_cq *recv_cq = to_c2cq(qp_attrs->recv_cq); + unsigned long peer_pa; + u32 q_size, msg_size, mmap_size; + void __iomem *mmap; + int err; + + err = c2_alloc_qpn(c2dev, qp); + if (err) + return err; + qp->ibqp.qp_num = qp->qpn; + qp->ibqp.qp_type = IB_QPT_RC; + + /* Allocate the SQ and RQ shared pointers */ + qp->sq_mq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &qp->sq_mq.shared_dma, GFP_KERNEL); + if (!qp->sq_mq.shared) { + err = -ENOMEM; + goto bail0; + } + + qp->rq_mq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &qp->rq_mq.shared_dma, GFP_KERNEL); + if (!qp->rq_mq.shared) { + err = -ENOMEM; + goto bail1; + } + + /* Allocate the verbs request */ + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + err = -ENOMEM; + goto bail2; + } + + /* Initialize the work request */ + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_QP_CREATE); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.sq_cq_handle = send_cq->adapter_handle; + wr.rq_cq_handle = recv_cq->adapter_handle; + wr.sq_depth = cpu_to_be32(qp_attrs->cap.max_send_wr + 1); + wr.rq_depth = cpu_to_be32(qp_attrs->cap.max_recv_wr + 1); + wr.srq_handle = 0; + wr.flags = cpu_to_be32(QP_RDMA_READ | QP_RDMA_WRITE | QP_MW_BIND | + QP_ZERO_STAG | QP_RDMA_READ_RESPONSE); + wr.send_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + wr.recv_sgl_depth = cpu_to_be32(qp_attrs->cap.max_recv_sge); + wr.rdma_write_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + wr.shared_sq_ht = cpu_to_be64(qp->sq_mq.shared_dma); + wr.shared_rq_ht = cpu_to_be64(qp->rq_mq.shared_dma); + wr.ord = cpu_to_be32(C2_MAX_ORD_PER_QP); + wr.ird = cpu_to_be32(C2_MAX_IRD_PER_QP); + wr.pd_id = pd->pd_id; + wr.user_context = (unsigned long) qp; + + vq_req_get(c2dev, vq_req); + + /* Send the WR to the adapter */ + err = 
vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail3; + } + + /* Wait for the verb reply */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail3; + } + + /* Process the reply */ + reply = (struct c2wr_qp_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail3; + } + + if ((err = c2_wr_get_result(reply)) != 0) { + goto bail4; + } + + /* Fill in the kernel QP struct */ + atomic_set(&qp->refcount, 1); + qp->adapter_handle = reply->qp_handle; + qp->state = IB_QPS_RESET; + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; + qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; + + /* Initialize the SQ MQ */ + q_size = be32_to_cpu(reply->sq_depth); + msg_size = be32_to_cpu(reply->sq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->sq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail5; + } + + c2_mq_req_init(&qp->sq_mq, + be32_to_cpu(reply->sq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + /* Initialize the RQ mq */ + q_size = be32_to_cpu(reply->rq_depth); + msg_size = be32_to_cpu(reply->rq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->rq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail6; + } + + c2_mq_req_init(&qp->rq_mq, + be32_to_cpu(reply->rq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + return 0; + + bail6: + iounmap(qp->sq_mq.peer); + bail5: + destroy_qp(c2dev, qp); + bail4: + vq_repbuf_free(c2dev, reply); + bail3: + 
vq_req_free(c2dev, vq_req); + bail2: + c2_free_mqsp(qp->rq_mq.shared); + bail1: + c2_free_mqsp(qp->sq_mq.shared); + bail0: + c2_free_qpn(c2dev, qp->qpn); + return err; +} + +void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp) +{ + struct c2_cq *send_cq; + struct c2_cq *recv_cq; + + send_cq = to_c2cq(qp->ibqp.send_cq); + recv_cq = to_c2cq(qp->ibqp.recv_cq); + + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. + */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + c2_free_qpn(c2dev, qp->qpn); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + + /* + * Destroy qp in the rnic... + */ + destroy_qp(c2dev, qp); + + /* + * Mark any unreaped CQEs as null and void. + */ + c2_cq_clean(c2dev, qp, send_cq->cqn); + if (send_cq != recv_cq) + c2_cq_clean(c2dev, qp, recv_cq->cqn); + /* + * Unmap the MQs and return the shared pointers + * to the message pool. + */ + iounmap(qp->sq_mq.peer); + iounmap(qp->rq_mq.peer); + c2_free_mqsp(qp->sq_mq.shared); + c2_free_mqsp(qp->rq_mq.shared); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); +} + +/* + * Function: move_sgl + * + * Description: + * Move an SGL from the user's work request struct into a CCIL Work Request + * message, swapping to WR byte order and ensuring the total length doesn't + * overflow. + * + * IN: + * dst - ptr to CCIL Work Request message SGL memory. + * src - ptr to the consumer's SGL memory. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int +move_sgl(struct c2_data_addr * dst, struct ib_sge *src, int count, u32 * p_len, + u8 * actual_count) +{ + u32 tot = 0; /* running total */ + u8 acount = 0; /* running total non-0 len sge's */ + + while (count > 0) { + /* + * If the addition of this SGE causes the + * total SGL length to exceed 2^32-1, then + * fail-n-bail. 
+ * + * If the current total plus the next element length + * wraps, then it will go negative and be less than the + * current total... + */ + if ((tot + src->length) < tot) { + return -EINVAL; + } + /* + * Bug: 1456 (as well as 1498 & 1643) + * Skip over any sge's supplied with len=0 + */ + if (src->length) { + tot += src->length; + dst->stag = cpu_to_be32(src->lkey); + dst->to = cpu_to_be64(src->addr); + dst->length = cpu_to_be32(src->length); + dst++; + acount++; + } + src++; + count--; + } + + if (acount == 0) { + /* + * Bug: 1476 (as well as 1498, 1456 and 1643) + * Setup the SGL in the WR to make it easier for the RNIC. + * This way, the FW doesn't have to deal with special cases. + * Setting length=0 should be sufficient. + */ + dst->stag = 0; + dst->to = 0; + dst->length = 0; + } + + *p_len = tot; + *actual_count = acount; + return 0; +} + +/* + * Function: c2_activity (private function) + * + * Description: + * Post an mq index to the host->adapter activity fifo. + * + * IN: + * c2dev - ptr to c2dev structure + * mq_index - mq index to post + * shared - value most recently written to shared + * + * OUT: + * + * Return: + * none + */ +static inline void c2_activity(struct c2_dev *c2dev, u32 mq_index, u16 shared) +{ + /* + * First read the register to see if the FIFO is full, and if so, + * spin until it's not. This isn't perfect -- there is no + * synchronization among the clients of the register, but in + * practice it prevents multiple CPUs from hammering the bus + * with PCI RETRY. Note that when this does happen, the card + * cannot get on the bus and the card and system hang in a + * deadlock -- thus the need for this code. 
+ */ + while (readl(c2dev->regs + PCI_BAR0_ADAPTER_HINT) & 0x80000000) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(0); + } + + __raw_writel(C2_HINT_MAKE(mq_index, shared), + c2dev->regs + PCI_BAR0_ADAPTER_HINT); +} + +/* + * Function: qp_wr_post + * + * Description: + * This in-line function allocates an MQ msg, then moves the host-copy of + * the completed WR into msg. Then it posts the message. + * + * IN: + * q - ptr to user MQ. + * wr - ptr to host-copy of the WR. + * qp - ptr to user qp + * size - Number of bytes to post. Assumed to be divisible by 4. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int qp_wr_post(struct c2_mq *q, union c2wr * wr, struct c2_qp *qp, u32 size) +{ + union c2wr *msg; + + msg = c2_mq_alloc(q); + if (msg == NULL) { + return -EINVAL; + } +#ifdef CCMSGMAGIC + ((c2wr_hdr_t *) wr)->magic = cpu_to_be32(CCWR_MAGIC); +#endif + + /* + * Since all header fields in the WR are the same as the + * CQE, set the following so the adapter need not. 
+ */ + c2_wr_set_result(wr, CCERR_PENDING); + + /* + * Copy the wr down to the adapter + */ + memcpy((void *) msg, (void *) wr, size); + + c2_mq_produce(q); + return 0; +} + + +int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + u32 flags; + u32 tot_len; + u8 actual_sge_count; + u32 msg_size; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + while (ib_wr) { + + flags = 0; + wr.sqwr.sq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + if (ib_wr->send_flags & IB_SEND_SIGNALED) { + flags |= SQ_SIGNALED; + } + + switch (ib_wr->opcode) { + case IB_WR_SEND: + if (ib_wr->send_flags & IB_SEND_SOLICITED) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); + msg_size = sizeof(struct c2wr_send_req); + } else { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND); + msg_size = sizeof(struct c2wr_send_req); + } + + wr.sqwr.send.remote_stag = 0; + msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; + if (ib_wr->num_sge > qp->send_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + err = move_sgl((struct c2_data_addr *) & (wr.sqwr.send.data), + ib_wr->sg_list, + ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.send.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_WRITE: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_WRITE); + msg_size = sizeof(struct c2wr_rdma_write_req) + + (sizeof(struct c2_data_addr) * ib_wr->num_sge); + if (ib_wr->num_sge > qp->rdma_write_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + wr.sqwr.rdma_write.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_write.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + err = move_sgl((struct c2_data_addr *) + & (wr.sqwr.rdma_write.data), + ib_wr->sg_list, + 
ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.rdma_write.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_READ: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_READ); + msg_size = sizeof(struct c2wr_rdma_read_req); + + /* iWARP only supports 1 SGE for RDMA reads */ + if (ib_wr->num_sge > 1) { + err = -EINVAL; + break; + } + + /* + * Move the local and remote stag/to/len into the WR. + */ + wr.sqwr.rdma_read.local_stag = + cpu_to_be32(ib_wr->sg_list->lkey); + wr.sqwr.rdma_read.local_to = + cpu_to_be64(ib_wr->sg_list->addr); + wr.sqwr.rdma_read.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_read.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + wr.sqwr.rdma_read.length = + cpu_to_be32(ib_wr->sg_list->length); + break; + default: + /* error */ + msg_size = 0; + err = -EINVAL; + break; + } + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... + */ + if (err) { + break; + } + + /* + * Store flags + */ + c2_wr_set_flags(&wr, flags); + + /* + * Post the puppy! + */ + err = qp_wr_post(&qp->sq_mq, &wr, qp, msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO. 
+ */ + c2_activity(c2dev, qp->sq_mq.index, qp->sq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + /* + * Try and post each work request + */ + while (ib_wr) { + u32 tot_len; + u8 actual_sge_count; + + if (ib_wr->num_sge > qp->recv_sgl_depth) { + err = -EINVAL; + break; + } + + /* + * Create local host-copy of the WR + */ + wr.rqwr.rq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + c2_wr_set_id(&wr, CCWR_RECV); + c2_wr_set_flags(&wr, 0); + + /* sge_count is limited to eight bits. */ + BUG_ON(ib_wr->num_sge >= 256); + err = move_sgl((struct c2_data_addr *) & (wr.rqwr.data), + ib_wr->sg_list, + ib_wr->num_sge, &tot_len, &actual_sge_count); + c2_wr_set_sge_count(&wr, actual_sge_count); + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... + */ + if (err) { + break; + } + + err = qp_wr_post(&qp->rq_mq, &wr, qp, qp->rq_mq.msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO + */ + c2_activity(c2dev, qp->rq_mq.index, qp->rq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +void __devinit c2_init_qp_table(struct c2_dev *c2dev) +{ + spin_lock_init(&c2dev->qp_table.lock); + idr_init(&c2dev->qp_table.idr); +} + +void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev) +{ + idr_destroy(&c2dev->qp_table.idr); +} diff --git a/drivers/infiniband/hw/amso1100/c2_user.h b/drivers/infiniband/hw/amso1100/c2_user.h new file mode 100644 index 0000000..7e9e7ad --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_user.h @@ -0,0 +1,82 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. 
+ * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef C2_USER_H +#define C2_USER_H + +#include <linux/types.h> + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. 
+ */ + +struct c2_alloc_ucontext_resp { + __u32 qp_tab_size; + __u32 uarc_size; +}; + +struct c2_alloc_pd_resp { + __u32 pdn; + __u32 reserved; +}; + +struct c2_create_cq { + __u32 lkey; + __u32 pdn; + __u64 arm_db_page; + __u64 set_db_page; + __u32 arm_db_index; + __u32 set_db_index; +}; + +struct c2_create_cq_resp { + __u32 cqn; + __u32 reserved; +}; + +struct c2_create_qp { + __u32 lkey; + __u32 reserved; + __u64 sq_db_page; + __u64 rq_db_page; + __u32 sq_db_index; + __u32 rq_db_index; +}; + +#endif /* C2_USER_H */ From swise at opengridcomputing.com Tue Jun 20 13:31:16 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:16 -0500 Subject: [openib-general] [PATCH v3 5/7] AMSO1100 Message Queues. In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203116.31536.27965.stgit@stevo-desktop> V2 Review Changes: - correctly map host memory for DMA (don't use __pa()). V1 Review Changes: - remove useless asserts - assert() -> BUG_ON() - C2_DEBUG -> DEBUG --- drivers/infiniband/hw/amso1100/c2_mq.c | 175 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_mq.h | 107 ++++++++++++++++++++ 2 files changed, 282 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_mq.c b/drivers/infiniband/hw/amso1100/c2_mq.c new file mode 100644 index 0000000..96bbe9a --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.c @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_mq.h" + +void *c2_mq_alloc(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (c2_mq_full(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = + (struct c2wr_hdr *) (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(~CCWR_MAGIC)); + m->magic = cpu_to_be32(CCWR_MAGIC); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_produce(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (!c2_mq_full(q)) { + q->priv = (q->priv + 1) % q->q_size; + q->hint_count++; + /* Update peer's offset. 
*/ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + +void *c2_mq_consume(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (c2_mq_empty(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = (struct c2wr_hdr *) + (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(CCWR_MAGIC)); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_free(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (!c2_mq_empty(q)) { + +#ifdef CCMSGMAGIC + { + struct c2wr_hdr __iomem *m = (struct c2wr_hdr __iomem *) + (q->msg_pool.adapter + q->priv * q->msg_size); + __raw_writel(cpu_to_be32(~CCWR_MAGIC), &m->magic); + } +#endif + q->priv = (q->priv + 1) % q->q_size; + /* Update peer's offset. */ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + + +void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + while (wqe_count--) { + BUG_ON(c2_mq_empty(q)); + *q->shared = cpu_to_be16((be16_to_cpu(*q->shared)+1) % q->q_size); + } +} + + +u32 c2_mq_count(struct c2_mq *q) +{ + s32 count; + + if (q->type == C2_MQ_HOST_TARGET) { + count = be16_to_cpu(*q->shared) - q->priv; + } else { + count = q->priv - be16_to_cpu(*q->shared); + } + + if (count < 0) { + count += q->q_size; + } + + return (u32) count; +} + +void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! 
*/ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.adapter = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} +void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! */ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.host = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mq.h b/drivers/infiniband/hw/amso1100/c2_mq.h new file mode 100644 index 0000000..9b1296e --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.h @@ -0,0 +1,107 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef _C2_MQ_H_ +#define _C2_MQ_H_ +#include +#include +#include "c2_wr.h" + +enum c2_shared_regs { + + C2_SHARED_ARMED = 0x10, + C2_SHARED_NOTIFY = 0x18, + C2_SHARED_SHARED = 0x40, +}; + +struct c2_mq_shared { + u16 unused1; + u8 armed; + u8 notification_type; + u32 unused2; + u16 shared; + /* Pad to 64 bytes. */ + u8 pad[64 - sizeof(u16) - 2 * sizeof(u8) - sizeof(u32) - sizeof(u16)]; +}; + +enum c2_mq_type { + C2_MQ_HOST_TARGET = 1, + C2_MQ_ADAPTER_TARGET = 2, +}; + +/* + * c2_mq_t is for kernel-mode MQs like the VQs and the AEQ. + * c2_user_mq_t (which is the same format) is for user-mode MQs... 
+ */ +#define C2_MQ_MAGIC 0x4d512020 /* 'MQ ' */ +struct c2_mq { + u32 magic; + union { + u8 *host; + u8 __iomem *adapter; + } msg_pool; + dma_addr_t host_dma; + DECLARE_PCI_UNMAP_ADDR(mapping); + u16 hint_count; + u16 priv; + struct c2_mq_shared __iomem *peer; + u16 *shared; + dma_addr_t shared_dma; + u32 q_size; + u32 msg_size; + u32 index; + enum c2_mq_type type; +}; + +static __inline__ int c2_mq_empty(struct c2_mq *q) +{ + return q->priv == be16_to_cpu(*q->shared); +} + +static __inline__ int c2_mq_full(struct c2_mq *q) +{ + return q->priv == (be16_to_cpu(*q->shared) + q->q_size - 1) % q->q_size; +} + +extern void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count); +extern void *c2_mq_alloc(struct c2_mq *q); +extern void c2_mq_produce(struct c2_mq *q); +extern void *c2_mq_consume(struct c2_mq *q); +extern void c2_mq_free(struct c2_mq *q); +extern u32 c2_mq_count(struct c2_mq *q); +extern void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type); +extern void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type); + +#endif /* _C2_MQ_H_ */ From swise at opengridcomputing.com Tue Jun 20 13:31:21 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:21 -0500 Subject: [openib-general] [PATCH v3 6/7] AMSO1100: Privileged Verbs Queues. 
In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203121.31536.73315.stgit@stevo-desktop> Review Changes: dprintk() -> pr_debug() --- drivers/infiniband/hw/amso1100/c2_vq.c | 260 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_vq.h | 63 ++++++++ 2 files changed, 323 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c new file mode 100644 index 0000000..445b1ed --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.c @@ -0,0 +1,260 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include "c2_vq.h" +#include "c2_provider.h" + +/* + * Verbs Request Objects: + * + * VQ Request Objects are allocated by the kernel verbs handlers. + * They contain a wait object, a refcnt, an atomic bool indicating that the + * adapter has replied, and a copy of the verb reply work request. + * A pointer to the VQ Request Object is passed down in the context + * field of the work request message, and reflected back by the adapter + * in the verbs reply message. The function handle_vq() in the interrupt + * path will use this pointer to: + * 1) append a copy of the verbs reply message + * 2) mark that the reply is ready + * 3) wake up the kernel verbs handler blocked awaiting the reply. + * + * + * The kernel verbs handlers do a "get" to put a 2nd reference on the + * VQ Request object. If the kernel verbs handler exits before the adapter + * can respond, this extra reference will keep the VQ Request object around + * until the adapter's reply can be processed. The reason we need this is + * because a pointer to this object is stuffed into the context field of + * the verbs work request message, and reflected back in the reply message. + * It is used in the interrupt handler (handle_vq()) to wake up the appropriate + * kernel verb handler that is blocked awaiting the verb reply. + * So handle_vq() will do a "put" on the object when it's done accessing it. + * NOTE: If we guarantee that the kernel verb handler will never bail before + * getting the reply, then we don't need these refcnts. + * + * + * VQ Request objects are freed by the kernel verbs handlers only + * after the verb has been processed, or when the adapter fails and + * does not reply. 
+ * + * + * Verbs Reply Buffers: + * + * VQ Reply bufs are local host memory copies of an + * outstanding Verb Request reply + * message. They are always allocated by the kernel verbs handlers, and _may_ be + * freed by either the kernel verbs handler -or- the interrupt handler. The + * kernel verbs handler _must_ free the repbuf, then free the vq request object + * in that order. + */ + +int vq_init(struct c2_dev *c2dev) +{ + sprintf(c2dev->vq_cache_name, "c2-vq:dev%c", + (char) ('0' + c2dev->devnum)); + c2dev->host_msg_cache = + kmem_cache_create(c2dev->vq_cache_name, c2dev->rep_vq.msg_size, 0, + SLAB_HWCACHE_ALIGN, NULL, NULL); + if (c2dev->host_msg_cache == NULL) { + return -ENOMEM; + } + return 0; +} + +void vq_term(struct c2_dev *c2dev) +{ + kmem_cache_destroy(c2dev->host_msg_cache); +} + +/* vq_req_alloc - allocate a VQ Request Object and initialize it. + * The refcnt is set to 1. + */ +struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev) +{ + struct c2_vq_req *r; + + r = kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL); + if (r) { + init_waitqueue_head(&r->wait_object); + r->reply_msg = (u64) NULL; + r->event = 0; + r->cm_id = NULL; + r->qp = NULL; + atomic_set(&r->refcnt, 1); + atomic_set(&r->reply_ready, 0); + } + return r; +} + + +/* vq_req_free - free the VQ Request Object. It is assumed the verbs handler + * has already freed the VQ Reply Buffer if it existed. + */ +void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + r->reply_msg = (u64) NULL; + if (atomic_dec_and_test(&r->refcnt)) { + kfree(r); + } +} + +/* vq_req_get - reference a VQ Request Object. Done + * only in the kernel verbs handlers. + */ +void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + atomic_inc(&r->refcnt); +} + + +/* vq_req_put - dereference and potentially free a VQ Request Object. + * + * This is only called by handle_vq() on the + * interrupt when it is done processing + * a verb reply message. 
If the associated + * kernel verbs handler has already bailed, + * then this put will actually free the VQ + * Request object _and_ the VQ Reply Buffer + * if it exists. + */ +void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + if (atomic_dec_and_test(&r->refcnt)) { + if (r->reply_msg != (u64) NULL) + vq_repbuf_free(c2dev, + (void *) (unsigned long) r->reply_msg); + kfree(r); + } +} + + +/* + * vq_repbuf_alloc - allocate a VQ Reply Buffer. + */ +void *vq_repbuf_alloc(struct c2_dev *c2dev) +{ + return kmem_cache_alloc(c2dev->host_msg_cache, SLAB_ATOMIC); +} + +/* + * vq_send_wr - post a verbs request message to the Verbs Request Queue. + * If a message is not available in the MQ, then block until one is available. + * NOTE: handle_mq() on the interrupt context will wake up threads blocked here. + * When the adapter drains the Verbs Request Queue, + * it inserts MQ index 0 into the + * adapter->host activity fifo and interrupts the host. + */ +int vq_send_wr(struct c2_dev *c2dev, union c2wr *wr) +{ + void *msg; + wait_queue_t __wait; + + /* + * grab adapter vq lock + */ + spin_lock(&c2dev->vqlock); + + /* + * allocate msg + */ + msg = c2_mq_alloc(&c2dev->req_vq); + + /* + * If we cannot get a msg, then we'll wait. + * When messages are available, the int handler will wake_up() + * any waiters. + */ + while (msg == NULL) { + pr_debug("%s:%d no available msg in VQ, waiting...\n", + __FUNCTION__, __LINE__); + init_waitqueue_entry(&__wait, current); + add_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_unlock(&c2dev->vqlock); + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + if (!c2_mq_full(&c2dev->req_vq)) { + break; + } + if (!signal_pending(current)) { + schedule_timeout(1 * HZ); /* 1 second... 
*/ + continue; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + return -EINTR; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_lock(&c2dev->vqlock); + msg = c2_mq_alloc(&c2dev->req_vq); + } + + /* + * copy wr into adapter msg + */ + memcpy(msg, wr, c2dev->req_vq.msg_size); + + /* + * post msg + */ + c2_mq_produce(&c2dev->req_vq); + + /* + * release adapter vq lock + */ + spin_unlock(&c2dev->vqlock); + return 0; +} + + +/* + * vq_wait_for_reply - block until the adapter posts a Verb Reply Message. + */ +int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req) +{ + if (!wait_event_timeout(req->wait_object, + atomic_read(&req->reply_ready), + 60*HZ)) + return -ETIMEDOUT; + + return 0; +} + +/* + * vq_repbuf_free - Free a Verbs Reply Buffer. + */ +void vq_repbuf_free(struct c2_dev *c2dev, void *reply) +{ + kmem_cache_free(c2dev->host_msg_cache, reply); +} diff --git a/drivers/infiniband/hw/amso1100/c2_vq.h b/drivers/infiniband/hw/amso1100/c2_vq.h new file mode 100644 index 0000000..3380562 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef _C2_VQ_H_
+#define _C2_VQ_H_
+#include
+#include "c2.h"
+#include "c2_wr.h"
+#include "c2_provider.h"
+
+struct c2_vq_req {
+	u64 reply_msg;			/* ptr to reply msg */
+	wait_queue_head_t wait_object;	/* wait object for vq reqs */
+	atomic_t reply_ready;		/* set when reply is ready */
+	atomic_t refcnt;		/* used to cancel WRs... */
+	int event;
+	struct iw_cm_id *cm_id;
+	struct c2_qp *qp;
+};
+
+extern int vq_init(struct c2_dev *c2dev);
+extern void vq_term(struct c2_dev *c2dev);
+
+extern struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev);
+extern void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *req);
+extern void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *req);
+extern void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *req);
+extern int vq_send_wr(struct c2_dev *c2dev, union c2wr * wr);
+
+extern void *vq_repbuf_alloc(struct c2_dev *c2dev);
+extern void vq_repbuf_free(struct c2_dev *c2dev, void *reply);
+
+extern int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req);
+#endif /* _C2_VQ_H_ */

From swise at opengridcomputing.com  Tue Jun 20 13:31:26 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 20 Jun 2006 15:31:26 -0500
Subject: [openib-general] [PATCH v3 7/7] AMSO1100 Makefiles and Kconfig changes.
In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop>
References: <20060620203050.31536.5341.stgit@stevo-desktop>
Message-ID: <20060620203126.31536.78501.stgit@stevo-desktop>

Review Changes:

- C2DEBUG -> DEBUG

---

 drivers/infiniband/Kconfig             |    1 +
 drivers/infiniband/Makefile            |    1 +
 drivers/infiniband/hw/amso1100/Kbuild  |   10 ++++++++++
 drivers/infiniband/hw/amso1100/Kconfig |   15 +++++++++++++++
 drivers/infiniband/hw/amso1100/README  |   11 +++++++++++
 5 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ba2d650..04e6d4f 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -36,6 +36,7 @@ config INFINIBAND_ADDR_TRANS
 source "drivers/infiniband/hw/mthca/Kconfig"
 source "drivers/infiniband/hw/ipath/Kconfig"
+source "drivers/infiniband/hw/amso1100/Kconfig"
 source "drivers/infiniband/ulp/ipoib/Kconfig"

diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index eea2732..e2b93f9 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_INFINIBAND) += core/
 obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/
 obj-$(CONFIG_IPATH_CORE) += hw/ipath/
+obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/
 obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/

diff --git a/drivers/infiniband/hw/amso1100/Kbuild b/drivers/infiniband/hw/amso1100/Kbuild
new file mode 100644
index 0000000..e1f10ab
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/Kbuild
@@ -0,0 +1,10 @@
+EXTRA_CFLAGS += -Idrivers/infiniband/include
+
+ifdef CONFIG_INFINIBAND_AMSO1100_DEBUG
+EXTRA_CFLAGS += -DDEBUG
+endif
+
+obj-$(CONFIG_INFINIBAND_AMSO1100) += iw_c2.o
+
+iw_c2-y := c2.o c2_provider.o c2_rnic.o c2_alloc.o c2_mq.o c2_ae.o c2_vq.o \
+	c2_intr.o c2_cq.o c2_qp.o c2_cm.o c2_mm.o c2_pd.o

diff --git a/drivers/infiniband/hw/amso1100/Kconfig b/drivers/infiniband/hw/amso1100/Kconfig
new file mode 100644
index 0000000..809cb14
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/Kconfig
@@ -0,0 +1,15 @@
+config INFINIBAND_AMSO1100
+	tristate "Ammasso 1100 HCA support"
+	depends on PCI && INET && INFINIBAND
+	---help---
+	  This is a low-level driver for the Ammasso 1100 host
+	  channel adapter (HCA).
+
+config INFINIBAND_AMSO1100_DEBUG
+	bool "Verbose debugging output"
+	depends on INFINIBAND_AMSO1100
+	default n
+	---help---
+	  This option causes the amso1100 driver to produce a bunch of
+	  debug messages. Select this if you are developing the driver
+	  or trying to diagnose a problem.

diff --git a/drivers/infiniband/hw/amso1100/README b/drivers/infiniband/hw/amso1100/README
new file mode 100644
index 0000000..1331353
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/README
@@ -0,0 +1,11 @@
+This is the OpenFabrics provider driver for the
+AMSO1100 1Gb RNIC adapter.
+
+This adapter is available in limited quantities
+for development purposes from Open Grid Computing.
+
+This driver requires the IWCM and CMA mods necessary
+to support iWARP.
+ +Contact tom at opengridcomputing.com for more information. + From arjan at infradead.org Tue Jun 20 13:43:46 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Tue, 20 Jun 2006 22:43:46 +0200 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <20060620203055.31536.15131.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> Message-ID: <1150836226.2891.231.camel@laptopd505.fenrus.org> On Tue, 2006-06-20 at 15:30 -0500, Steve Wise wrote: > +/* > + * Allocate TX ring elements and chain them together. > + * One-to-one association of adapter descriptors with ring elements. > + */ > +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, > + dma_addr_t base, void __iomem * mmio_txp_ring) > +{ > + struct c2_tx_desc *tx_desc; > + struct c2_txp_desc __iomem *txp_desc; > + struct c2_element *elem; > + int i; > + > + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL); I would think this needs a dma_alloc_coherent() rather than a kmalloc... > + > +/* Free all buffers in RX ring, assumes receiver stopped */ > +static void c2_rx_clean(struct c2_port *c2_port) > +{ > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_ring *rx_ring = &c2_port->rx_ring; > + struct c2_element *elem; > + struct c2_rx_desc *rx_desc; > + > + elem = rx_ring->start; > + do { > + rx_desc = elem->ht_desc; > + rx_desc->len = 0; > + > + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); > + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); > + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); you seem to be a fan of the __raw_write() functions... any reason why? __raw_ is not a magic "go faster" prefix.... Also on a related note, have you checked the driver for the needed PCI posting flushes? > + > + /* Disable IRQs by clearing the interrupt mask */ > + writel(1, c2dev->regs + C2_IDIS); > + writel(0, c2dev->regs + C2_NIMR0); like here... 
> + > + elem = tx_ring->to_use; > + elem->skb = skb; > + elem->mapaddr = mapaddr; > + elem->maplen = maplen; > + > + /* Tell HW to xmit */ > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); or here > +static int c2_change_mtu(struct net_device *netdev, int new_mtu) > +{ > + int ret = 0; > + > + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) > + return -EINVAL; > + > + netdev->mtu = new_mtu; > + > + if (netif_running(netdev)) { > + c2_down(netdev); > + > + c2_up(netdev); > + } this looks odd... From rdreier at cisco.com Tue Jun 20 15:27:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 Jun 2006 15:27:45 -0700 Subject: [openib-general] [PATCH v2 1/2] iWARP changes to libibverbs. In-Reply-To: <20060620200308.20092.76324.stgit@stevo-desktop> (Steve Wise's message of "Tue, 20 Jun 2006 15:03:08 -0500") References: <20060620200304.20092.44110.stgit@stevo-desktop> <20060620200308.20092.76324.stgit@stevo-desktop> Message-ID: Looks pretty good. I'll get this into the libibverbs development tree soon (I'm working on the MADV_DONTFORK stuff right now). - R. From sean.hefty at intel.com Tue Jun 20 16:27:05 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 Jun 2006 16:27:05 -0700 Subject: [openib-general] ib_gid lookup In-Reply-To: Message-ID: <000001c694c1$0c963f50$36781cac@amr.corp.intel.com> > i'm trying to find whether i can do a lookup of ib_gid by either >node name or node's ip address. is this information available from >the subnet manager? A lookup is done from IP address to GID using the address translation module (ib_addr). This functionality is exposed to userspace through the rdma_cm (resolve_addr routine). 
- Sean From eitan at mellanox.co.il Tue Jun 20 23:10:10 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 21 Jun 2006 09:10:10 +0300 Subject: [openib-general] [PATCHv5] osm: partition manager force policy In-Reply-To: <1150808795.4391.118133.camel@hal.voltaire.com> References: <86d5d5ge54.fsf@mtl066.yok.mtl.com> <1150808795.4391.118133.camel@hal.voltaire.com> Message-ID: <4498E2C2.2080906@mellanox.co.il> Hi Hal, Thanks for applying the patch. Regarding the issues: Hal Rosenstock wrote: >>+ >>+ CL_ASSERT( p_pkey_tbl ); > > > Should the other routines also assert on this or should this be > consistent with the others ? Yes, it should be consistent. Normally I add assertions on OUT parameters so that a "misuse" is caught. The idea is that parameters provided by reference are more likely to be passed by mistake as NULL. So I would remove the assert on p_pkey_tbl. > > >>+ CL_ASSERT( p_block_idx != NULL ); >>+ CL_ASSERT( p_pkey_idx != NULL ); > > > There is no p_pkey_idx parameter. I presume this should be p_pkey_index. Oops - this means the code will not compile in debug mode! I see you fixed that. > > > Also, two things about osm_pkey_mgr.c: > > Was there a need to reorder the routines ? This broke the diff so it had > to be done largely by hand. I reordered them to be defined in the order used. I already agreed with Sasha that I should have done that in a separate patch. > > Also, it would have been nice not to mix the format changes with the > substantive changes. Try to keep it to "one thought per patch". OK. > > This patch has been applied with cosmetic changes. We will go from > here... 
Thanks Eitan From halr at voltaire.com Wed Jun 21 03:53:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 06:53:48 -0400 Subject: [openib-general] [PATCHv5] osm: partition manager force policy In-Reply-To: <4498E2C2.2080906@mellanox.co.il> References: <86d5d5ge54.fsf@mtl066.yok.mtl.com> <1150808795.4391.118133.camel@hal.voltaire.com> <4498E2C2.2080906@mellanox.co.il> Message-ID: <1150887225.4391.167876.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-06-21 at 02:10, Eitan Zahavi wrote: > Hi Hal, > > Thanks for applying the patch. > > Regarding the issues : > > Hal Rosenstock wrote: > >>+ > >>+ CL_ASSERT( p_pkey_tbl ); > > > > > > Should the other routines also assert on this or should this be > > consistent with the others ? > Yes it should b consistent. > Normally I add assertion on OUT parameters such that a "misuse" is caught. > The idea is that parameters provided by reference are more likely to be passed > by mistake as NULL. > So I would remove the assert on p_key_tbl. p_pkey_tbl is a pointer so wouldn't that rule apply ? I do notice that in the particular usage in osm_pkey_mgr.c it would already get caught by the assert in osm_physp_get_mod_pkey_tbl. > >>+ CL_ASSERT( p_block_idx != NULL ); > >>+ CL_ASSERT( p_pkey_idx != NULL ); > > > > > > There is no p_pkey_idx parameter. I presume this should be p_pkey_index. > Ooops - this means the code will not compile in debug mode ! > I see you fixed that. > > > > > > Also, two things about osm_pkey_mgr.c: > > > > Was there a need to reorder the routines ? This broke the diff so it had > > to be done largely by hand. > I reordered to to be defined in the order used. > Already agree with Sasha that I should have done that on separate patch. I thought since the patch was reissued several times this comment would have been addressed. -- Hal > > Also, it would have been nice not to mix the format changes with the > > substantive changes. Try to keep it to "one thought per patch". > OK. 
> > > > This patch has been applied with cosmetic changes. We will go from > > here... > Thanks > > Eitan From halr at voltaire.com Wed Jun 21 04:00:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 07:00:10 -0400 Subject: [openib-general] [PATCH] osm: add release notes to doc dir In-Reply-To: <86ac87gdxy.fsf@mtl066.yok.mtl.com> References: <86ac87gdxy.fsf@mtl066.yok.mtl.com> Message-ID: <1150887610.4391.168097.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-06-21 at 03:34, Eitan Zahavi wrote: > Hi Hal > > Following the OFED 1.0 release I think it will be very handy to > user to see OpenSM release notes accumulate in the doc dir. Sure; that seems reasonable. > As release notes always refer back to old releases - by not specifying > all features just new ones - I propose a file naming scheme that is > based on the release date. I think a naming scheme based on the OpenSM version is clearer. So the OFED 1.0 release was openib-1.2.1. > This patch adds the two latest releases notes. > One from Jan 2006 and one from Jun 2006. What was the Jan 2006 release ? -- Hal From sashak at voltaire.com Wed Jun 21 05:49:27 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jun 2006 15:49:27 +0300 Subject: [openib-general] [PATCH] opensm: add check for not intialized pkey blocks Message-ID: <20060621124927.GA24726@sashak.voltaire.com> Hi Hal, The lower block of pkey tables' 'blocks' vector may be not initialized due to lost MADs. We need to check it for NULL. Some duplicated code removal as well. 
Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_pkey.c | 33 +++++++++++---------------------- 1 files changed, 11 insertions(+), 22 deletions(-) diff --git a/osm/opensm/osm_pkey.c b/osm/opensm/osm_pkey.c index 8166c90..caefe18 100644 --- a/osm/opensm/osm_pkey.c +++ b/osm/opensm/osm_pkey.c @@ -76,16 +76,19 @@ void osm_pkey_tbl_construct( void osm_pkey_tbl_destroy( IN osm_pkey_tbl_t *p_pkey_tbl) { + ib_pkey_table_t *p_block; uint16_t num_blocks, i; num_blocks = (uint16_t)(cl_ptr_vector_get_size( &p_pkey_tbl->blocks )); for (i = 0; i < num_blocks; i++) - free(cl_ptr_vector_get( &p_pkey_tbl->blocks, i )); + if ((p_block = cl_ptr_vector_get( &p_pkey_tbl->blocks, i ))) + free(p_block); cl_ptr_vector_destroy( &p_pkey_tbl->blocks ); num_blocks = (uint16_t)(cl_ptr_vector_get_size( &p_pkey_tbl->new_blocks )); for (i = 0; i < num_blocks; i++) - free(cl_ptr_vector_get( &p_pkey_tbl->new_blocks, i )); + if ((p_block = cl_ptr_vector_get( &p_pkey_tbl->new_blocks, i ))) + free(p_block); cl_ptr_vector_destroy( &p_pkey_tbl->new_blocks ); cl_map_remove_all( &p_pkey_tbl->keys ); @@ -112,26 +115,12 @@ osm_pkey_tbl_init( void osm_pkey_tbl_init_new_blocks( IN const osm_pkey_tbl_t *p_pkey_tbl) { - ib_pkey_table_t *p_block, *p_new_block; - int16_t b, num_blocks, new_blocks; + ib_pkey_table_t *p_block; + int16_t b, num_blocks = cl_ptr_vector_get_size(&p_pkey_tbl->new_blocks); - num_blocks = cl_ptr_vector_get_size(&p_pkey_tbl->blocks); - new_blocks = cl_ptr_vector_get_size(&p_pkey_tbl->new_blocks); - - for (b = 0; b < num_blocks; b++) { - p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); - if ( b < new_blocks ) - p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); - else - { - p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); - if (!p_new_block) - break; - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, - b, p_new_block); - } - memset(p_new_block, 0, sizeof(*p_new_block)); - } + for (b = 0; b < num_blocks; b++) + if ((p_block = 
cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b))) + memset(p_block, 0, sizeof(*p_block)); } /********************************************************************** @@ -296,7 +285,7 @@ osm_pkey_find_next_free_entry( OUT uint8_t *p_pkey_idx) { ib_pkey_table_t *p_new_block; - + CL_ASSERT(p_block_idx); CL_ASSERT(p_pkey_idx); From sashak at voltaire.com Wed Jun 21 06:52:38 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jun 2006 16:52:38 +0300 Subject: [openib-general] [PATCH] opensm: osm_pkey_tbl_make_block_pair() removal Message-ID: <20060621135238.GB24726@sashak.voltaire.com> Since 'blocks' pkey vector is updated only by receiver, remove it from osm_pkey_tbl_set_new_entry(), as well as osm_pkey_tbl_make_block_pair(). Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_pkey.h | 35 ------------------------ osm/opensm/osm_pkey.c | 59 ++++++++--------------------------------- 2 files changed, 11 insertions(+), 83 deletions(-) diff --git a/osm/include/opensm/osm_pkey.h b/osm/include/opensm/osm_pkey.h index a353ad0..44e932d 100644 --- a/osm/include/opensm/osm_pkey.h +++ b/osm/include/opensm/osm_pkey.h @@ -296,41 +296,6 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ -/****f* OpenSM: osm_pkey_tbl_make_block_pair -* NAME -* osm_pkey_tbl_make_block_pair -* -* DESCRIPTION -* Find or create a pair of "old" and "new" blocks for the -* given block index -* -* SYNOPSIS -*/ -ib_api_status_t -osm_pkey_tbl_make_block_pair( - osm_pkey_tbl_t *p_pkey_tbl, - uint16_t block_idx, - ib_pkey_table_t **pp_old_block, - ib_pkey_table_t **pp_new_block); -/* -* p_pkey_tbl -* [in] Pointer to the PKey table -* -* block_idx -* [in] The block index to use -* -* pp_old_block -* [out] Pointer to the old block pointer arg -* -* pp_new_block -* [out] Pointer to the new block pointer arg -* -* RETURN VALUES -* IB_SUCCESS if OK -* IB_ERROR if failed -* -*********/ - /****f* OpenSM: osm_pkey_tbl_set_new_entry * NAME * osm_pkey_tbl_set_new_entry diff --git 
a/osm/opensm/osm_pkey.c b/osm/opensm/osm_pkey.c index caefe18..2937ac8 100644 --- a/osm/opensm/osm_pkey.c +++ b/osm/opensm/osm_pkey.c @@ -211,46 +211,6 @@ osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ -ib_api_status_t -osm_pkey_tbl_make_block_pair( - osm_pkey_tbl_t *p_pkey_tbl, - uint16_t block_idx, - ib_pkey_table_t **pp_old_block, - ib_pkey_table_t **pp_new_block) -{ - if (block_idx >= p_pkey_tbl->max_blocks) - return(IB_ERROR); - - if (pp_old_block) - { - *pp_old_block = osm_pkey_tbl_block_get(p_pkey_tbl, block_idx); - if (! *pp_old_block) - { - *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); - if (!*pp_old_block) - return(IB_ERROR); - memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); - cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); - } - } - - if (pp_new_block) - { - *pp_new_block = osm_pkey_tbl_new_block_get(p_pkey_tbl, block_idx); - if (! *pp_new_block) - { - *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); - if (!*pp_new_block) - return(IB_ERROR); - memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); - cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); - } - } - return(IB_SUCCESS); -} - -/********************************************************************** - **********************************************************************/ /* store the given pkey in the "new" blocks array also makes sure the regular block exists. 
@@ -262,14 +222,17 @@ osm_pkey_tbl_set_new_entry( IN uint8_t pkey_idx, IN uint16_t pkey) { - ib_pkey_table_t *p_old_block; - ib_pkey_table_t *p_new_block; - - if (osm_pkey_tbl_make_block_pair(p_pkey_tbl, block_idx, &p_old_block, - &p_new_block)) - return(IB_ERROR); - - p_new_block->pkey_entry[pkey_idx] = pkey; + ib_pkey_table_t *p_block; + + if (!(p_block = osm_pkey_tbl_new_block_get(p_pkey_tbl, block_idx))) { + p_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!p_block) + return(IB_ERROR); + memset(p_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, p_block); + } + + p_block->pkey_entry[pkey_idx] = pkey; if (p_pkey_tbl->used_blocks <= block_idx) p_pkey_tbl->used_blocks = block_idx + 1; From halr at voltaire.com Wed Jun 21 07:02:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 10:02:15 -0400 Subject: [openib-general] [PATCH] opensm: add check for not intialized pkey blocks In-Reply-To: <20060621124927.GA24726@sashak.voltaire.com> References: <20060621124927.GA24726@sashak.voltaire.com> Message-ID: <1150898531.4391.174435.camel@hal.voltaire.com> Hi Sasha, On Wed, 2006-06-21 at 08:49, Sasha Khapyorsky wrote: > Hi Hal, > > The lower block of pkey tables' 'blocks' vector may be not initialized > due to lost MADs. We need to check it for NULL. Some duplicated code > removal as well. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From swise at opengridcomputing.com Wed Jun 21 07:12:46 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 21 Jun 2006 09:12:46 -0500 Subject: [openib-general] [librdmacm] check return value in operations of rping (as an attachment) In-Reply-To: <200606201853.15272.dotanb@mellanox.co.il> References: <200606201853.15272.dotanb@mellanox.co.il> Message-ID: <1150899166.11051.7.camel@stevo-desktop> Thanks, Committed with minor change to always ack events before exiting. trunk: revision 8159 iwarp branch: revision 8160 Steve. 
On Tue, 2006-06-20 at 18:53 +0300, Dotan Barak wrote: > Added checks to the return values of all of the functions that may > fail > (in order to add this test to the regression system). > > Signed-off-by: Dotan Barak > From mamidala at cse.ohio-state.edu Wed Jun 21 07:07:04 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Wed, 21 Jun 2006 10:07:04 -0400 (EDT) Subject: [openib-general] [librdmacm] compile error In-Reply-To: <1150818556.22519.28.camel@stevo-desktop> Message-ID: Hi, I have a quick question. I updated the infiniband kernel modules and while compiling them got this error. (Redhat AS 4, linux-2.6.16.20 on IA-32 platform ) drivers/infiniband/ulp/ipoib/ipoib_main.c: In function `ipoib_neigh_setup_dev': drivers/infiniband/ulp/ipoib/ipoib_main.c:794: error: structure has no member named `neigh_destructor' make[3]: *** [drivers/infiniband/ulp/ipoib/ipoib_main.o] Error 1 make[2]: *** [drivers/infiniband/ulp/ipoib] Error 2 make[1]: *** [drivers/infiniband] Error 2 make: *** [drivers] Error 2 Do I need to update any other module? Thanks, Amith From halr at voltaire.com Wed Jun 21 07:21:44 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 10:21:44 -0400 Subject: [openib-general] [librdmacm] compile error In-Reply-To: References: Message-ID: <1150899700.4391.175085.camel@hal.voltaire.com> Hi Amith, On Wed, 2006-06-21 at 10:07, amith rajith mamidala wrote: > Hi, > > I have a quick question. I updated the infiniband kernel modules and while > compiling them got this error. 
> (Redhat AS 4, linux-2.6.16.20 on IA-32 platform ) > > drivers/infiniband/ulp/ipoib/ipoib_main.c: In function > `ipoib_neigh_setup_dev': > drivers/infiniband/ulp/ipoib/ipoib_main.c:794: error: structure has no > member named `neigh_destructor' > make[3]: *** [drivers/infiniband/ulp/ipoib/ipoib_main.o] Error 1 > make[2]: *** [drivers/infiniband/ulp/ipoib] Error 2 > make[1]: *** [drivers/infiniband] Error 2 > make: *** [drivers] Error 2 > > Do I need to update any other module? This is due to the backwards compatibility cruft for pre-2.6.17 kernels from IPoIB being removed at r8111. You will need to revert that change. -- Hal > > Thanks, > Amith > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Wed Jun 21 07:27:51 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 21 Jun 2006 17:27:51 +0300 Subject: [openib-general] [PATCH] osm: add release notes to doc dir Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023688AD@mtlexch01.mtl.com> > > This patch adds the two latest releases notes. > > One from Jan 2006 and one from Jun 2006. > > What was the Jan 2006 release ? [EZ] That was the first gen2 distribution. > > -- Hal From halr at voltaire.com Wed Jun 21 07:33:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 10:33:08 -0400 Subject: [openib-general] [PATCH] osm: add release notes to doc dir In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3023688AD@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3023688AD@mtlexch01.mtl.com> Message-ID: <1150900386.4391.175519.camel@hal.voltaire.com> On Wed, 2006-06-21 at 10:27, Eitan Zahavi wrote: > > > This patch adds the two latest releases notes. > > > One from Jan 2006 and one from Jun 2006. > > > > What was the Jan 2006 release ? 
> [EZ] That was the first gen2 distribution. As I mentioned in the previous email on this, my preference would be to name the release notes by that version string. Any objections ? What was the OpenSM version string for this ? -- Hal > > > > > -- Hal From eitan at mellanox.co.il Wed Jun 21 07:56:26 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 21 Jun 2006 17:56:26 +0300 Subject: [openib-general] [PATCH] osm: add release notes to doc dir Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023688B2@mtlexch01.mtl.com> > > As I mentioned in the previous email on this, my preference would be to > name the release notes by that version string. Any objections ? > > What was the OpenSM version string for this ? [EZ] It was 2.0.1 > From paul.lundin at gmail.com Wed Jun 21 08:01:07 2006 From: paul.lundin at gmail.com (Paul) Date: Wed, 21 Jun 2006 11:01:07 -0400 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: <4497B634.2070704@mellanox.co.il> References: <1150324203.10676.17.camel@chalcedony.pathscale.com> <4497B634.2070704@mellanox.co.il> Message-ID: Tziporet, Thanks. I also opened a bug on this. Bug # 142. Regards. On 6/20/06, Tziporet Koren wrote: > > Paul wrote: > > Michael, > > I performed the same work-around in bash (not so good with perl > > these days) it gets past the prior point. Thanks. Should something > > that takes care of this be included in the build.sh or build_env.sh > > scripts ? We would certainly need it covered in the docs at least. > > > > Now the build is dying on some undefined references. (log attached) > > > > Regards. > > > > I will ask Vlad to look into it. > > Tziporet > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jlentini at netapp.com Wed Jun 21 09:21:45 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 21 Jun 2006 12:21:45 -0400 (EDT) Subject: [openib-general] ib_gid lookup In-Reply-To: <1150822037.4391.126581.camel@hal.voltaire.com> References: <1150798111.4391.111384.camel@hal.voltaire.com> <1150822037.4391.126581.camel@hal.voltaire.com> Message-ID: On Tue, 20 Jun 2006, Hal Rosenstock wrote: > > > The SM does not know the IP addresses unless they are registered > > > by DAPL (via ServiceRecords) but I'm not sure that is done > > > anymore or whether DAPL runs in your environment. > > > > > > > if i run DAPL in my environment will it work or this is already > > made obsolete? > > I don't know. James or maybe Arlin would be the ones to answer. You > could also look at the code to figure this out. DAPL used to use the Address Translation Service (ATS) to map between IP addresses to GIDs. It now uses IPoIB for this purpose. You could use IPoIB to determine a node's GID using a IP or install the (unsupported?) ATS software on your systems (https://openib.org/svn/gen2/branches/ibat/). From swise at opengridcomputing.com Wed Jun 21 09:32:51 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 21 Jun 2006 11:32:51 -0500 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1150836226.2891.231.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> Message-ID: <1150907571.31600.31.camel@stevo-desktop> On Tue, 2006-06-20 at 22:43 +0200, Arjan van de Ven wrote: > On Tue, 2006-06-20 at 15:30 -0500, Steve Wise wrote: > > > +/* > > + * Allocate TX ring elements and chain them together. > > + * One-to-one association of adapter descriptors with ring elements. 
> > + */ > > +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, > > + dma_addr_t base, void __iomem * mmio_txp_ring) > > +{ > > + struct c2_tx_desc *tx_desc; > > + struct c2_txp_desc __iomem *txp_desc; > > + struct c2_element *elem; > > + int i; > > + > > + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL); > > I would think this needs a dma_alloc_coherent() rather than a kmalloc... > No, this memory is used to describe the tx ring from the host's perspective. The HW never touches this memory. The HW's TX descriptor ring is in adapter memory and is mapped into host memory (see c2dev->mmio_txp_ring). > > > + > > +/* Free all buffers in RX ring, assumes receiver stopped */ > > +static void c2_rx_clean(struct c2_port *c2_port) > > +{ > > + struct c2_dev *c2dev = c2_port->c2dev; > > + struct c2_ring *rx_ring = &c2_port->rx_ring; > > + struct c2_element *elem; > > + struct c2_rx_desc *rx_desc; > > + > > + elem = rx_ring->start; > > + do { > > + rx_desc = elem->ht_desc; > > + rx_desc->len = 0; > > + > > + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); > > + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); > > + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); > > you seem to be a fan of the __raw_write() functions... any reason why? > __raw_ is not a magic "go faster" prefix.... > In this particular case, I believe this is done to avoid a swap of '0' since its not necessary. In other places, __raw is used because the adapter needs the data in BE and we want to explicitly swap it using cpu_to_be* then raw_write it to the adapter memory... > Also on a related note, have you checked the driver for the needed PCI > posting flushes? > Um, what's a 'PCI posting flush'? Can you point me where its described/used so I can see if we need it? Thanx. > > + > > + /* Disable IRQs by clearing the interrupt mask */ > > + writel(1, c2dev->regs + C2_IDIS); > > + writel(0, c2dev->regs + C2_NIMR0); > > like here... 
> > + > > + elem = tx_ring->to_use; > > + elem->skb = skb; > > + elem->mapaddr = mapaddr; > > + elem->maplen = maplen; > > + > > + /* Tell HW to xmit */ > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > or here > > > +static int c2_change_mtu(struct net_device *netdev, int new_mtu) > > +{ > > + int ret = 0; > > + > > + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) > > + return -EINVAL; > > + > > + netdev->mtu = new_mtu; > > + > > + if (netif_running(netdev)) { > > + c2_down(netdev); > > + > > + c2_up(netdev); > > + } > > this looks odd... > The 1100 hardware caches the dma address of the next skb that will be used to place data. When the MTU changes, we want to free the SKBs in the RX descriptor ring and get new ones that sufficient for the new MTU. To effectively flush that cached address of the old skb, we must quiesce the HW and firmware (via c2_down()), then reinitialize everything with skb's big enough for the new mtu. Steve. 
From arjan at infradead.org Wed Jun 21 10:13:33 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Wed, 21 Jun 2006 19:13:33 +0200 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1150907571.31600.31.camel@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1150907571.31600.31.camel@stevo-desktop> Message-ID: <1150910013.3057.59.camel@laptopd505.fenrus.org> > 0; > > > + > > > + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); > > > + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); > > > + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); > > > > you seem to be a fan of the __raw_write() functions... any reason why? > > __raw_ is not a magic "go faster" prefix.... 
> In this particular case, I believe this is done to avoid a swap of '0'
> since it's not necessary.

but.. shouldn't writew() and co just autodetect that (or do it at compile time)... (maybe it doesn't and we have an optimization opportunity here ;)

> > Also on a related note, have you checked the driver for the needed PCI
> > posting flushes?
>
> Um, what's a 'PCI posting flush'? Can you point me to where it's
> described/used so I can see if we need it? Thanx.

ok pci posting...

basically, if you use writel() and co, the PCI bridges in the middle are allowed to (and the more fancy ones do) cache the write, to see if more writes follow, so that the bridge can do the writes as a single burst to the device, rather than as individual writes. This is of course great... ... except when you really want the write to hit the device before the driver continues with other actions.

Now the PCI spec is set up such that any traffic in the other direction (basically readl() and co) will first flush the write through the system before the read is actually sent to the device, so doing a dummy readl() is a good way to flush any pending posted writes.

Where does this matter? It matters most at places such as irq enabling/disabling, IO submission and possibly IRQ acking, but also often in eeprom-like read/write logic (where you do manual clocking and need to do delays between the write()'s). But in general... any place where you do writel() without doing any readl() before doing nothing to the card for a long time, or where you are waiting for the card to do something (or want it done NOW, such as IRQ disabling), you need to issue a (dummy) readl() to flush pending writes out to the hardware.

does this explanation make any sense? if not please feel free to ask any questions, I know I'm not always very good at explaining things.
Greetings,
   Arjan van de Ven

From iod00d at hp.com Wed Jun 21 10:37:11 2006
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 21 Jun 2006 10:37:11 -0700
Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver.
In-Reply-To: <1150907571.31600.31.camel@stevo-desktop>
References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1150907571.31600.31.camel@stevo-desktop>
Message-ID: <20060621173711.GG26637@esmail.cup.hp.com>

On Wed, Jun 21, 2006 at 11:32:51AM -0500, Steve Wise wrote:
> Um, what's a 'PCI posting flush'? Can you point me where its
> described/used so I can see if we need it? Thanx.

I've written this up before:
http://iou.parisc-linux.org/ols_2002/4Posted_vs_Non_Posted.html

grant

From swise at opengridcomputing.com Wed Jun 21 11:47:45 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 21 Jun 2006 13:47:45 -0500
Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver.
In-Reply-To: <1150910013.3057.59.camel@laptopd505.fenrus.org>
References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1150907571.31600.31.camel@stevo-desktop> <1150910013.3057.59.camel@laptopd505.fenrus.org>
Message-ID: <1150915665.20327.0.camel@stevo-desktop>

> ok pci posting...
>
> basically, if you use writel() and co, the PCI bridges in the middle are
> allowed (and the more fancy ones do) cache the write, to see if more
> writes follow, so that the bridge can do the writes as a single burst to
> the device, rather than as individual writes. This is of course great...
> ... except when you really want the write to hit the device before the
> driver continues with other actions.
> Now the PCI spec is set up such that any traffic in the other direction
> (basically readl() and co) will first flush the write through the system
> before the read is actually sent to the device, so doing a dummy readl()
> is a good way to flush any pending posted writes.
>
> Where does this matter?
> it matters most at places such as irq enabling/disabling, IO submission
> and possibly IRQ acking, but also often in eeprom-like read/write logic
> (where you do manual clocking and need to do delays between the
> write()'s). But in general... any place where you do writel() without
> doing any readl() before doing nothing to the card for a long time, or
> where you are waiting for the card to do something (or want it done NOW,
> such as IRQ disabling) you need to issue a (dummy) readl() to flush
> pending writes out to the hardware.
>
> does this explanation make any sense? if not please feel free to ask any
> questions, I know I'm not always very good at explaining things.

Yep. I get it. I believe we're OK in this respect, but I'll review the code again with an eye for this issue...

Steve.

From halr at voltaire.com Wed Jun 21 11:46:00 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Jun 2006 14:46:00 -0400
Subject: [openib-general] [PATCH] osm: add release notes to doc dir
In-Reply-To: <86ac87gdxy.fsf@mtl066.yok.mtl.com>
References: <86ac87gdxy.fsf@mtl066.yok.mtl.com>
Message-ID: <1150915557.4391.184538.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2006-06-21 at 03:34, Eitan Zahavi wrote:
> Hi Hal
>
> Following the OFED 1.0 release I think it will be very handy for
> users to see OpenSM release notes accumulate in the doc dir.
>
> As release notes always refer back to old releases - by not specifying
> all features just new ones - I propose a file naming scheme that is
> based on the release date.
>
> This patch adds the two latest release notes:
> one from Jan 2006 and one from Jun 2006.
>
> Eitan
>
> Signed-off-by: Eitan Zahavi

Applied as opensm_release_notes_openib-1.2.1.txt and opensm_release_notes_ibg2-2.0.1.txt

-- Hal

From swise at opengridcomputing.com Wed Jun 21 12:48:16 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 21 Jun 2006 14:48:16 -0500
Subject: [openib-general] [PATCH 0/2][RFC] Network Event Notifier Mechanism
Message-ID: <20060621194816.4507.4090.stgit@stevo-desktop>

This patch implements a mechanism that allows interested clients to register for notification of certain network events. The intended use is to allow RDMA devices (linux/drivers/infiniband) to be notified of neighbour updates, ICMP redirects, path MTU changes, and route changes. These devices need update events because they typically cache this information in hardware and must be told when it has been updated.

This approach is one of many possibilities and may be preferred because it uses an existing notification mechanism that has precedent in the stack. An alternative would be to add a netdev method to notify affected devices of these events.

This code does not yet implement path MTU change events because the number of places in which this value is updated is large; if this mechanism seems reasonable, it would probably be best to funnel these updates through a single function.

We would like to get this or similar functionality included in 2.6.19 and request comments. This patchset consists of 2 patches:

1) New files implementing the Network Event Notifier
2) Core network changes to generate network event notifications

Signed-off-by: Tom Tucker
Signed-off-by: Steve Wise

From swise at opengridcomputing.com Wed Jun 21 12:48:21 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 21 Jun 2006 14:48:21 -0500
Subject: [openib-general] [PATCH 2/2] Core network changes to support network event notification.
In-Reply-To: <20060621194816.4507.4090.stgit@stevo-desktop> References: <20060621194816.4507.4090.stgit@stevo-desktop> Message-ID: <20060621194821.4507.70124.stgit@stevo-desktop> This patch adds event calls for neighbour change, route update, and routing redirect events. TODO: PMTU change events. --- net/core/Makefile | 2 +- net/core/neighbour.c | 8 ++++++++ net/ipv4/fib_semantics.c | 7 +++++++ net/ipv4/route.c | 6 ++++++ 4 files changed, 22 insertions(+), 1 deletions(-) diff --git a/net/core/Makefile b/net/core/Makefile index e9bd246..2645ba4 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -7,7 +7,7 @@ obj-y := sock.o request_sock.o skbuff.o obj-$(CONFIG_SYSCTL) += sysctl_net_core.o -obj-y += dev.o ethtool.o dev_mcast.o dst.o \ +obj-y += dev.o ethtool.o dev_mcast.o dst.o netevent.o \ neighbour.o rtnetlink.o utils.o link_watch.o filter.o obj-$(CONFIG_XFRM) += flow.o diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 50a8c73..c637897 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -30,9 +30,11 @@ #include #include #include #include +#include #include #include #include +#include #define NEIGH_DEBUG 1 @@ -755,6 +757,7 @@ #endif neigh->nud_state = NUD_STALE; neigh->updated = jiffies; neigh_suspect(neigh); + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); } } else if (state & NUD_DELAY) { if (time_before_eq(now, @@ -763,6 +766,7 @@ #endif neigh->nud_state = NUD_REACHABLE; neigh->updated = jiffies; neigh_connect(neigh); + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); next = neigh->confirmed + neigh->parms->reachable_time; } else { NEIGH_PRINTK2("neigh %p is probed.\n", neigh); @@ -783,6 +787,7 @@ #endif neigh->nud_state = NUD_FAILED; neigh->updated = jiffies; notify = 1; + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); NEIGH_CACHE_STAT_INC(neigh->tbl, res_failed); NEIGH_PRINTK2("neigh %p is failed.\n", neigh); @@ -1056,6 +1061,9 @@ out: (neigh->flags | NTF_ROUTER) : (neigh->flags & ~NTF_ROUTER); } + + 
call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); + write_unlock_bh(&neigh->lock); #ifdef CONFIG_ARPD if (notify && neigh->parms->app_probes) diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 0f4145b..67a30af 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -45,6 +45,7 @@ #include #include #include #include +#include #include "fib_lookup.h" @@ -278,9 +279,15 @@ void rtmsg_fib(int event, u32 key, struc struct nlmsghdr *n, struct netlink_skb_parms *req) { struct sk_buff *skb; + struct netevent_route_change rev; + u32 pid = req ? req->pid : n->nlmsg_pid; int size = NLMSG_SPACE(sizeof(struct rtmsg)+256); + rev.event = event; + rev.fib_info = fa->fa_info; + call_netevent_notifiers(NETEVENT_ROUTE_UPDATE, &rev); + skb = alloc_skb(size, GFP_KERNEL); if (!skb) return; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index cc9423d..e9ba831 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -105,6 +105,7 @@ #include #include #include #include +#include #ifdef CONFIG_SYSCTL #include #endif @@ -1120,6 +1121,7 @@ void ip_rt_redirect(u32 old_gw, u32 dadd struct rtable *rth, **rthp; u32 skeys[2] = { saddr, 0 }; int ikeys[2] = { dev->ifindex, 0 }; + struct netevent_redirect netevent; if (!in_dev) return; @@ -1211,6 +1213,10 @@ void ip_rt_redirect(u32 old_gw, u32 dadd rt_drop(rt); goto do_next; } + + netevent.old = &rth->u.dst; + netevent.new = &rt->u.dst; + call_netevent_notifiers(NETEVENT_REDIRECT, &netevent); rt_del(hash, rth); if (!rt_intern_hash(hash, rt, &rt)) From swise at opengridcomputing.com Wed Jun 21 12:48:19 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 21 Jun 2006 14:48:19 -0500 Subject: [openib-general] [PATCH 1/2] Network Event Notifier Mechanism. 
In-Reply-To: <20060621194816.4507.4090.stgit@stevo-desktop> References: <20060621194816.4507.4090.stgit@stevo-desktop> Message-ID: <20060621194818.4507.80455.stgit@stevo-desktop> This patch uses notifier blocks to implement a network event notifier mechanism. Clients register their callback function by calling register_netevent_notifier() like this: static struct notifier_block nb = { .notifier_call = my_callback_func }; ... register_netevent_notifier(&nb); --- include/net/netevent.h | 41 +++++++++++++++++++++++++++++ net/core/netevent.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 108 insertions(+), 0 deletions(-) diff --git a/include/net/netevent.h b/include/net/netevent.h new file mode 100644 index 0000000..9ceab27 --- /dev/null +++ b/include/net/netevent.h @@ -0,0 +1,41 @@ +#ifndef _NET_EVENT_H +#define _NET_EVENT_H + +/* + * Generic netevent notifiers + * + * Authors: + * Tom Tucker + * + * Changes: + */ + +#ifdef __KERNEL__ + +#include + +struct netevent_redirect { + struct dst_entry *old; + struct dst_entry *new; +}; + +struct netevent_route_change { + int event; + struct fib_info *fib_info; +}; + +enum netevent_notif_type { + NETEVENT_NEIGH_UPDATE = 1, /* arg is * struct neighbour */ + NETEVENT_ROUTE_UPDATE, /* arg is * netevent_route_change */ + NETEVENT_PMTU_UPDATE, + NETEVENT_REDIRECT, /* arg is * struct netevent_redirect */ +}; + +extern int register_netevent_notifier(struct notifier_block *nb); +extern int unregister_netevent_notifier(struct notifier_block *nb); +extern int call_netevent_notifiers(unsigned long val, void *v); + +#endif +#endif + + diff --git a/net/core/netevent.c b/net/core/netevent.c new file mode 100644 index 0000000..2261fb3 --- /dev/null +++ b/net/core/netevent.c @@ -0,0 +1,67 @@ +/* + * Network event notifiers + * + * Authors: + * Tom Tucker + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free 
Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Fixes: + */ + +#include +#include + +static struct atomic_notifier_head netevent_notif_chain; + +/** + * register_netevent_notifier - register a netevent notifier block + * @nb: notifier + * + * Register a notifier to be called when a netevent occurs. + * The notifier passed is linked into the kernel structures and must + * not be reused until it has been unregistered. A negative errno code + * is returned on a failure. + */ +int register_netevent_notifier(struct notifier_block *nb) +{ + int err; + + err = atomic_notifier_chain_register(&netevent_notif_chain, nb); + return err; +} + +/** + * netevent_unregister_notifier - unregister a netevent notifier block + * @nb: notifier + * + * Unregister a notifier previously registered by + * register_neigh_notifier(). The notifier is unlinked into the + * kernel structures and may then be reused. A negative errno code + * is returned on a failure. + */ + +int unregister_netevent_notifier(struct notifier_block *nb) +{ + return atomic_notifier_chain_unregister(&netevent_notif_chain, nb); +} + +/** + * call_netevent_notifiers - call all netevent notifier blocks + * @val: value passed unmodified to notifier function + * @v: pointer passed unmodified to notifier function + * + * Call all neighbour notifier blocks. Parameters and return value + * are as for notifier_call_chain(). 
+ */ + +int call_netevent_notifiers(unsigned long val, void *v) +{ + return atomic_notifier_call_chain(&netevent_notif_chain, val, v); +} + +EXPORT_SYMBOL_GPL(register_netevent_notifier); +EXPORT_SYMBOL_GPL(unregister_netevent_notifier); From davem at davemloft.net Wed Jun 21 13:40:46 2006 From: davem at davemloft.net (David Miller) Date: Wed, 21 Jun 2006 13:40:46 -0700 (PDT) Subject: [openib-general] [PATCH 0/2][RFC] Network Event Notifier Mechanism In-Reply-To: <20060621194816.4507.4090.stgit@stevo-desktop> References: <20060621194816.4507.4090.stgit@stevo-desktop> Message-ID: <20060621.134046.35016879.davem@davemloft.net> Most of the folks capable of reviewing networking changes listen in on the netdev at vger.kernel.org mailing list, not here. Thanks. From halr at voltaire.com Wed Jun 21 13:59:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 16:59:15 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] librdmacm/examples/mckey.c: Fix example name in messages Message-ID: <1150923404.4391.189477.camel@hal.voltaire.com> librdmacm/examples/mckey.c: Fix example name in messages Signed-off-by: Hal Rosenstock Index: ../../librdmacm/examples/mckey.c =================================================================== --- ../../librdmacm/examples/mckey.c (revision 8166) +++ ../../librdmacm/examples/mckey.c (working copy) @@ -111,7 +111,7 @@ static int init_node(struct cmatest_node node->pd = ibv_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; - printf("cmatose: unable to allocate PD\n"); + printf("mckey: unable to allocate PD\n"); goto out; } @@ -119,7 +119,7 @@ static int init_node(struct cmatest_node node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; - printf("cmatose: unable to create CQ\n"); + printf("mckey: unable to create CQ\n"); goto out; } @@ -135,13 +135,13 @@ static int init_node(struct cmatest_node init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, 
&init_qp_attr); if (ret) { - printf("cmatose: unable to create QP: %d\n", ret); + printf("mckey: unable to create QP: %d\n", ret); goto out; } ret = create_message(node); if (ret) { - printf("cmatose: failed to create messages: %d\n", ret); + printf("mckey: failed to create messages: %d\n", ret); goto out; } out: @@ -230,7 +230,7 @@ static int addr_handler(struct cmatest_n ret = rdma_join_multicast(node->cma_id, test.dst_addr, node); if (ret) { - printf("cmatose: failure joining: %d\n", ret); + printf("mckey: failure joining: %d\n", ret); goto err; } return 0; @@ -279,7 +279,7 @@ static int cma_handler(struct rdma_cm_id case RDMA_CM_EVENT_ADDR_ERROR: case RDMA_CM_EVENT_ROUTE_ERROR: case RDMA_CM_EVENT_MULTICAST_ERROR: - printf("cmatose: event: %d, error: %d\n", event->event, + printf("mckey: event: %d, error: %d\n", event->event, event->status); connect_error(); ret = event->status; @@ -325,7 +325,7 @@ static int alloc_nodes(void) test.nodes = malloc(sizeof *test.nodes * connections); if (!test.nodes) { - printf("cmatose: unable to allocate memory for test nodes\n"); + printf("mckey: unable to allocate memory for test nodes\n"); return -ENOMEM; } memset(test.nodes, 0, sizeof *test.nodes * connections); @@ -366,7 +366,7 @@ static int poll_cqs(void) for (done = 0; done < message_count; done += ret) { ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { - printf("cmatose: failed polling CQ: %d\n", ret); + printf("mckey: failed polling CQ: %d\n", ret); return ret; } } @@ -415,7 +415,7 @@ static int run(char *dst, char *src) { int i, ret; - printf("cmatose: starting client\n"); + printf("mckey: starting client\n"); if (src) { ret = get_addr(src, &test.src_in); if (ret) @@ -428,13 +428,13 @@ static int run(char *dst, char *src) test.dst_in.sin_port = 7174; - printf("cmatose: joining\n"); + printf("mckey: joining\n"); for (i = 0; i < connections; i++) { ret = rdma_resolve_addr(test.nodes[i].cma_id, src ? 
test.src_addr : NULL, test.dst_addr, 2000); if (ret) { - printf("cmatose: failure getting addr: %d\n", ret); + printf("mckey: failure getting addr: %d\n", ret); connect_error(); return ret; } From bunk at stusta.de Wed Jun 21 15:54:58 2006 From: bunk at stusta.de (Adrian Bunk) Date: Thu, 22 Jun 2006 00:54:58 +0200 Subject: [openib-general] [-mm patch] drivers/scsi/qla2xxx/: make some functions static In-Reply-To: <20060621034857.35cfe36f.akpm@osdl.org> References: <20060621034857.35cfe36f.akpm@osdl.org> Message-ID: <20060621225458.GQ9111@stusta.de> On Wed, Jun 21, 2006 at 03:48:57AM -0700, Andrew Morton wrote: >... > Changes since 2.6.17-rc6-mm2: >... > git-infiniband.patch >... > git trees >... This patch makes some needlessly global functions static. Signed-off-by: Adrian Bunk --- drivers/scsi/qla2xxx/qla_gbl.h | 6 ------ drivers/scsi/qla2xxx/qla_init.c | 8 +++++--- drivers/scsi/qla2xxx/qla_iocb.c | 3 ++- 3 files changed, 7 insertions(+), 10 deletions(-) --- linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_gbl.h.old 2006-06-22 00:48:35.000000000 +0200 +++ linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_gbl.h 2006-06-22 00:50:32.000000000 +0200 @@ -31,13 +31,9 @@ extern void qla24xx_update_fw_options(scsi_qla_host_t *); extern int qla2x00_load_risc(struct scsi_qla_host *, uint32_t *); extern int qla24xx_load_risc(scsi_qla_host_t *, uint32_t *); -extern int qla24xx_load_risc_flash(scsi_qla_host_t *, uint32_t *); - -extern fc_port_t *qla2x00_alloc_fcport(scsi_qla_host_t *, gfp_t); extern int qla2x00_loop_resync(scsi_qla_host_t *); -extern int qla2x00_find_new_loop_id(scsi_qla_host_t *, fc_port_t *); extern int qla2x00_fabric_login(scsi_qla_host_t *, fc_port_t *, uint16_t *); extern int qla2x00_local_device_login(scsi_qla_host_t *, fc_port_t *); @@ -80,8 +76,6 @@ /* * Global Function Prototypes in qla_iocb.c source file. 
*/ -extern void qla2x00_isp_cmd(scsi_qla_host_t *); - extern uint16_t qla2x00_calc_iocbs_32(uint16_t); extern uint16_t qla2x00_calc_iocbs_64(uint16_t); extern void qla2x00_build_scsi_iocbs_32(srb_t *, cmd_entry_t *, uint16_t); --- linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_init.c.old 2006-06-22 00:48:58.000000000 +0200 +++ linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_init.c 2006-06-22 00:49:50.000000000 +0200 @@ -39,6 +39,8 @@ static int qla2x00_restart_isp(scsi_qla_host_t *); +static int qla2x00_find_new_loop_id(scsi_qla_host_t *ha, fc_port_t *dev); + /****************************************************************************/ /* QLogic ISP2x00 Hardware Support Functions. */ /****************************************************************************/ @@ -1701,7 +1703,7 @@ * * Returns a pointer to the allocated fcport, or NULL, if none available. */ -fc_port_t * +static fc_port_t * qla2x00_alloc_fcport(scsi_qla_host_t *ha, gfp_t flags) { fc_port_t *fcport; @@ -2497,7 +2499,7 @@ * Context: * Kernel context. */ -int +static int qla2x00_find_new_loop_id(scsi_qla_host_t *ha, fc_port_t *dev) { int rval; @@ -3472,7 +3474,7 @@ return (rval); } -int +static int qla24xx_load_risc_flash(scsi_qla_host_t *ha, uint32_t *srisc_addr) { int rval; --- linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_iocb.c.old 2006-06-22 00:50:42.000000000 +0200 +++ linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_iocb.c 2006-06-22 00:51:00.000000000 +0200 @@ -15,6 +15,7 @@ static inline cont_entry_t *qla2x00_prep_cont_type0_iocb(scsi_qla_host_t *); static inline cont_a64_entry_t *qla2x00_prep_cont_type1_iocb(scsi_qla_host_t *); static request_t *qla2x00_req_pkt(scsi_qla_host_t *ha); +static void qla2x00_isp_cmd(scsi_qla_host_t *ha); /** * qla2x00_get_cmd_direction() - Determine control_flag data direction. @@ -574,7 +575,7 @@ * * Note: The caller must hold the hardware lock before calling this routine. 
*/ -void +static void qla2x00_isp_cmd(scsi_qla_host_t *ha) { device_reg_t __iomem *reg = ha->iobase; From arlin.r.davis at intel.com Wed Jun 21 15:13:16 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 21 Jun 2006 15:13:16 -0700 Subject: [openib-general] [PATCH] uDAPL cma - event processing bug Message-ID: James, Fix bug in dapls_ib_get_dat_event() call after adding new unreachable event. -arlin Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 8166) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -1092,9 +1092,6 @@ dapls_ib_get_dat_event(IN const ib_cm_ev active = active; - if (ib_cm_event > IB_CME_BROKEN) - return (DAT_EVENT_NUMBER) 0; - dat_event_num = 0; for(i = 0; i < DAPL_IB_EVENT_CNT; i++) { if (ib_cm_event == ib_cm_event_map[i].ib_cm_event) { From sean.hefty at intel.com Wed Jun 21 15:38:50 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 21 Jun 2006 15:38:50 -0700 Subject: [openib-general] [PATCH] [TRIVIAL] librdmacm/examples/mckey.c: Fix example name in messages In-Reply-To: <1150923404.4391.189477.camel@hal.voltaire.com> Message-ID: <000001c69583$78fa93e0$57781cac@amr.corp.intel.com> >librdmacm/examples/mckey.c: Fix example name in messages > >Signed-off-by: Hal Rosenstock Thanks, Hal. Do you mind committing this change? - Sean From ogerlitz at voltaire.com Wed Jun 21 22:11:35 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 22 Jun 2006 08:11:35 +0300 Subject: [openib-general] iSER updates In-Reply-To: <44977F81.9080206@voltaire.com> References: <44977F81.9080206@voltaire.com> Message-ID: <449A2687.9070400@voltaire.com> > Roland Dreier wrote: >> Linus pull my for-2.6.18 branch today, so the RDMA CM is upstream >> now. He still has not pulled scsi-misc-2.6 so AFAIK there is still >> more required before we can merge iSER. 
Roland, I see now that Linus has pulled the scsi-misc-2.6 updates for 2.6.18 - which means the door is open to push iSER... http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=28e4b224955cbe30275b2a7842e729023a4f4b03 Or. From bpradip at in.ibm.com Thu Jun 22 00:25:27 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Thu, 22 Jun 2006 12:55:27 +0530 Subject: [openib-general] [PATCH] [TRIVIAL] cma.c: Remove compiler warning Message-ID: <20060622072520.GA1393@harry-potter.in.ibm.com> This removes a compile warning : `ret' might be used uninitialized in this function This patch is against the IWARP branch of the code Signed-off-by: Pradipta Kumar Banerjee --- Index: core/cma.c ================================================================== --- cma.org 2006-06-22 12:45:33.000000000 +0530 +++ cma.c 2006-06-22 12:45:51.000000000 +0530 @@ -2066,6 +2066,7 @@ int rdma_disconnect(struct rdma_cm_id *i ret = iw_cm_disconnect(id_priv->cm_id.iw, 0); break; default: + ret = -ENOSYS; break; } out: From halr at voltaire.com Thu Jun 22 02:45:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Jun 2006 05:45:21 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] librdmacm/examples/mckey.c: Fix example name in messages In-Reply-To: <000001c69583$78fa93e0$57781cac@amr.corp.intel.com> References: <000001c69583$78fa93e0$57781cac@amr.corp.intel.com> Message-ID: <1150969499.4391.218996.camel@hal.voltaire.com> On Wed, 2006-06-21 at 18:38, Sean Hefty wrote: > >librdmacm/examples/mckey.c: Fix example name in messages > > > >Signed-off-by: Hal Rosenstock > > Thanks, Hal. Do you mind committing this change? Sure; committed in r8170. 
-- Hal > - Sean From halr at voltaire.com Thu Jun 22 03:12:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Jun 2006 06:12:58 -0400 Subject: [openib-general] [PATCH] opensm: osm_pkey_tbl_make_block_pair() removal In-Reply-To: <20060621135238.GB24726@sashak.voltaire.com> References: <20060621135238.GB24726@sashak.voltaire.com> Message-ID: <1150971161.4391.219868.camel@hal.voltaire.com> On Wed, 2006-06-21 at 09:52, Sasha Khapyorsky wrote: > Since 'blocks' pkey vector is updated only by receiver, remove it from > osm_pkey_tbl_set_new_entry(), as well as osm_pkey_tbl_make_block_pair(). > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From eitan at mellanox.co.il Thu Jun 22 04:37:46 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 22 Jun 2006 14:37:46 +0300 Subject: [openib-general] [ibutils PATCH] osm.m4 fix for x86_64 machines Message-ID: <449A810A.3020709@mellanox.co.il> Hi The following patch "osm.m4 fix" changes the way ibutils packages auto-detect stack type and location of OpenSM libraries by scanning the lib and lib64 directories. Instead of hard-coding the dir name based on the uname -m we scan both giving the lib priority on the lib64. I applied it to the ibutils trunk. Please let me know see any issues with it. Eitan -------------- next part -------------- An embedded message was scrubbed... From: unknown sender Subject: no subject Date: no date Size: 96 URL: From eitan at mellanox.co.il Thu Jun 22 04:44:32 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 22 Jun 2006 14:44:32 +0300 Subject: [openib-general] [ibutils PATCH] automake required >= 1.9.2 Message-ID: <449A82A0.3020502@mellanox.co.il> Hi The following patch remove the requirement for automake version to be 1.9.3 and up and instead allows it to be 1.9.2. I have applied the patch. Please let me know if you find any issue with this change. Thanks EZ -------------- next part -------------- An embedded message was scrubbed... 
From: "Sasha Khapyorsky" Subject: [ibutils PATCH] automake required >= 1.9.2 Date: Wed, 21 Jun 2006 19:40:52 +0300 Size: 6453 URL: From mamidala at cse.ohio-state.edu Thu Jun 22 06:08:51 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Thu, 22 Jun 2006 09:08:51 -0400 (EDT) Subject: [openib-general] [librdmacm] rping In-Reply-To: <1150899700.4391.175085.camel@hal.voltaire.com> Message-ID: I was checking rping with the latest stack. The client exits normally, the server still hangs after printing the cq status. server ping data: rdma-ping-9: JKLMNOPQRSTU server DISCONNECT EVENT... wait for RDMA_READ_ADV state 9 cq completion failed status 5 When I kill the process and restart the server I get the following error: rdma_bind_addr error -1 Thanks, Amith From halr at voltaire.com Thu Jun 22 06:24:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Jun 2006 09:24:31 -0400 Subject: [openib-general] [PATCH][MINOR] OpenSM/SA client: In osm_vendor_ibumad_sa.c:osmv_query_sa, eliminate redundant code Message-ID: <1150982667.4391.226473.camel@hal.voltaire.com> OpenSM/SA client: In osm_vendor_ibumad_sa.c:osmv_query_sa, eliminate redundant code Signed-off-by: Hal Rosenstock Index: libvendor/osm_vendor_ibumad_sa.c =================================================================== --- libvendor/osm_vendor_ibumad_sa.c (revision 8174) +++ libvendor/osm_vendor_ibumad_sa.c (working copy) @@ -655,7 +655,6 @@ osmv_query_sa( case OSMV_QUERY_ALL_SVC_RECS: osm_log( p_log, OSM_LOG_DEBUG, "osmv_query_sa DBG:001 %s", "SVC_REC_BY_NAME\n" ); - sa_mad_data.method = IB_MAD_METHOD_GETTABLE; sa_mad_data.attr_id = IB_MAD_ATTR_SERVICE_RECORD; sa_mad_data.attr_offset = ib_get_attr_offset( sizeof( ib_service_record_t ) ); @@ -701,7 +700,6 @@ osmv_query_sa( case OSMV_QUERY_NODE_REC_BY_NODE_GUID: osm_log( p_log, OSM_LOG_DEBUG, "osmv_query_sa DBG:001 %s","NODE_REC_BY_NODE_GUID\n" ); - sa_mad_data.method = IB_MAD_METHOD_GETTABLE; sa_mad_data.attr_id = IB_MAD_ATTR_NODE_RECORD; 
sa_mad_data.attr_offset = ib_get_attr_offset( sizeof( ib_node_record_t ) ); From tziporet at mellanox.co.il Thu Jun 22 06:19:39 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 22 Jun 2006 16:19:39 +0300 Subject: [openib-general] SLES9 SP3 support was added Message-ID: <449A98EB.4050501@mellanox.co.il> Hi All, We have added support for SLES9 SP3 that can be used with OFED 1.0. The kernel modules supported are: * mthca * core * CM & CMA * IPoIB * SRP All user level apps and libraries are working too. CPU Architectures supported: * x86 * x86_64 * ia64 The backport patches are available at: https://openib.org/svn/gen2/branches/1.0/ofed/patches/2.6.5-7.244/ There is also a need to take the updated configure and install.sh that add SLES9 specific support. There are no other changes in the package beside these. Is there a need to create a package (1.0.1) with SLES9 support? Tziporet From tziporet at mellanox.co.il Thu Jun 22 05:53:37 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 22 Jun 2006 15:53:37 +0300 Subject: [openib-general] OFED 1.0 - Official Release (Tziporet Koren) In-Reply-To: <20060616100547.13864.qmail@web36915.mail.mud.yahoo.com> References: <20060616100547.13864.qmail@web36915.mail.mud.yahoo.com> Message-ID: <449A92D1.8090404@mellanox.co.il> zhu shi song wrote: > I'm sorry SDP is not in production state. SDP is very > important for our application and we are waiting it > mature enough to be used in our product. And do you > have any schedule to let SDP work ok(especially can > support many large concurrent connections just like > TCP)? I very appreciate I can test new SDP before end > of June. > tks > zhu > > The plan is to have a stable SDP in 1.1 release. The schedule of 1.1 is end of July in the best case (more likely it will be mid-Aug) However we will have RCs before this and we can let you know when many large concurrent connections are supported. 
Tziporet

From swise at opengridcomputing.com Thu Jun 22 08:30:53 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 22 Jun 2006 10:30:53 -0500
Subject: [openib-general] [librdmacm] rping
In-Reply-To: References: Message-ID: <1150990253.3040.7.camel@stevo-desktop>

On Thu, 2006-06-22 at 09:08 -0400, amith rajith mamidala wrote:
> The client exits normally, the
> server still hangs after printing the cq status.
>
> server ping data: rdma-ping-9: JKLMNOPQRSTU
> server DISCONNECT EVENT...
> wait for RDMA_READ_ADV state 9
> cq completion failed status 5
>
> When I kill the process and restart the server I get the following
> error:
>
> rdma_bind_addr error -1
>
From rdreier at cisco.com Thu Jun 22 09:22:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 22 Jun 2006 09:22:33 -0700 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This is mostly merging the new iSER (iSCSI over RDMA transport) initiator: Krishna Kumar: IB/uverbs: Don't free wr list when it's known to be empty Or Gerlitz: IB/iser: iSCSI iSER transport provider header file IB/iser: iSCSI iSER transport provider high level code IB/iser: iSER initiator iSCSI PDU and TX/RX IB/iser: iSER RDMA CM (CMA) and IB verbs interaction IB/iser: iSER handling of memory for RDMA IB/iser: iSER Kconfig and Makefile Roland Dreier: IB/uverbs: Remove unnecessary list_del()s drivers/infiniband/Kconfig | 2 drivers/infiniband/Makefile | 1 drivers/infiniband/core/uverbs_cmd.c | 2 drivers/infiniband/core/uverbs_main.c | 6 drivers/infiniband/ulp/iser/Kconfig | 11 drivers/infiniband/ulp/iser/Makefile | 4 drivers/infiniband/ulp/iser/iscsi_iser.c | 790 +++++++++++++++++++++++++ drivers/infiniband/ulp/iser/iscsi_iser.h | 354 +++++++++++ drivers/infiniband/ulp/iser/iser_initiator.c | 738 +++++++++++++++++++++++ drivers/infiniband/ulp/iser/iser_memory.c | 401 +++++++++++++ drivers/infiniband/ulp/iser/iser_verbs.c | 827 ++++++++++++++++++++++++++ drivers/scsi/Makefile | 1 12 files changed, 3130 insertions(+), 7 deletions(-) create mode 100644 drivers/infiniband/ulp/iser/Kconfig create mode 100644 drivers/infiniband/ulp/iser/Makefile create mode 100644 drivers/infiniband/ulp/iser/iscsi_iser.c create mode 100644 drivers/infiniband/ulp/iser/iscsi_iser.h create mode 100644 drivers/infiniband/ulp/iser/iser_initiator.c create mode 100644 drivers/infiniband/ulp/iser/iser_memory.c create mode 100644 
drivers/infiniband/ulp/iser/iser_verbs.c From pradeep at us.ibm.com Thu Jun 22 10:22:01 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Thu, 22 Jun 2006 10:22:01 -0700 Subject: [openib-general] IPoIB multicast Message-ID: Can someone please explain the details of IPoIB multicast? Or, if there is some previous discussion or documentation about it, can I get a pointer? In particular, I am looking to understand the initiation of the multicast join through ipoib_send(); the join completion appears to happen through a MAD callback. How are the corresponding skbs freed? Why is the tx_ring used for send, and what is the mcast->pkt_queue used for? Thanks for all the help. Pradeep pradeep at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradeep at us.ibm.com Thu Jun 22 10:22:01 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Thu, 22 Jun 2006 10:22:01 -0700 Subject: [openib-general] Fw: IPoIB multicast Message-ID: I am not sure if this mail got sent out. Please ignore if it is a duplicate. Pradeep pradeep at us.ibm.com ----- Forwarded by Pradeep Satyanarayana/Beaverton/IBM on 06/22/2006 08:50 AM ----- Pradeep Satyanarayana/Beaverton/IBM 06/21/2006 10:28 PM To openib-general at openib.org cc Subject IPoIB multicast Can someone please explain the details of IPoIB multicast? Or, if there is some previous discussion or documentation about it, can I get a pointer? In particular, I am looking to understand the initiation of the multicast join through ipoib_send(); the join completion appears to happen through a MAD callback. How are the corresponding skbs freed? Why is the tx_ring used for send, and what is the mcast->pkt_queue used for? Thanks for all the help. Pradeep pradeep at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ardavis at ichips.intel.com Thu Jun 22 11:12:25 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 22 Jun 2006 11:12:25 -0700 Subject: [openib-general] uCMA kernel slab corruption and oops Message-ID: <449ADD89.6080107@ichips.intel.com> Sean, I am running a couple of iMPI/uDAPL benchmarks at the same time and ran into this: (2.6.17 kernel and svn8112) Jun 22 10:46:51 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 10:46:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:51 localhost kernel: Last user: [](rdma_destroy_id+0x188/0x193 [rdma_cm]) Jun 22 10:46:51 localhost kernel: 0f0: 6b 6b 6b 6b 6b 6b 6b 6b 18 be 2d 37 00 81 ff ff Jun 22 10:46:51 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 10:46:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:51 localhost kernel: Last user: [](ucma_get_event+0x202/0x21f [rdma_ucm]) Jun 22 10:46:51 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:51 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:51 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 10:46:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:51 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 10:46:51 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:51 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 10:46:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:53 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 10:46:53 localhost kernel: 0f0: 40 5c 3c 18 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 10:46:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. 
Jun 22 10:46:53 localhost kernel: Last user: [](ucma_get_event+0x202/0x21f [rdma_ucm]) Jun 22 10:46:53 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 10:46:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:53 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 10:46:53 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:01:01 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:01:01 localhost kernel: Last user: [](ib_destroy_cm_id+0x23b/0x246 [ib_cm]) Jun 22 11:01:01 localhost kernel: 0f0: d0 79 4c 2d 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:01:01 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:01:01 localhost kernel: Last user: [](load_elf_interp+0x411/0x423) Jun 22 11:01:01 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:01:01 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:01:01 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 11:01:01 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:33 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:33 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. 
Jun 22 11:22:33 localhost kernel: Last user: [](load_elf_interp+0x411/0x423) Jun 22 11:22:33 localhost kernel: 0f0: a0 83 9e 21 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:33 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:33 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. Jun 22 11:22:33 localhost kernel: Last user: [](mthca_create_qp+0x48/0x275 [ib_mthca]) Jun 22 11:22:33 localhost kernel: 000: 00 40 6a 3d 00 81 ff ff 38 96 d4 3a 00 81 ff ff Jun 22 11:22:33 localhost kernel: 010: 48 15 64 29 00 81 ff ff 48 15 64 29 00 81 ff ff Jun 22 11:22:33 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:22:33 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:33 localhost kernel: Last user: [](load_elf_interp+0x411/0x423) Jun 22 11:22:33 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:33 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:43 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:43 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:43 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:43 localhost kernel: 0f0: d8 cd c0 36 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:43 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:43 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:43 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:43 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:49 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. 
Jun 22 11:22:49 localhost kernel: Last user: [](ib_destroy_cm_id+0x23b/0x246 [ib_cm]) Jun 22 11:22:49 localhost kernel: 0f0: d8 cd c0 36 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:49 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:49 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:49 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:22:49 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:49 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:49 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:51 localhost kernel: Last user: [](cm_free_work+0x23/0x2a [ib_cm]) Jun 22 11:22:51 localhost kernel: 0f0: d0 79 4c 2d 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:51 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:51 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:22:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. 
Jun 22 11:22:51 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:51 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:53 localhost kernel: Last user: [](__ib_umem_release+0xac/0xd0 [ib_uverbs]) Jun 22 11:22:53 localhost kernel: 0f0: d0 79 4c 2d 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:53 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 11:22:53 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:22:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:53 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:53 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:23:04 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:23:04 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:23:04 localhost kernel: Last user: [](load_elf_binary+0xf11/0x16ef) Jun 22 11:23:04 localhost kernel: 0f0: e8 39 17 1c 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:23:04 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:23:04 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. 
Jun 22 11:23:04 localhost kernel: Last user: [](cm_create_timewait_info+0x1b/0x6b [ib_cm]) Jun 22 11:23:04 localhost kernel: 000: 00 00 00 00 00 00 00 00 e8 56 24 20 00 81 ff ff Jun 22 11:23:04 localhost kernel: 010: e8 56 24 20 00 81 ff ff c5 a4 06 88 ff ff ff ff Jun 22 11:23:04 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:23:04 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. Jun 22 11:23:04 localhost kernel: Last user: [](cm_create_timewait_info+0x1b/0x6b [ib_cm]) Jun 22 11:23:04 localhost kernel: 000: 00 00 00 00 00 00 00 00 18 5b 24 20 00 81 ff ff Jun 22 11:23:04 localhost kernel: 010: 18 5b 24 20 00 81 ff ff c5 a4 06 88 ff ff ff ff Jun 22 11:23:23 localhost kernel: general protection fault: 0000 [1] SMP Jun 22 11:23:23 localhost kernel: CPU 0 Jun 22 11:23:23 localhost kernel: Modules linked in: rdma_ucm rdma_cm ib_addr ib_local_sa findex ib_ucm ib_cm ib_umad ib_uverbs ib_ipoib ib_multicast ib_sa ib_mthca ib_mad ib_core ixgb Jun 22 11:23:23 localhost kernel: Pid: 4078, comm: ib_cm/0 Not tainted 2.6.17 #1 Jun 22 11:23:23 localhost kernel: RIP: 0010:[] {rb_erase+465} Jun 22 11:23:23 localhost kernel: RSP: 0000:ffff810034ba3d58 EFLAGS: 00010002 Jun 22 11:23:23 localhost kernel: RAX: 6b6b6b6b6b6b6b6b RBX: ffff8100202459d0 RCX: ffff8100202459d0 Jun 22 11:23:23 localhost kernel: RDX: ffff810020245be8 RSI: 0000000000000000 RDI: 0000000000000000 Jun 22 11:23:23 localhost kernel: RBP: ffff810034ba3d68 R08: 0000000000000000 R09: 0000000000000000 Jun 22 11:23:23 localhost kernel: R10: ffff8100202458f8 R11: 0000000000000200 R12: ffffffff8806e750 Jun 22 11:23:23 localhost kernel: R13: ffff810020245b10 R14: ffff810020245b10 R15: 0000000000000282 Jun 22 11:23:23 localhost kernel: FS: 0000000000000000(0000) GS:ffffffff806ef000(0000) knlGS:0000000000000000 Jun 22 11:23:23 localhost kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jun 22 11:23:23 localhost kernel: CR2: 0000000000b373f0 CR3: 000000001ffaf000 CR4: 00000000000006e0 Jun 22 11:23:23 
localhost kernel: Process ib_cm/0 (pid: 4078, threadinfo ffff810034ba2000, task ffff810034fb03c0) Jun 22 11:23:23 localhost kernel: Stack: ffff810020245b10 0000000000000286 ffff810034ba3d88 ffffffff88067828 Jun 22 11:23:23 localhost kernel: ffff810020245b10 ffff810020245b18 ffff810034ba3e18 ffffffff8806b4f5 Jun 22 11:23:23 localhost kernel: ffff810034bf2b70 ffff810034ba2000 Jun 22 11:23:23 localhost kernel: Call Trace: {:ib_cm:cm_cleanup_timewait+101} Jun 22 11:23:23 localhost kernel: {:ib_cm:cm_work_handler+4144} {__wake_up+67} Jun 22 11:23:23 localhost kernel: {run_workqueue+184} {:ib_cm:cm_work_handler+0} Jun 22 11:23:23 localhost kernel: {worker_thread+313} {default_wake_function+0} Jun 22 11:23:23 localhost kernel: {default_wake_function+0} {worker_thread+0} Jun 22 11:23:23 localhost kernel: {kthread+215} {child_rip+8} Jun 22 11:23:23 localhost kernel: {kthread+0} {child_rip+0} Jun 22 11:23:23 localhost kernel: Jun 22 11:23:23 localhost kernel: Code: 44 8b 40 08 48 89 c7 45 85 c0 3e 75 1d c7 40 08 01 00 00 00 Jun 22 11:23:23 localhost kernel: RIP {rb_erase+465} RSP Jun 22 11:23:23 localhost kernel: <3>Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:23:23 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:23:23 localhost kernel: Last user: [](cm_free_work+0x23/0x2a [ib_cm]) Jun 22 11:23:23 localhost kernel: 0e0: 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00 00 00 00 Jun 22 11:23:23 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:23:23 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. Jun 22 11:23:23 localhost kernel: Last user: [](rdma_create_id+0x25/0xf2 [rdma_cm]) Jun 22 11:23:23 localhost kernel: 000: 00 40 6a 3d 00 81 ff ff f0 88 d4 3a 00 81 ff ff Jun 22 11:23:23 localhost kernel: 010: 00 00 00 00 00 00 00 00 27 62 08 88 ff ff ff ff Jun 22 11:23:23 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:23:23 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. 
Jun 22 11:23:23 localhost kernel: Last user: [](cm_create_timewait_info+0x1b/0x6b [ib_cm]) Jun 22 11:23:23 localhost kernel: 000: 00 00 00 00 00 00 00 00 18 5b 24 20 00 81 ff ff Jun 22 11:23:23 localhost kernel: 010: 18 5b 24 20 00 81 ff ff c5 a4 06 88 ff ff ff ff Jun 22 11:23:29 localhost kernel: NMI Watchdog detected LOCKUP on CPU 1 Jun 22 11:23:29 localhost kernel: CPU 1 Jun 22 11:23:29 localhost kernel: Modules linked in: rdma_ucm rdma_cm ib_addr ib_local_sa findex ib_ucm ib_cm ib_umad ib_uverbs ib_ipoib ib_multicast ib_sa ib_mthca ib_mad ib_core ixgb Jun 22 11:23:29 localhost kernel: Pid: 4079, comm: ib_cm/1 Not tainted 2.6.17 #1 Jun 22 11:23:29 localhost kernel: RIP: 0010:[] {.text.lock.spinlock+14} Jun 22 11:23:29 localhost kernel: RSP: 0018:ffff810034989d10 EFLAGS: 00000086 Jun 22 11:23:29 localhost kernel: RAX: ffff81003d2035e0 RBX: 000000000005ac69 RCX: ffff810034989ed0 Jun 22 11:23:29 localhost kernel: RDX: ffff81003def5190 RSI: 000000000005ac69 RDI: ffffffff8806e720 Jun 22 11:23:29 localhost kernel: RBP: ffff810034989d18 R08: ffff810034988000 R09: 0000000000000001 Jun 22 11:23:29 localhost kernel: R10: 00000000ffffffff R11: 0000000000000003 R12: ffff81003d203680 Jun 22 11:23:29 localhost kernel: R13: 000000000005ac68 R14: ffff810019b0c4d0 R15: 0000000000000282 Jun 22 11:23:29 localhost kernel: FS: 0000000000000000(0000) GS:ffff810037e9e2a8(0000) knlGS:0000000000000000 Jun 22 11:23:29 localhost kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jun 22 11:23:29 localhost kernel: CR2: 00002b5009e20d10 CR3: 0000000021544000 CR4: 00000000000006e0 Jun 22 11:23:29 localhost kernel: Process ib_cm/1 (pid: 4079, threadinfo ffff810034988000, task ffff810035de50a0) Jun 22 11:23:29 localhost kernel: Stack: 0000000000000282 ffff810034989d48 ffffffff880673c2 ffff810021198a48 Jun 22 11:23:29 localhost kernel: ffff810019b0c4d0 ffff81003d203680 ffff810019b0c4d0 ffff810034989d88 Jun 22 11:23:29 localhost kernel: ffffffff88069944 0000000000000000 Jun 22 11:23:29 
localhost kernel: Call Trace: {:ib_cm:cm_acquire_id+30} Jun 22 11:23:29 localhost kernel: {:ib_cm:cm_dreq_handler+51} {:ib_cm:cm_work_handler+4051} Jun 22 11:23:29 localhost kernel: {run_workqueue+184} {:ib_cm:cm_work_handler+0} Jun 22 11:23:29 localhost kernel: {worker_thread+313} {default_wake_function+0} Jun 22 11:23:29 localhost kernel: {default_wake_function+0} {worker_thread+0} Jun 22 11:23:29 localhost kernel: {kthread+215} {child_rip+8} Jun 22 11:23:29 localhost kernel: {kthread+0} {child_rip+0} Jun 22 11:23:29 localhost kernel: Jun 22 11:23:29 localhost kernel: Code: 83 3f 00 7e f9 e9 99 fd ff ff e8 2a d1 e4 ff e9 bd fd ff ff Jun 22 11:23:29 localhost kernel: console shuts up ... Jun 22 11:25:13 localhost kernel: NMI Watchdog detected LOCKUP on CPU 0 Jun 22 11:43:23 localhost syslogd 1.4.1: restart. From arlin.r.davis at intel.com Thu Jun 22 11:17:54 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 22 Jun 2006 11:17:54 -0700 Subject: [openib-general] [PATCH] uDAPL dapl_evd_connection_callback does not support TIMED_OUT event Message-ID: James, Added support for active side TIMED_OUT event from a provider. 
Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/common/dapl_evd_connection_callb.c =================================================================== --- dapl/common/dapl_evd_connection_callb.c (revision 8166) +++ dapl/common/dapl_evd_connection_callb.c (working copy) @@ -162,48 +162,15 @@ dapl_evd_connection_callback ( break; } case DAT_CONNECTION_EVENT_DISCONNECTED: - { - /* - * EP is now fully disconnected; initiate any post processing - * to reset the underlying QP and get the EP ready for - * another connection - */ - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_PEER_REJECTED: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_UNREACHABLE: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_NON_PEER_REJECTED: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_BROKEN: + case DAT_CONNECTION_EVENT_TIMED_OUT: { ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; dapls_ib_disconnect_clean (ep_ptr, DAT_FALSE, ib_cm_event); dapl_os_unlock ( &ep_ptr->header.lock ); - break; } case DAT_CONNECTION_REQUEST_EVENT: From bpradip at in.ibm.com Thu Jun 22 11:36:49 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 00:06:49 +0530 Subject: [openib-general] [librdmacm] rping In-Reply-To: References: Message-ID: <449AE341.7070809@in.ibm.com> amith rajith mamidala wrote: > I was checking rping with the latest stack. 
The client exits normally, the > server still hangs after printing the cq status. I have seen this happening in the following two scenarios: (1) server exits before the client - The client prints the following errors and stays there: client DISCONNECT EVENT... cq completion failed status 5 client: post send error 22 (2) client exits before the server - The o/p is the same as what you get. This behaviour is because of the way the cm_thread() and cq_thread() functions are written. I have coded a fix for this. Will send it tomorrow after some more testing. > > server ping data: rdma-ping-9: JKLMNOPQRSTU > server DISCONNECT EVENT... > wait for RDMA_READ_ADV state 9 > cq completion failed status 5 > > When I kill the process and restart the server I get the following error: > > rdma_bind_addr error -1 You will be able to kill only the rping process. If you look at the 'ps ax' output you will see that lt-rping is in the 'D' state. Hence the bind error. Only a reboot helps. Thanks, Pradipta Kumar. > > > Thanks, > Amith > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From bpradip at in.ibm.com Thu Jun 22 12:18:46 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 00:48:46 +0530 Subject: [openib-general] [PATCH] rping.c: Fix hang if either the server or the client exits early Message-ID: <20060622191838.GA24554@harry-potter.ibm.com> Reply-To: bpradip at in.ibm.com This patch fixes the problem as reported by Amith. 
Signed-off-by: Pradipta Kumar Banerjee --- Index: rping.c ============================================================================= --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 +++ rping.c 2006-06-23 00:39:06.000000000 +0530 @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc case RDMA_CM_EVENT_DISCONNECTED: fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? "server" : "client"); sem_post(&cb->sem); + ret = -1; break; case RDMA_CM_EVENT_DEVICE_REMOVAL: From bpradip at in.ibm.com Thu Jun 22 12:23:10 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 00:53:10 +0530 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early Message-ID: <20060622192259.GA24588@harry-potter.ibm.com> Hi, Please ignore the earlier mail. There were some problems with the mailer. Here is the new one. This patch fixes the problem as reported by Amith. Signed-off-by: Pradipta Kumar Banerjee --- Index: rping.c ============================================================================= --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 +++ rping.c 2006-06-23 00:39:06.000000000 +0530 @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc case RDMA_CM_EVENT_DISCONNECTED: fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? 
"server" : "client"); sem_post(&cb->sem); + ret = -1; break; case RDMA_CM_EVENT_DEVICE_REMOVAL: From halr at voltaire.com Thu Jun 22 12:46:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Jun 2006 15:46:15 -0400 Subject: [openib-general] [PATCH][TRIVIAL] librdmacm/examples/udaddy.c: Fix example name in messages Message-ID: <1151005558.4391.240388.camel@hal.voltaire.com> librdmacm/examples/udaddy.c: Fix example name in messages Signed-off-by: Hal Rosenstock Index: ../../librdmacm/examples/udaddy.c =================================================================== --- ../../librdmacm/examples/udaddy.c (revision 8166) +++ ../../librdmacm/examples/udaddy.c (working copy) @@ -47,8 +47,8 @@ /* * To execute: - * Server: rdma_cmatose - * Client: rdma_cmatose "dst_ip=ip" + * Server: udaddy + * Client: udaddy [server_addr [src_addr]] */ struct cmatest_node { @@ -116,7 +116,7 @@ static int init_node(struct cmatest_node node->pd = ibv_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; - printf("cmatose: unable to allocate PD\n"); + printf("udaddy: unable to allocate PD\n"); goto out; } @@ -124,7 +124,7 @@ static int init_node(struct cmatest_node node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; - printf("cmatose: unable to create CQ\n"); + printf("udaddy: unable to create CQ\n"); goto out; } @@ -140,13 +140,13 @@ static int init_node(struct cmatest_node init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); if (ret) { - printf("cmatose: unable to create QP: %d\n", ret); + printf("udaddy: unable to create QP: %d\n", ret); goto out; } ret = create_message(node); if (ret) { - printf("cmatose: failed to create messages: %d\n", ret); + printf("udaddy: failed to create messages: %d\n", ret); goto out; } out: @@ -225,7 +225,7 @@ static int addr_handler(struct cmatest_n ret = rdma_resolve_route(node->cma_id, 2000); if (ret) { - printf("cmatose: resolve route failed: %d\n", 
ret); + printf("udaddy: resolve route failed: %d\n", ret); connect_error(); } return ret; @@ -250,7 +250,7 @@ static int route_handler(struct cmatest_ conn_param.retry_count = 5; ret = rdma_connect(node->cma_id, &conn_param); if (ret) { - printf("cmatose: failure connecting: %d\n", ret); + printf("udaddy: failure connecting: %d\n", ret); goto err; } return 0; @@ -287,7 +287,7 @@ static int connect_handler(struct rdma_c conn_param.qp_type = node->cma_id->qp->qp_type; ret = rdma_accept(node->cma_id, &conn_param); if (ret) { - printf("cmatose: failure accepting: %d\n", ret); + printf("udaddy: failure accepting: %d\n", ret); goto err2; } node->connected = 1; @@ -298,7 +298,7 @@ err2: node->cma_id = NULL; connect_error(); err1: - printf("cmatose: failing connection request\n"); + printf("udaddy: failing connection request\n"); rdma_reject(cma_id, NULL, 0); return ret; } @@ -351,7 +351,7 @@ static int cma_handler(struct rdma_cm_id case RDMA_CM_EVENT_CONNECT_ERROR: case RDMA_CM_EVENT_UNREACHABLE: case RDMA_CM_EVENT_REJECTED: - printf("cmatose: event: %d, error: %d\n", event->event, + printf("udaddy: event: %d, error: %d\n", event->event, event->status); connect_error(); ret = event->status; @@ -397,7 +397,7 @@ static int alloc_nodes(void) test.nodes = malloc(sizeof *test.nodes * connections); if (!test.nodes) { - printf("cmatose: unable to allocate memory for test nodes\n"); + printf("udaddy: unable to allocate memory for test nodes\n"); return -ENOMEM; } memset(test.nodes, 0, sizeof *test.nodes * connections); @@ -449,7 +449,7 @@ static int poll_cqs(void) for (done = 0; done < message_count; done += ret) { ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { - printf("cmatose: failed polling CQ: %d\n", ret); + printf("udaddy: failed polling CQ: %d\n", ret); return ret; } @@ -480,10 +480,10 @@ static int run_server(void) struct rdma_cm_id *listen_id; int i, ret; - printf("cmatose: starting server\n"); + printf("udaddy: starting server\n"); ret = 
rdma_create_id(test.channel, &listen_id, &test, RDMA_PS_UDP); if (ret) { - printf("cmatose: listen request failed\n"); + printf("udaddy: listen request failed\n"); return ret; } @@ -491,13 +491,13 @@ static int run_server(void) test.src_in.sin_port = 7174; ret = rdma_bind_addr(listen_id, test.src_addr); if (ret) { - printf("cmatose: bind address failed: %d\n", ret); + printf("udaddy: bind address failed: %d\n", ret); return ret; } ret = rdma_listen(listen_id, 0); if (ret) { - printf("cmatose: failure trying to listen: %d\n", ret); + printf("udaddy: failure trying to listen: %d\n", ret); goto out; } @@ -552,7 +552,7 @@ static int run_client(char *dst, char *s { int i, ret; - printf("cmatose: starting client\n"); + printf("udaddy: starting client\n"); if (src) { ret = get_addr(src, &test.src_in); if (ret) @@ -565,13 +565,13 @@ static int run_client(char *dst, char *s test.dst_in.sin_port = 7174; - printf("cmatose: connecting\n"); + printf("udaddy: connecting\n"); for (i = 0; i < connections; i++) { ret = rdma_resolve_addr(test.nodes[i].cma_id, src ? test.src_addr : NULL, test.dst_addr, 2000); if (ret) { - printf("cmatose: failure getting addr: %d\n", ret); + printf("udaddy: failure getting addr: %d\n", ret); connect_error(); return ret; } From swise at opengridcomputing.com Thu Jun 22 13:24:07 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 Jun 2006 15:24:07 -0500 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early In-Reply-To: <20060622192259.GA24588@harry-potter.ibm.com> References: <20060622192259.GA24588@harry-potter.ibm.com> Message-ID: <1151007847.3040.51.camel@stevo-desktop> The goal of adding the return codes was so that the rping program could exit with a status indicating success or failure. Every rping run results in a DISCONNECT event, so I don't think we want to treat that case as an error. 
Also, can you explain why this fixes Amith's problem, which sounded like a process was hanging? Thanks, Steve. On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: > Hi, > Please ignore the earlier mail. There were some problems with the mailer. > Here is the new one. > > This patch fixes the problem as reported by Amith. > > Signed-off-by: Pradipta Kumar Banerjee > > --- > > Index: rping.c > ============================================================================= > --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 > +++ rping.c 2006-06-23 00:39:06.000000000 +0530 > @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc > case RDMA_CM_EVENT_DISCONNECTED: > fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? "server" : "client"); > sem_post(&cb->sem); > + ret = -1; > break; > > case RDMA_CM_EVENT_DEVICE_REMOVAL: From jlentini at netapp.com Thu Jun 22 13:58:57 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 22 Jun 2006 16:58:57 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL cma - event processing bug In-Reply-To: References: Message-ID: On Wed, 21 Jun 2006, Arlin Davis wrote: > James, > > Fix bug in dapls_ib_get_dat_event() call after adding new > unreachable event. Committed in revision 8180. From jlentini at netapp.com Thu Jun 22 14:13:35 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 22 Jun 2006 17:13:35 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL dapl_evd_connection_callback does not support TIMED_OUT event In-Reply-To: References: Message-ID: On Thu, 22 Jun 2006, Arlin Davis wrote: > James, > > Added support for active side TIMED_OUT event from a provider. 
Committed revision 8181, but with the different flag values retained: Index: dapl/common/dapl_evd_connection_callb.c =================================================================== --- dapl/common/dapl_evd_connection_callb.c (revision 8109) +++ dapl/common/dapl_evd_connection_callb.c (working copy) @@ -162,34 +162,8 @@ dapl_evd_connection_callback ( break; } case DAT_CONNECTION_EVENT_DISCONNECTED: - { - /* - * EP is now fully disconnected; initiate any post processing - * to reset the underlying QP and get the EP ready for - * another connection - */ - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_PEER_REJECTED: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_UNREACHABLE: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_NON_PEER_REJECTED: { ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; @@ -199,6 +173,7 @@ dapl_evd_connection_callback ( break; } case DAT_CONNECTION_EVENT_BROKEN: + case DAT_CONNECTION_EVENT_TIMED_OUT: { ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; dapls_ib_disconnect_clean (ep_ptr, DAT_FALSE, ib_cm_event); From bos at pathscale.com Thu Jun 22 14:30:08 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 22 Jun 2006 14:30:08 -0700 Subject: [openib-general] ipath verbs does not compile against the latest SVN trunk verbs In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0008057B9B@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0008057B9B@orsmsx408> Message-ID: <1151011808.26502.19.camel@chalcedony.pathscale.com> On Tue, 2006-06-20 at 09:55 -0700, Woodruff, Robert J wrote: > When I 
try to build SVN 8112 I get the following errors trying > to build the ipath verbs. We're a bit out of date on the trunk. We'll be syncing it up RSN. James, Lower the reject debug message level so we don't see warnings when consumers reject. Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 8166) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -359,7 +359,7 @@ static void dapli_cm_active_cb(struct da cm_event = IB_CME_DESTINATION_REJECT; dapl_dbg_log( - DAPL_DBG_TYPE_WARN, + DAPL_DBG_TYPE_CM, " dapli_cm_active_handler: REJECTED reason=%d\n", event->status); From dsnedigar at calpont.com Thu Jun 22 14:56:49 2006 From: dsnedigar at calpont.com (Don Snedigar) Date: Thu, 22 Jun 2006 16:56:49 -0500 Subject: [openib-general] OFED-1.0 fails install on AMD64 Message-ID: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> I just downloaded the OFED-1.0 and the install was going fine until ibutils. At that point, the install fails with : Open MPI RPM will be created during the installation process Building ibutils RPM. Please wait... 
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm - ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" See log file: /tmp/OFED.28656.log I dug down into the log file it indicates and found : g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -fPIC -DPIC -o .libs/ibnl_scanner.o ibnl_scanner.ll: In function 'int ibnl_lex()': ibnl_scanner.ll:197: warning: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)', declared with attribute warn_unused_result g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o ibnl_scanner.o >/dev/null 2>&1 /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo g++ -shared -nostdlib /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o .libs/libibdmcom.so.1.1.1 /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be used when making a shared object; recompile with -fPIC /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read symbols: Bad value collect2: ld returned 1 exit status make[3]: *** [libibdmcom.la] Error 1 make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm/datamodel' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' make: *** [all-recursive] Error 1 error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) RPM build errors: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" Can anyone shed any light on this ? Machine is dual Opteron, 2 gig memory, kernel 2.6.16 Don Snedigar Calpont Corp. 214-618-9516 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jlentini at netapp.com Thu Jun 22 14:56:03 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 22 Jun 2006 17:56:03 -0400 (EDT) Subject: [openib-general] [Bug 146] OFED-1.0 DAPL fails to build on SLES10 on IA64 with IA64_FETCHADD error In-Reply-To: <20060622215505.2F2CF22873D@openib.ca.sandia.gov> References: <20060622215505.2F2CF22873D@openib.ca.sandia.gov> Message-ID: On Thu, 22 Jun 2006, bugzilla-daemon at openib.org wrote: > http://openib.org/bugzilla/show_bug.cgi?id=146 > > > jlentini at netapp.com changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Status|NEW |ASSIGNED > > > > > ------- Comment #1 from jlentini at netapp.com 2006-06-22 14:55 ------- > We have code in dapl/udapl/linux/dapl_osd.h that is supposed to handle this. > It looks like this broke when we moved to the autotools. I'll send you a patch > to test. Here's the patch. Thank you for offering to test this. Please let me know if it fixes the problem (I do not have an IA64 SLES system). Index: Makefile.am =================================================================== --- Makefile.am (revision 8109) +++ Makefile.am (working copy) @@ -1,10 +1,11 @@ # $Id: $ +OSFLAGS = -DOS_VERSION=$(shell expr `uname -r | cut -f1 -d.` \* 65536 + `uname -r | cut -f2 -d.`) # Check for RedHat, needed for ia64 udapl atomic operations (IA64_FETCHADD syntax) if OS_RHEL -OSFLAGS=-DREDHAT_EL4 +OSFLAGS += -DREDHAT_EL4 else -OSFLAGS= +OSFLAGS += endif if DEBUG From jlentini at netapp.com Thu Jun 22 15:02:23 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 22 Jun 2006 18:02:23 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL cma: lower debug level on consumer rejects In-Reply-To: References: Message-ID: On Thu, 22 Jun 2006, Arlin Davis wrote: > James, > > Lower the reject debug message level so we don't see warnings when > consumers reject. Committed in revision 8182.
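[Editor's note: the OS_VERSION macro introduced by the Makefile.am patch above packs the kernel's major and minor version numbers into one integer (major * 65536 + minor). A minimal standalone sketch of that same arithmetic, with the kernel string hardcoded for illustration instead of read from a live `uname -r`:]

```shell
# Sketch of the OS_VERSION encoding from the Makefile.am patch above.
# "2.6.16" is a hardcoded example; the patch uses `uname -r` at build time.
kernel="2.6.16"
major=$(echo "$kernel" | cut -f1 -d.)
minor=$(echo "$kernel" | cut -f2 -d.)
expr "$major" \* 65536 + "$minor"   # prints 131078 (2*65536 + 6)
```

[This gives a single number that version checks — presumably the code in dapl/udapl/linux/dapl_osd.h mentioned above — can compare numerically against a threshold.]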
From paul.lundin at gmail.com Thu Jun 22 15:16:17 2006 From: paul.lundin at gmail.com (Paul) Date: Thu, 22 Jun 2006 18:16:17 -0400 Subject: [openib-general] OFED-1.0 fails install on AMD64 In-Reply-To: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> References: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> Message-ID: Well taking a couple of stabs in the dark here. What version of redhat/fedora are you using ? I am using rhel 4 update 3 and it uses gcc version 3.4.5-2 by default. It appears as if your system is using 4.0.0. Also do you have any environment variables set ? Such as CFLAGS, CCFLAGS or the like ? For the record the only reason I mention gcc 4x is because it is the only time I have personally seen that error arise. On 6/22/06, Don Snedigar wrote: > > I just downloaded the OFED-1.0 and the install was going fine until > ibutils. At that point, the install fails with : > > Open MPI RPM will be created during the installation process > > > Building ibutils RPM. Please wait... 
> > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define > 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix > /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir > %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' > /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > - > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix > /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir > %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' > /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > See log file: /tmp/OFED.28656.log > > I dug down into the log file it indicates and found : > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo > -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -fPIC -DPIC -o > .libs/ibnl_scanner.o > ibnl_scanner.ll: In function 'int ibnl_lex()': > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > warn_unused_result > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo > -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o ibnl_scanner.o > >/dev/null 2>&1 > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../..
-L/lib/../lib64 > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o .libs/libibdmcom.so.1.1.1 > /usr/bin/ld: > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > symbols: Bad value > collect2: ld returned 1 exit status > make[3]: *** [libibdmcom.la] Error 1 > make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0 > /ibdm/datamodel' > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make: *** [all-recursive] Error 1 > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > RPM build errors: > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix > /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir > %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' > /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > Can anyone shed any light on this ? > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > Don Snedigar > Calpont Corp. 
> 214-618-9516 > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dsnedigar at calpont.com Thu Jun 22 15:35:43 2006 From: dsnedigar at calpont.com (Don Snedigar) Date: Thu, 22 Jun 2006 17:35:43 -0500 Subject: [openib-general] OFED-1.0 fails install on AMD64 Message-ID: <8953B8331AA98041B0C11DBC678AFC0812C7C0@srvemail1.calpont.com> Actually, its FSM Labs v 2.2.3 with the 2.6.16 kernel. We had FC4 on the box, but then added RTLinuxPro on the box. Yes, gcc is version 4 (gcc --version gives 4.0.0 20050519 (Red Hat 4.0.0-8) Only environment variables set would be the ones that the install script sets itself.\ don ________________________________ From: Paul [mailto:paul.lundin at gmail.com] Sent: Thursday, June 22, 2006 5:16 PM To: Don Snedigar Cc: openib-general at openib.org Subject: Re: [openib-general] OFED-1.0 fails install on AMD64 Well taking a couple of stabs in the dark here. What version of redhat/fedora are you using ? I am using rhel 4 update 3 and it uses gcc version 3.4.5-2 by default. It appears as if your system is using 4.0.0. Also do you have any environment variables set ? Such as CFLAGS, CCFLAGS or the like ? For the record the only reason I mention gcc 4x is because it is the only time I have personally seen that error arise. On 6/22/06, Don Snedigar wrote: I just downloaded the OFED-1.0 and the install was going fine until ibutils. At that point, the install fails with : Open MPI RPM will be created during the installation process Building ibutils RPM. Please wait... 
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm - ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" See log file: /tmp/OFED.28656.log I dug down into the log file it indicates and found : g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -fPIC -DPIC -o .libs/ibnl_scanner.o ibnl_scanner.ll: In function 'int ibnl_lex()': ibnl_scanner.ll:197: warning: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)', declared with attribute warn_unused_result g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o ibnl_scanner.o >/dev/null 2>&1 /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo g++ -shared -nostdlib /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o .libs/libibdmcom.so.1.1.1 /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be used when making a shared object; recompile with -fPIC /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read symbols: Bad value collect2: ld returned 1 exit status make[3]: *** [libibdmcom.la] Error 1 make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm/datamodel' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' make: *** [all-recursive] Error 1 error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) RPM build errors: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" Can anyone shed any light on this ? Machine is dual Opteron, 2 gig memory, kernel 2.6.16 Don Snedigar Calpont Corp. 214-618-9516 _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From viswa.krish at gmail.com Thu Jun 22 16:18:17 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Thu, 22 Jun 2006 16:18:17 -0700 Subject: [openib-general] Disabling end-to-end flow control Message-ID: <4df28be40606221618o17bee45bg2289fab53985d168@mail.gmail.com> Is there a way to disable end-to-end flowcontrol using any of the API's ? Thanks, -Viswa -------------- next part -------------- An HTML attachment was scrubbed... URL: From dddownload at web.de Fri Jun 23 01:06:40 2006 From: dddownload at web.de (Torsten Boob) Date: Fri, 23 Jun 2006 10:06:40 +0200 Subject: [openib-general] NFS/RDMA Message-ID: <20060623100640.2119dc38@matrix.tuxianer.homelinux.net> Hello, i have Problems to set up a nfsrdma connection (nfsrdma update 5). ./nfsrdmamount -o rdma=192.168.99.1 192.168.99.1:/nfs /mnt/nfs has following output on the client. RPC: xprt_setup_rdma: 192.168.99.1:2049 unexpected event received for QP=ffff81013eaaea00, event =4 svc_rdma_recvfrom: transport ffff81013f54b600 is closing svc_rdma_recvfrom: transport ffff81013f54b600 is closing svc_rdma_put: Destroying transport ffff81013f54b600, cm_id=ffff81013f54b400, sk_flags=54, sk_inuse=0 nfs: RPC call returned error 103 nfsmount: Software caused connection abort Using normal nfs with mount -t nfs 192.168.99.1:/nfs /mnt/nfs results in mount: 192.168.99.1:/nfs: can't read superblock Same results with openib svn {20060516} Tested with Debian Sarge and Etch. Any ideas ? Torsten From eitan at mellanox.co.il Fri Jun 23 01:48:14 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 23 Jun 2006 11:48:14 +0300 Subject: [openib-general] OFED-1.0 fails install on AMD64 In-Reply-To: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> References: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> Message-ID: <449BAACE.6000609@mellanox.co.il> Hi Don, Sorry for my late response. 
ibutils compilation (of libibdmcom) is breaking with the error message: > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC For the command: > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib So obviously one has to figure out why -shared did not cause -fPIC ? Also not clear why this does not break on other machines. Anyways, reproducing the problem is my first target. One obvious thing to try is to set CFLAGS=-fPIC As I do not have access to the exact type of your machine : FSM Labs v 2.2.3 with the 2.6.16 kernel (as the weekend started over here) I guess I will be able to reproduce only Sun/Mon. Eitan Don Snedigar wrote: > I just downloaded the OFED-1.0 and the install was going fine until > ibutils. At that point, the install fails with : > > Open MPI RPM will be created during the installation process > > > Building ibutils RPM. Please wait...
> > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define > 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > - > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > See log file: /tmp/OFED.28656.log > > > I dug down into the log file it indicates and found : > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > - -o .libs/ibnl_scanner.o > ibnl_scanner.ll: In function 'int ibnl_lex()': > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > warn_unused_result > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o > ibnl_scanner.o >/dev/null 2>&1 > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o > .libs/libibdmcom.so.1.1.1 > /usr/bin/ld: > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > symbols: Bad value > collect2: ld returned 1 exit status > make[3]: *** [libibdmcom.la] Error 1 > make[3]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm/datamodel' > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make: *** [all-recursive] Error 1 > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > RPM build errors: > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > Can anyone shed any light on this ? > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > Don Snedigar > Calpont Corp. 
> 214-618-9516 > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From zhushisongzhu at yahoo.com Fri Jun 23 03:01:46 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Fri, 23 Jun 2006 03:01:46 -0700 (PDT) Subject: [openib-general] OFED 1.0 - Official Release (Tziporet Koren) In-Reply-To: <449A92D1.8090404@mellanox.co.il> Message-ID: <20060623100146.49805.qmail@web36911.mail.mud.yahoo.com> thank you very much SDP is very good concept. We can port legacy applications to support infiniband and develop new applications easily and quickly. Good luck and waiting for your good news. I'm urgent to deploy infiniband cards for our real production system. zhu --- Tziporet Koren wrote: > zhu shi song wrote: > > I'm sorry SDP is not in production state. SDP is > very > > important for our application and we are waiting > it > > mature enough to be used in our product. And do > you > > have any schedule to let SDP work ok(especially > can > > support many large concurrent connections just > like > > TCP)? I very appreciate I can test new SDP before > end > > of June. > > tks > > zhu > > > > > The plan is to have a stable SDP in 1.1 release. > The schedule of 1.1 is end of July in the best case > (more likely it will > be mid-Aug) > However we will have RCs before this and we can let > you know when many > large concurrent connections are supported. > > Tziporet > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! 
Mail has the best spam protection around http://mail.yahoo.com From sean.hefty at intel.com Fri Jun 23 04:51:21 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 04:51:21 -0700 Subject: [openib-general] Disabling end-to-end flow control In-Reply-To: <4df28be40606221618o17bee45bg2289fab53985d168@mail.gmail.com> Message-ID: <000401c696bb$5acd42d0$f0791cac@amr.corp.intel.com> Is there a way to disable end-to-end flowcontrol using any of the API's ? I believe that all of the APIs (verbs, ib_cm, rdma_cm) let the user specify whether flow control is enabled. - Sean -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Fri Jun 23 04:52:46 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 04:52:46 -0700 Subject: [openib-general] [PATCH][TRIVIAL] librdmacm/examples/udaddy.c: Fix example name in messages In-Reply-To: <1151005558.4391.240388.camel@hal.voltaire.com> Message-ID: <000901c696bb$8dd5af00$f0791cac@amr.corp.intel.com> >librdmacm/examples/udaddy.c: Fix example name in messages > >Signed-off-by: Hal Rosenstock Thanks - if you haven't, can you commit this as well? (My connection is _really_ slow at the moment...) - Sean From sean.hefty at intel.com Fri Jun 23 05:00:44 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 05:00:44 -0700 Subject: [openib-general] uCMA kernel slab corruption and oops In-Reply-To: <449ADD89.6080107@ichips.intel.com> Message-ID: <000b01c696bc$aad5e380$f0791cac@amr.corp.intel.com> I will look into this next week. 
- Sean From halr at voltaire.com Fri Jun 23 05:15:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jun 2006 08:15:05 -0400 Subject: [openib-general] [PATCH][TRIVIAL] librdmacm/examples/udaddy.c: Fix example name in messages In-Reply-To: <000901c696bb$8dd5af00$f0791cac@amr.corp.intel.com> References: <000901c696bb$8dd5af00$f0791cac@amr.corp.intel.com> Message-ID: <1151064898.4391.279481.camel@hal.voltaire.com> On Fri, 2006-06-23 at 07:52, Sean Hefty wrote: > >librdmacm/examples/udaddy.c: Fix example name in messages > > > >Signed-off-by: Hal Rosenstock > > Thanks - if you haven't, can you commit this as well? (My connection is > _really_ slow at the moment...) Sure; committed in r8187. -- Hal > - Sean From bpradip at in.ibm.com Fri Jun 23 05:50:27 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 18:20:27 +0530 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early In-Reply-To: <1151007847.3040.51.camel@stevo-desktop> References: <20060622192259.GA24588@harry-potter.ibm.com> <1151007847.3040.51.camel@stevo-desktop> Message-ID: <449BE393.3020308@in.ibm.com> Steve Wise wrote: > The goal of adding the return codes was so that the rping program could > exit with a status indicating success or failure. Every rping run > results in a DISCONNECT event, so I don't think we want to treat that > case as an error. A DISCONNECT event will be generated when the connection is closed or in case of some error (like CCAE_LLP_CONNECTION_LOST or CCAE_BAD_CLOSE in the case of the Ammasso driver, etc.). > > > Also, can you explain why this fixes Amith's problem, which sounded like > a process was hanging? > On debugging I found that the main thread was blocked in ibv_destroy_cq(), cm_thread was blocked in rdma_get_cm_event->write() and cq_thread was blocked in ibv_get_cq_event->read. Taking the return value of the DISCONNECT event into consideration forcefully killed the process.
On delving deeper into this problem, I think that there is more to this rping hang. Let me work on this further. On a related note - I noticed another rping hang in the following case - Start the rping as a client without first starting an rping server - If you are lucky the first run itself will result in the 'lt-rping' process in 'D' state. If not, repeating the procedure will result in the hang. This is the output: cq completion failed status 5 wait for CONNECTED state 10 connect error -1 Thanks, Pradipta. > > Thanks, > > Steve. > > > > On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: >> Hi, >> Please ignore the earlier mail. There were some problems with the mailer. >> Here is the new one. >> >> This patch fixes the problem as reported by Amith. >> >> Signed-off-by: Pradipta Kumar Banerjee >> >> --- >> >> Index: rping.c >> ============================================================================= >> --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 >> +++ rping.c 2006-06-23 00:39:06.000000000 +0530 >> @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc >> case RDMA_CM_EVENT_DISCONNECTED: >> fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? "server" : "client"); >> sem_post(&cb->sem); >> + ret = -1; >> break; >> >> case RDMA_CM_EVENT_DEVICE_REMOVAL: > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From swise at opengridcomputing.com Fri Jun 23 06:44:50 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 08:44:50 -0500 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver.
In-Reply-To: <1150836226.2891.231.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> Message-ID: <1151070290.7808.33.camel@stevo-desktop> > > Also on a related note, have you checked the driver for the needed PCI > posting flushes? > > > + > > + /* Disable IRQs by clearing the interrupt mask */ > > + writel(1, c2dev->regs + C2_IDIS); > > + writel(0, c2dev->regs + C2_NIMR0); > > like here... This code is followed by a call to c2_reset(), which interacts with the firmware on the adapter to quiesce the hardware. So I don't think we need to wait here for the posted writes to flush... > > + > > + elem = tx_ring->to_use; > > + elem->skb = skb; > > + elem->mapaddr = mapaddr; > > + elem->maplen = maplen; > > + > > + /* Tell HW to xmit */ > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > or here > No need here. This logic submits the packet for transmission. We don't assume it is transmitted until we (after a completion interrupt usually) read back the HTXD entry and see the TXP_HTXD_DONE bit set (see c2_tx_interrupt()). Steve. From jlentini at netapp.com Fri Jun 23 06:48:59 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 23 Jun 2006 09:48:59 -0400 (EDT) Subject: [openib-general] NFS/RDMA In-Reply-To: <20060623100640.2119dc38@matrix.tuxianer.homelinux.net> References: <20060623100640.2119dc38@matrix.tuxianer.homelinux.net> Message-ID: Replies below: On Fri, 23 Jun 2006, Torsten Boob wrote: > Hello, > > I have problems setting up an nfsrdma connection (nfsrdma update 5). > > ./nfsrdmamount -o rdma=192.168.99.1 192.168.99.1:/nfs /mnt/nfs > > has the following output on the client. The first message is from the client code...
> RPC: xprt_setup_rdma: 192.168.99.1:2049 but these messages are from the server code... > unexpected event received for QP=ffff81013eaaea00, event =4 > svc_rdma_recvfrom: transport ffff81013f54b600 is closing > svc_rdma_recvfrom: transport ffff81013f54b600 is closing > svc_rdma_put: Destroying transport ffff81013f54b600, cm_id=ffff81013f54b400, sk_flags=54, sk_inuse=0 and these are from the client code. > nfs: RPC call returned error 103 > nfsmount: Software caused connection abort Are you trying to mount to and from the same host? > Using normal nfs with > > mount -t nfs 192.168.99.1:/nfs /mnt/nfs > > results in > > mount: 192.168.99.1:/nfs: can't read superblock It looks like you have a configuration error unrelated to RDMA. If you're looking for documentation on setting up NFS, I'd recommend this: http://nfs.sourceforge.net/nfs-howto/index.html > Same results with openib svn {20060516} > Tested with Debian Sarge and Etch. > > Any ideas ? We've seen the unexpected event received for QP=ffff81013eaaea00, event =4 message once before. There is a timing issue between the NFS-RDMA server and the RDMA stack that we've only seen on IA64 systems to date. What type of hardware are you using? We are working on a fix for this now. From arjan at infradead.org Fri Jun 23 06:48:52 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Fri, 23 Jun 2006 15:48:52 +0200 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. 
In-Reply-To: <1151070290.7808.33.camel@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> Message-ID: <1151070532.3204.10.camel@laptopd505.fenrus.org> > > > + /* Tell HW to xmit */ > > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > > > or here > > > > No need here. This logic submits the packet for transmission. We don't > assume it is transmitted until we (after a completion interrupt usually) > read back the HTXD entry and see the TXP_HTXD_DONE bit set (see > c2_tx_interrupt()). ... but will that interrupt happen at all if these 3 writes never hit the hardware? From swise at opengridcomputing.com Fri Jun 23 06:56:45 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 08:56:45 -0500 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1151070532.3204.10.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> Message-ID: <1151071005.7808.39.camel@stevo-desktop> On Fri, 2006-06-23 at 15:48 +0200, Arjan van de Ven wrote: > > > > + /* Tell HW to xmit */ > > > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > > > > > or here > > > > > > > No need here. This logic submits the packet for transmission. 
We don't > > assume it is transmitted until we (after a completion interrupt usually) > > read back the HTXD entry and see the TXP_HTXD_DONE bit set (see > > c2_tx_interrupt()). > > ... but will that interrupt happen at all if these 3 writes never hit > the hardware? > I thought the posted write WILL eventually get to adapter memory. Not stall forever cached in a bridge. I'm wrong? My point is that for a given HTXD entry, we write it to post a packet for transmission, then only free the packet memory and reuse this entry _after_ reading the HTXD and seeing the DONE bit set. So I still don't see a problem. But I've been wrong before ;-) Steve. From swise at opengridcomputing.com Fri Jun 23 07:02:24 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:02:24 -0500 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early In-Reply-To: <449BE393.3020308@in.ibm.com> References: <20060622192259.GA24588@harry-potter.ibm.com> <1151007847.3040.51.camel@stevo-desktop> <449BE393.3020308@in.ibm.com> Message-ID: <1151071344.7808.42.camel@stevo-desktop> On Fri, 2006-06-23 at 18:20 +0530, Pradipta Kumar Banerjee wrote: > Steve Wise wrote: > > The goal of adding the return codes was so that the rping program could > > exit with a status indicating success or failure. Every rping run > > results in a DISCONNECT event, so I don't think we want to treat that > > case as an error. > DISCONNECT event will be generated when the connection is closed or in case of > some error (like CCAE_LLP_CONNECTION_LOST, CCAE_BAD_CLOSE in case of Ammasso > driver etc). > > You'll also get the DISCONNECT event when one side finishes the rping loops and does rdma_disconnect(). So receiving that event isn't necessarily an error... > > > > Also, can you explain why this fixes Amith's problem, which sounded like > > a process was hanging?
> > > On debugging I found that the main thread was blocked in ibv_destroy_cq(), > cm_thread was blocked in rdma_get_cm_event->write() and cq_thread was blocked in > ibv_get_cq_event->read > Taking the return value of the DISCONNECT event into consideration forcefully > killed the process. > On delving deeper into this problem, I think that there is more to this rping > hang. Let me work on this further. > I think rping needs some coordination on these threads and when they should be killed. > On a related note - I noticed another rping hang in the following case > - Start the rping as a client without first starting an rping server > - If you are lucky the first run itself will result in the 'lt-rping' process in > 'D' state. If not repeating the procedure will result in the hang. > > This is the o/p. > > cq completion failed status 5 > wait for CONNECTED state 10 > connect error -1 > > Thanks, > Pradipta. > > > > > > > Thanks, > > > > Steve. > > > > > > > > On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: > >> Hi, > >> Please ignore the earlier mail. There were some problems with the mailer. > >> Here is the new one. > >> > >> This patch fixes the problem as reported by Amith. > >> > >> Signed-off-by: Pradipta Kumar Banerjee > >> > >> --- > >> > >> Index: rping.c > >> ============================================================================= > >> --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 > >> +++ rping.c 2006-06-23 00:39:06.000000000 +0530 > >> @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc > >> case RDMA_CM_EVENT_DISCONNECTED: > >> fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? 
"server" : "client"); > >> sem_post(&cb->sem); > >> + ret = -1; > >> break; > >> > >> case RDMA_CM_EVENT_DEVICE_REMOVAL: > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From arjan at infradead.org Fri Jun 23 07:04:31 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Fri, 23 Jun 2006 16:04:31 +0200 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1151071005.7808.39.camel@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> <1151071005.7808.39.camel@stevo-desktop> Message-ID: <1151071471.3204.12.camel@laptopd505.fenrus.org> On Fri, 2006-06-23 at 08:56 -0500, Steve Wise wrote: > On Fri, 2006-06-23 at 15:48 +0200, Arjan van de Ven wrote: > > > > > + /* Tell HW to xmit */ > > > > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > > > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > > > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > > > > > > > or here > > > > > > > > > > No need here. This logic submits the packet for transmission. We don't > > > assume it is transmitted until we (after a completion interrupt usually) > > > read back the HTXD entry and see the TXP_HTXD_DONE bit set (see > > > c2_tx_interrupt()). > > > > ... but will that interrupt happen at all if these 3 writes never hit > > the hardware? > > > > I thought the posted write WILL eventually get to adapter memory. Not > stall forever cached in a bridge. I'm wrong? I'm not sure there is a theoretical upper bound.... 
(and if it's several msec per bridge, then you have a lot of latency anyway) From swise at opengridcomputing.com Fri Jun 23 07:29:24 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:24 -0500 Subject: [openib-general] [PATCH v2 00/14][RFC] Chelsio CXGB3 iWARP Driver Message-ID: <20060623142924.32410.7623.stgit@stevo-desktop> This patchset implements the iWARP provider driver for the Chelsio CXGB3 RNIC. It is dependent on the "iWARP Core Support" patch set. This is round 2 of the openib-general review. I'm requesting one more review from the rdma experts before widening the audience to the linux community in general. I believe I've addressed all the round 1 review comments. The entire subsystem is laid out as three modules: iw_cxgb3.ko - The main OpenIB Provider module. It depends on the other two modules. cxgb3c.ko - The cxgb3 core module that allows TCP connections to be manipulated. It depends on the LLD/NETDEV module. cxgb3.ko - the cxgb3 LLD/NETDEV driver with offload support. This driver is currently checked in to gen2/branches/iwarp/src/linux-kernel/net/cxgb3. Chelsio will eventually submit this driver to the kernel netdev group for inclusion into kernel.org. For now, I've placed it in the openib tree so the entire subsystem can be used. I'm only including patches for the .h files that define the interface used by the other modules. This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.19 branch.
The patchset consists of these patches: t3_provider - OpenIB Provider Driver t3_cq_qp - QP and CQ t3_mem - MR and MW t3_ae - Async and CQ events t3_cm - Connection Manager t3_rcore_dbg - RDMA Core Debug t3_rcore_hal - RDMA Core HAL t3_rcore_resource - RDMA Core Resource Manager t3_rcore_types - RDMA Core Types t3_core_reg - T3 Core Registration t3_core_demux - T3 Core Demuxer t3_core_l2t - T3 L2 Services t3_cfg - Makefiles t3_lld_ulp - LLD Interface Since round 1 review, the following has been done: - incorporated all the review feedback (thanks to all who reviewed it) - sparse clean - incorporated some of the ammasso review feedback (like use pr_debug()) - interoperability testing against Ammasso - iWARP conformance testing - NFSoRDMA testing (connectathon basic and general tests) - rping/krping testing - dapltest 1-6 testing - performance characterization Signed-off-by: Steve Wise From swise at opengridcomputing.com Fri Jun 23 07:29:29 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:29 -0500 Subject: [openib-general] [PATCH v2 01/14] CXGB3 OpenIB Driver In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623142929.32410.12997.stgit@stevo-desktop> This patch contains the cxgb3 device discovery and openib driver methods. The T3 openib driver discovers each T3 adapter by registering as a client with the cxgb3 "core" module, which will then call the provider module for each T3 adapter present. This is similar to the ib_client mechanism in openib.
--- drivers/infiniband/hw/cxgb3/iwch.c | 220 +++++ drivers/infiniband/hw/cxgb3/iwch.h | 130 +++ drivers/infiniband/hw/cxgb3/iwch_provider.c | 1097 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_provider.h | 358 +++++++++ 4 files changed, 1805 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c new file mode 100644 index 0000000..20d9f1e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -0,0 +1,220 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include + +#include +#include "iwch_provider.h" +#include "iwch_user.h" +#include "iwch.h" +#include "iwch_cm.h" + +MODULE_AUTHOR("Boyd Faulkner , " + "Steve Wise pdid2hlp = vzmalloc(sizeof(void*) * T3_MAX_NUM_PD); + if (!rnicp->pdid2hlp) + goto pdid_err; + rnicp->cqid2hlp = vzmalloc(sizeof(void*) * T3_MAX_NUM_CQ); + if (!rnicp->cqid2hlp) + goto cqid_err; + rnicp->qpid2hlp = vzmalloc(sizeof(void*) * T3_MAX_NUM_QP); + if (!rnicp->qpid2hlp) + goto qpid_err; + rnicp->stag2hlp = vzmalloc(sizeof(void*) * T3_MAX_NUM_STAG); + if (!rnicp->stag2hlp) + goto stag_err; + + spin_lock_init(&rnicp->lock); + + /* + * XXX get these from the hw! + */ + rnicp->attr.vendor_id = 0x168; + rnicp->attr.vendor_part_id = 7; + rnicp->attr.hw_version = 3; + rnicp->attr.addl_vendor_info = NULL; + rnicp->attr.addl_vendor_info_length = 0; + rnicp->attr.max_qps = T3_MAX_NUM_QP - 32; + rnicp->attr.max_wrs = (1UL << 24) - 1; + rnicp->attr.max_sge_per_wr = T3_MAX_SGE; + rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE; + rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1; + rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1; + rnicp->attr.max_cq_event_handlers = T3_MAX_NUM_CQ - 1; + rnicp->attr.max_mem_regs = T3_MAX_NUM_STAG; + rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE; + rnicp->attr.max_pds = T3_MAX_NUM_PD - 1; + rnicp->attr.mem_pgsizes_bitmask = 0x7FFF; /* 4KB-128MB */ + rnicp->attr.can_resize_wq = 0; + rnicp->attr.max_rdma_reads_per_qp = 16; + rnicp->attr.max_rdma_read_resources = + rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps; + rnicp->attr.max_rdma_read_qp_depth = 16; /* IRD */ + rnicp->attr.max_rdma_read_depth = + rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps; + rnicp->attr.rq_overflow_handled = 0; + rnicp->attr.can_modify_ird = 0; + rnicp->attr.can_modify_ord = 0; + rnicp->attr.max_mem_windows = T3_MAX_NUM_STAG - 1;/* Shared with MR */ + rnicp->attr.stag0_value = 1; + rnicp->attr.zbva_support = 1; + rnicp->attr.local_invalidate_fence = 1; + 
rnicp->attr.cq_overflow_detection = 1; + return 0; + +stag_err: + vfree(rnicp->qpid2hlp); +qpid_err: + vfree(rnicp->cqid2hlp); +cqid_err: + vfree(rnicp->pdid2hlp); +pdid_err: + return -ENOMEM; +} + +static void open_rnic_toe(struct t3cdev *tdev) +{ + struct iwch_dev *rnicp; + + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp)); + if (!rnicp) { + printk(KERN_ERR PFX "cannot allocate ib device!\n"); + return; + } + rnicp->rdev.ulp = rnicp; + rnicp->rdev.t3cdev_p = tdev; + + if (cxio_rdev_open(&rnicp->rdev)) { + printk(KERN_ERR PFX "Unable to register with RDMA Core\n"); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + if (open_rnic_init(rnicp)) { + printk(KERN_ERR PFX "Unable to initialize iwch_dev!\n"); + cxio_rdev_close(&rnicp->rdev); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + mutex_lock(&dev_mutex); + list_add_tail(&rnicp->entry, &dev_list); + mutex_unlock(&dev_mutex); + + if (iwch_register_device(rnicp)) { + printk(KERN_ERR PFX "Unable to register with openib\n"); + close_rnic_toe(tdev); + } + return; +} + +static void close_rnic_toe(struct t3cdev *tdev) +{ + struct iwch_dev *dev, *tmp; + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + mutex_lock(&dev_mutex); + list_for_each_entry_safe(dev, tmp, &dev_list, entry) { + if (dev->rdev.t3cdev_p == tdev) { + list_del(&dev->entry); + iwch_unregister_device(dev); + cxio_rdev_close(&dev->rdev); + vfree(dev->pdid2hlp); + vfree(dev->cqid2hlp); + vfree(dev->stag2hlp); + vfree(dev->qpid2hlp); + ib_dealloc_device(&dev->ibdev); + break; + } + } + mutex_unlock(&dev_mutex); +} + +extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb); + +static int __init iwch_init_module(void) +{ + int err; + + err = cxio_hal_init(); + if (err) + return err; + err = iwch_cm_init(); + if (err) + return err; + cxio_register_ev_cb(iwch_ev_dispatch); + t3c_register_client(&t3c_client); + return 0; +} + +static void __exit iwch_exit_module(void) +{ + 
t3c_unregister_client(&t3c_client); + cxio_unregister_ev_cb(iwch_ev_dispatch); + iwch_cm_term(); + cxio_hal_exit(); +} + +module_init(iwch_init_module); +module_exit(iwch_exit_module); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h new file mode 100644 index 0000000..bf466a6 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -0,0 +1,130 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_H__ +#define __IWCH_H__ + +#include + +#include +#include + +#include + +#include "cxio_hal.h" +#include "common.h" +#include "t3c.h" + +struct iwch_pd; +struct iwch_cq; +struct iwch_qp; +struct iwch_mr; + +struct iwch_rnic_attributes { + u32 vendor_id; + u32 vendor_part_id; + u32 hw_version; + char *addl_vendor_info; + u32 addl_vendor_info_length; + u32 max_qps; + u32 max_wrs; /* Max for any SQ/RQ */ + u32 max_sge_per_wr; + u32 max_sge_per_rdma_write_wr; /* for RDMA Write WR */ + u32 max_cqs; + u32 max_cqes_per_cq; + u32 max_cq_event_handlers; + u32 max_mem_regs; + u32 max_phys_buf_entries; /* for phys buf list */ + u32 max_pds; + + /* + * The memory page sizes supported by this RNIC. + * Bit position i in bitmap indicates page of + * size (4k)^i. Phys block list mode unsupported. + */ + u32 mem_pgsizes_bitmask; + u8 can_resize_wq; + + /* + * The maximum number of RDMA Reads that can be outstanding + * per QP with this RNIC as the target. + */ + u32 max_rdma_reads_per_qp; + + /* + * The maximum number of resources used for RDMA Reads + * by this RNIC with this RNIC as the target. + */ + u32 max_rdma_read_resources; + + /* + * The max depth per QP for initiation of RDMA Read + * by this RNIC. 
+ */ + u32 max_rdma_read_qp_depth; + + /* + * The maximum depth for initiation of RDMA Read + * operations by this RNIC on all QPs + */ + u32 max_rdma_read_depth; + u8 rq_overflow_handled; + u32 can_modify_ird; + u32 can_modify_ord; + u32 max_mem_windows; + u32 stag0_value; + u8 zbva_support; + u8 local_invalidate_fence; + u32 cq_overflow_detection; +}; + +struct iwch_dev { + struct ib_device ibdev; + struct cxio_rdev rdev; + u32 device_cap_flags; + struct iwch_rnic_attributes attr; + struct iwch_pd **pdid2hlp; + struct iwch_cq **cqid2hlp; + struct iwch_qp **qpid2hlp; + struct iwch_mr **stag2hlp; + spinlock_t lock; + struct list_head entry; +}; + +static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct iwch_dev, ibdev); +} + +extern struct t3c_client t3c_client; +extern t3c_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; +#endif diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c new file mode 100644 index 0000000..b38cd2e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -0,0 +1,1097 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" +#include "iwch_user.h" + +static int iwch_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static struct ib_ah *iwch_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ERR_PTR(-ENOSYS); +} + +static int iwch_ah_destroy(struct ib_ah *ah) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int iwch_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct 
ib_mad *out_mad) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int iwch_dealloc_ucontext(struct ib_ucontext *context) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + kfree(to_iwch_ucontext(context)); + return 0; +} + +static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct iwch_ucontext *context; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) { + return ERR_PTR(-ENOMEM); + } + return &context->ibucontext; +} + +static int iwch_destroy_cq(struct ib_cq *ib_cq) +{ + struct iwch_cq *chp; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + chp = to_iwch_cq(ib_cq); + + spin_lock_irq(&chp->rhp->lock); + chp->rhp->cqid2hlp[chp->cqh] = NULL; + spin_unlock_irq(&chp->rhp->lock); + + atomic_dec(&chp->refcnt); + wait_event(chp->wait, !atomic_read(&chp->refcnt)); + + cxio_destroy_cq(&chp->rhp->rdev, &chp->cq); + kfree(chp); + return 0; +} + +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + struct iwch_create_cq_resp uresp; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + rhp = to_iwch_dev(ibdev); + chp = kzalloc(sizeof(*chp), GFP_KERNEL); + if (!chp) + return ERR_PTR(-ENOMEM); + + /* + * Attempt to make the CQ big enough to handle the T3 + * additional CQE possibilities: + * TERMINATE, + * 2 CQES for each RDMA READ operation, + * incoming RDMA READ REQUEST FAILUREs + * We can make the CQ big enough to handle these for + * a single QP. But problems can arise if the CQ is shared... 
+ */ + entries = roundup_pow_of_two(entries + + 8 + /* max ORD */ + 8 + /* max IRRQ */ + 1 /* TERM */ + ); + chp->cq.size_log2 = long_log2(entries); + + if (cxio_create_cq(&rhp->rdev, &chp->cq)) { + kfree(chp); + return ERR_PTR(-ENOMEM); + } + chp->rhp = rhp; + chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1; + spin_lock_init(&chp->lock); + atomic_set(&chp->refcnt, 1); + init_waitqueue_head(&chp->wait); + chp->cqh = chp->cq.cqid; + + spin_lock_irq(&rhp->lock); + rhp->cqid2hlp[chp->cq.cqid] = chp; + spin_unlock_irq(&rhp->lock); + + if (context) { + uresp.cqid = chp->cq.cqid; + uresp.entries = chp->ibcq.cqe; + uresp.physaddr = chp->cq.dma_addr; + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-EFAULT); + } + } + PDBG("created cq_hdl(%0x) chp=%p size=0x%0x, dma_addr=0x%0llx\n", + chp->cq.cqid, chp, (1 << chp->cq.size_log2), + (u64)chp->cq.dma_addr); + return &chp->ibcq; +} + +static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) +{ + struct iwch_cq *chp = to_iwch_cq(cq); + struct t3_cq oldcq, newcq; + int ret; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + /* We don't downsize... 
*/ + if (cqe <= cq->cqe) + return 0; + + /* create new t3_cq with new size */ + cqe = roundup_pow_of_two(cqe+1); + newcq.size_log2 = long_log2(cqe); + + /* Don't allow resize to less than the current wce count */ + if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) { + return -ENOMEM; + } + + /* Quiesce all QPs using this CQ */ + ret = iwch_quiesce_qps(chp); + if (ret) { + return ret; + } + + /* XXX limit max based on rdev */ + ret = cxio_create_cq(&chp->rhp->rdev, &newcq); + if (ret) { + kfree(chp); + return ret; + } + + /* copy CQEs */ + memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * + sizeof(struct t3_cqe)); + + /* old iwch_qp gets new t3_cq but keeps old cqid */ + oldcq = chp->cq; + chp->cq = newcq; + chp->cq.cqid = oldcq.cqid; + + /* resize new t3_cq to update the HW context */ + ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq); + if (ret) { + chp->cq = oldcq; + return ret; + } + chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1; + + /* destroy old t3_cq */ + oldcq.cqid = newcq.cqid; + ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq); + if (ret) { + printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", + __FUNCTION__, ret); + } + + /* add user hooks here */ + + /* resume qps */ + ret = iwch_resume_qps(chp); + return ret; +} + +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + enum t3_cq_opcode cq_op; + int err; + int flags; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + if (notify == IB_CQ_SOLICITED) + cq_op = CQ_ARM_SE; + else + cq_op = CQ_ARM_AN; + spin_lock_irqsave(&chp->lock, flags); + err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); + spin_unlock_irqrestore(&chp->lock, flags); + if (err) + printk(KERN_ERR MOD "Error %d rearming CQ %llu\n", err, + chp->cqh); + return err; +} + +static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + int len = vma->vm_end - vma->vm_start; + + PDBG("%s:%s:%u\n", __FILE__,
__FUNCTION__, __LINE__); + vma->vm_flags |= VM_RESERVED; + if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot)) + return -EAGAIN; + return 0; +} + +static int iwch_deallocate_pd(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + u64 pd_h; + + php = to_iwch_pd(pd); + rhp = php->rhp; + pd_h = (u64) php->pdid; + PDBG("iwch_deallocate_pd entry: hdl(%0llx)\n", pd_h); + rhp->pdid2hlp[pd_h] = NULL; + cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid); + kfree(php); + return 0; +} + +static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_pd *php; + u32 pdid; + struct iwch_dev *rhp; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + rhp = (struct iwch_dev *) ibdev; + pdid = cxio_hal_get_pdid(rhp->rdev.rscp); + if (!pdid) + return ERR_PTR(-EINVAL); + php = kzalloc(sizeof(*php), GFP_KERNEL); + if (!php) { + cxio_hal_put_pdid(rhp->rdev.rscp, pdid); + return ERR_PTR(-ENOMEM); + } + php->pdid = pdid; + php->rhp = rhp; + rhp->pdid2hlp[pdid] = php; + if (context) { + if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) { + iwch_deallocate_pd(&php->ibpd); + return ERR_PTR(-EFAULT); + } + } + PDBG("iwch_allocate_pd: pdid(0x%0x) hlp(0x%p)\n", pdid, php); + return &php->ibpd; +} + +static int iwch_dereg_mr(struct ib_mr *ib_mr) +{ + struct iwch_dev *rhp; + struct iwch_mr *mhp; + struct iwch_pd *php; + u64 mem_h; + + /* There can be no memory windows */ + if (atomic_read(&ib_mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(ib_mr); + rhp = mhp->rhp; + mem_h = mhp->attr.stag >> 8; + /* TBD: check dereg_mem return status: regreg mem with mw bound to it */ + cxio_dereg_mem(&rhp->rdev, mhp->attr.stag); + rhp->stag2hlp[mem_h] = NULL; + php = get_php(rhp, mhp->attr.pdid); + if (mhp->kva) + kfree((void *) (unsigned long) mhp->kva); + kfree(mhp); + PDBG("iwch_dereg_mem: mem_h(0x%0llx) hlp(%p)\n", mem_h, mhp); + return 0; +} + +static struct ib_mr 
*iwch_register_phys_mem(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + u64 *page_list; + int shift; + u64 total_size; + int npages; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + int ret; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + php = to_iwch_pd(pd); + rhp = php->rhp; + + acc = iwch_convert_access(acc); + + + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start, + &total_size, &npages, &shift, &page_list); + if (ret) + goto err; + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + + /* XXX TPT perms are backwards from BIND WR perms! 
*/ + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + ret = iwch_register_mem(rhp, php, mhp, shift, page_list); + kfree(page_list); + if (ret) { + goto err; + } + return &mhp->ibmr; +err: + kfree(mhp); + return ERR_PTR(ret); + +} + +static int iwch_reregister_phys_mem(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, u64 * iova_start) +{ + + struct iwch_mr mh, *mhp; + struct iwch_pd *php; + struct iwch_dev *rhp; + int new_acc; + u64 *page_list = NULL; + int shift = 0; + u64 total_size; + int npages; + int ret; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + /* There can be no memory windows */ + if (atomic_read(&mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(mr); + rhp = mhp->rhp; + php = to_iwch_pd(mr->pd); + + /* make sure we are on the same adapter */ + if (rhp != php->rhp) + return -EINVAL; + + new_acc = mhp->attr.perms; + + memcpy(&mh, mhp, sizeof *mhp); + + printk("%s: %d stag = 0x%x\n",__FUNCTION__, __LINE__,mh.attr.stag); + if (mr_rereg_mask & IB_MR_REREG_PD) + php = to_iwch_pd(pd); + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mh.attr.perms = iwch_convert_access(acc); + if (mr_rereg_mask & IB_MR_REREG_TRANS) + ret = build_phys_page_list(buffer_list, num_phys_buf, + iova_start, + &total_size, &npages, + &shift, &page_list); + + ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list); + kfree(page_list); + if (ret) { + return ret; + } + if (mr_rereg_mask & IB_MR_REREG_PD) + mhp->attr.pdid = php->pdid; + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mhp->attr.perms = acc; + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + mhp->attr.zbva = 0; + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) 
total_size; + mhp->attr.pbl_size = npages; + } + + return 0; +} + + +struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + u64 *pages; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + shift = ffs(region->page_size) - 1; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + n = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) + n += chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + acc = iwch_convert_access(acc); + + i = n = 0; + + list_for_each_entry(chunk, ®ion->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = cpu_to_be64(sg_dma_address( + &chunk->page_list[j]) + + region->page_size * k); + } + } + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + mhp->attr.va_fbo = region->virt_base; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) region->length; + mhp->attr.pbl_size = i; + err = iwch_register_mem(rhp, php, mhp, shift, pages); + kfree(pages); + if (err) + goto err; + return &mhp->ibmr; + +err: + kfree(mhp); + return ERR_PTR(err); +} + +struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva; + struct ib_mr *ibmr; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + /* + * T3 only supports 32 bits of size. 
+ */ + bl.size = 0xffffffff; + bl.addr = 0; + kva = 0; + ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva); + return ibmr; +} + +struct ib_mw *iwch_alloc_mw(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mw *mhp; + u64 win_h; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_MW; + mhp->attr.stag = stag; + win_h = (stag) >> 8; + rhp->stag2hlp[win_h] = (struct iwch_mr *) mhp; + PDBG("iwch_allocate_window: win_h(0x%0llx) mhp(%p) stag(0x%x)\n", + win_h, mhp, stag); + return &(mhp->ibmw); +} + +int iwch_dealloc_mw(struct ib_mw *mw) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_pd *php; + u64 win_h; + + mhp = to_iwch_mw(mw); + rhp = mhp->rhp; + win_h = (mw->rkey) >> 8; + php = get_php(rhp, mhp->attr.pdid); + cxio_deallocate_window(&rhp->rdev, mhp->attr.stag); + rhp->stag2hlp[win_h] = NULL; + kfree(mhp); + PDBG("iwch_deallocate_window: win_h(0x%0llx) hlp(%p)\n", win_h, mhp); + return 0; +} + +static int iwch_destroy_qp(struct ib_qp *ib_qp) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_qp_attributes attrs; + + qhp = to_iwch_qp(ib_qp); + rhp = qhp->rhp; + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0); + } + wait_event(qhp->wait, !qhp->ep); + + spin_lock_irq(&rhp->lock); + rhp->qpid2hlp[qhp->wq.qpid] = NULL; + spin_unlock_irq(&rhp->lock); + + atomic_dec(&qhp->refcnt); + wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); + + cxio_destroy_qp(&rhp->rdev, &qhp->wq); + + PDBG("iwch_destroy_qp: qp_h(0x%0x) qhp(%p)\n", qhp->wq.qpid, qhp); + kfree(qhp); + return 0; +} + +static struct ib_qp *iwch_create_qp(struct ib_pd *pd, 
+ struct ib_qp_init_attr *attrs, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_pd *php; + struct iwch_cq *schp; + struct iwch_cq *rchp; + struct iwch_create_qp_resp uresp; + int wqsize, sqsize, rqsize; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + if (attrs->qp_type != IB_QPT_RC) + return ERR_PTR(-EINVAL); + php = to_iwch_pd(pd); + rhp = php->rhp; + schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cqh); + rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cqh); + if (!schp || !rchp) + return ERR_PTR(-EINVAL); + + /* The RQT size must be # of entries + 1 rounded up to a power of two */ + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr); + if (rqsize == attrs->cap.max_recv_wr) + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1); + + /* T3 doesn't support RQT depth < 16 */ + if (rqsize < 16) + rqsize = 16; + + if (rqsize >= T3_MAX_RQ_SIZE) + return ERR_PTR(-EINVAL); + + /* + * XXX the SQ and total WQ sizes don't need to be + * a power of two. However, all the code assumes + * they are. EG: Q_FREECNT() and friends. 
+ */ + sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); + wqsize = roundup_pow_of_two(rqsize + sqsize); + PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, + wqsize, sqsize, rqsize); + qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); + if (!qhp) + return ERR_PTR(-ENOMEM); + qhp->wq.size_log2 = long_log2(wqsize); + qhp->wq.rq_size_log2 = long_log2(rqsize); + qhp->wq.sq_size_log2 = long_log2(sqsize); + if (cxio_create_qp(&rhp->rdev, 1, &qhp->wq)) { + kfree(qhp); + return ERR_PTR(-ENOMEM); + } + attrs->cap.max_recv_wr = rqsize - 1; + attrs->cap.max_send_wr = sqsize; + qhp->rhp = rhp; + qhp->attr.pd = php->pdid; + qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cqh; + qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cqh; + qhp->attr.sq_num_entries = attrs->cap.max_send_wr; + qhp->attr.rq_num_entries = attrs->cap.max_recv_wr; + qhp->attr.sq_max_sges = attrs->cap.max_send_sge; + qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge; + qhp->attr.rq_max_sges = attrs->cap.max_recv_sge; + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.next_state = IWCH_QP_STATE_IDLE; + + /* + * XXX - these don't get passed in from the openib user + * at create time. The CM sets them via a QP modify. + * Need to fix... 
I think the CM should + */ + qhp->attr.enable_rdma_read = 1; + qhp->attr.enable_rdma_write = 1; + qhp->attr.enable_bind = 1; + qhp->attr.max_ord = 1; + qhp->attr.max_ird = 1; + spin_lock_init(&qhp->lock); + init_waitqueue_head(&qhp->wait); + atomic_set(&qhp->refcnt, 1); + + spin_lock_irq(&rhp->lock); + rhp->qpid2hlp[qhp->wq.qpid] = qhp; + spin_unlock_irq(&rhp->lock); + + PDBG("iwch_create_qp: udata = 0x%p failed\n", udata); + if (udata) { + uresp.qpid = qhp->wq.qpid; + uresp.entries = qhp->attr.sq_num_entries + qhp->attr.rq_num_entries; + uresp.physaddr = qhp->wq.dma_addr; + uresp.physsize = (u64) uresp.entries * sizeof(union t3_wr); + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + PDBG("iwch_create_qp: ib_copy_to_udata failed\n"); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-EFAULT); + } + } + qhp->ibqp.qp_num = qhp->wq.qpid; + init_timer(&(qhp->timer)); + PDBG("iwch_create_qp: sq_num_entries = %d, rq_num_entries = %d\n", + qhp->attr.sq_num_entries, qhp->attr.rq_num_entries); + PDBG("iwch_create_qp: qh_h(0x%0x) qhp=%p dma_addr=0x%llx size=%d\n", + (qhp->wq.qpid), qhp, (u64)qhp->wq.dma_addr, + (1 << qhp->wq.size_log2)); + return (&qhp->ibqp); +} + +static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + enum iwch_qp_attr_mask mask = 0; + struct iwch_qp_attributes attrs; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + /* iwarp does not support the RTR state */ + if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR)) + attr_mask &= ~IB_QP_STATE; + + /* Make sure we still have something left to do */ + if (!attr_mask) + return 0; + + memset(&attrs, 0, sizeof attrs); + qhp = to_iwch_qp(ibqp); + rhp = qhp->rhp; + + attrs.next_state = iwch_convert_state(attr->qp_state); + attrs.enable_rdma_read = (attr->qp_access_flags & + IB_ACCESS_REMOTE_READ) ? 1 : 0; + attrs.enable_rdma_write = (attr->qp_access_flags & + IB_ACCESS_REMOTE_WRITE) ? 
1 : 0; + attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0; + + + mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0; + mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? + (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0; + + return iwch_modify_qp(rhp, qhp, mask, &attrs, 0); +} + +void iwch_qp_add_ref(struct ib_qp *qp) +{ + atomic_inc(&(to_iwch_qp(qp)->refcnt)); +} + +void iwch_qp_rem_ref(struct ib_qp *qp) +{ + if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt))) + wake_up(&(to_iwch_qp(qp)->wait)); +} + +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn) +{ + return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn); +} + + +static int iwch_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + *pkey = 0; + return 0; +} + +static int iwch_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct iwch_dev *dev; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + PDBG("ibdev %p, port %d, index %d, gid %p\n", + ibdev, port, index, gid); + dev = to_iwch_dev(ibdev); + BUG_ON(port == 0 || port > 2); + PDBG("dev %p port %d netdev %p\n", dev, port, + dev->rdev.rnic_info.lldevs[port-1]); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), dev->rdev.rnic_info.lldevs[port-1]->dev_addr, 6); + return 0; +} + +static int iwch_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + + struct iwch_dev *dev; + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + dev = to_iwch_dev(ibdev); + memset(props, 0, sizeof *props); + memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + props->device_cap_flags = dev->device_cap_flags; +#if 0 + props->fw_ver = cht3dev->fw_ver; + props->hw_ver = dev->adapter->params->chip_version; +#endif + props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor; + props->vendor_part_id = 
(u32)dev->rdev.rnic_info.pdev->device; + props->max_mr_size = ~0ull; + props->max_qp = dev->attr.max_qps; + props->max_qp_wr = dev->attr.max_wrs; + props->max_sge = dev->attr.max_sge_per_wr; + props->max_sge_rd = 1; + props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp; + props->max_cq = dev->attr.max_cqs; + props->max_cqe = dev->attr.max_cqes_per_cq; + props->max_mr = dev->attr.max_mem_regs; + props->max_pd = dev->attr.max_pds; + props->local_ca_ack_delay = 0; + + return 0; +} + +static int iwch_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_SNMP_TUNNEL_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_DEVICE_MGMT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 2; + props->active_speed = 2; + props->max_msg_sz = -1; + + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.version); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.fw_version); +} + +static ssize_t show_hca(struct class_device 
*cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.driver); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s:%s:%u dev = 0x%p\n", __FILE__, __FUNCTION__, __LINE__, dev); + return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor, + dev->rdev.rnic_info.pdev->device); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *iwch_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +int iwch_register_device(struct iwch_dev *dev) +{ + int ret; + int i; + + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX); + PDBG(" dev name = %s\n", dev->ibdev.name); + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + dev->ibdev.owner = THIS_MODULE; + dev->device_cap_flags = + (IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + 
(1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + dev->ibdev.node_type = RDMA_NODE_RNIC; + memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC)); + dev->ibdev.phys_port_cnt = dev->rdev.rnic_info.nports; + dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.query_device = iwch_query_device; + dev->ibdev.query_port = iwch_query_port; + dev->ibdev.modify_port = iwch_modify_port; + dev->ibdev.query_pkey = iwch_query_pkey; + dev->ibdev.query_gid = iwch_query_gid; + dev->ibdev.alloc_ucontext = iwch_alloc_ucontext; + dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext; + dev->ibdev.mmap = iwch_mmap; + dev->ibdev.alloc_pd = iwch_allocate_pd; + dev->ibdev.dealloc_pd = iwch_deallocate_pd; + dev->ibdev.create_ah = iwch_ah_create; + dev->ibdev.destroy_ah = iwch_ah_destroy; + dev->ibdev.create_qp = iwch_create_qp; + dev->ibdev.modify_qp = iwch_ib_modify_qp; + dev->ibdev.destroy_qp = iwch_destroy_qp; + dev->ibdev.create_cq = iwch_create_cq; + dev->ibdev.destroy_cq = iwch_destroy_cq; + dev->ibdev.resize_cq = iwch_resize_cq; + dev->ibdev.poll_cq = iwch_poll_cq; + dev->ibdev.get_dma_mr = iwch_get_dma_mr; + dev->ibdev.reg_phys_mr = iwch_register_phys_mem; + dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem; + dev->ibdev.reg_user_mr = iwch_reg_user_mr; + dev->ibdev.dereg_mr = iwch_dereg_mr; + dev->ibdev.alloc_mw = iwch_alloc_mw; + dev->ibdev.bind_mw = iwch_bind_mw; + dev->ibdev.dealloc_mw = iwch_dealloc_mw; + + dev->ibdev.attach_mcast = iwch_multicast_attach; + dev->ibdev.detach_mcast = iwch_multicast_detach; + dev->ibdev.process_mad = iwch_process_mad; + + 
dev->ibdev.req_notify_cq = iwch_arm_cq; + dev->ibdev.post_send = iwch_post_send; + dev->ibdev.post_recv = iwch_post_receive; + + + dev->ibdev.iwcm = + (struct iw_cm_verbs *) kmalloc(sizeof(struct iw_cm_verbs), + GFP_KERNEL); + dev->ibdev.iwcm->connect = iwch_connect; + dev->ibdev.iwcm->accept = iwch_accept_cr; + dev->ibdev.iwcm->reject = iwch_reject_cr; + dev->ibdev.iwcm->create_listen = iwch_create_listen; + dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen; + dev->ibdev.iwcm->add_ref = iwch_qp_add_ref; + dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref; + dev->ibdev.iwcm->get_qp = iwch_get_qp; + + ret = ib_register_device(&dev->ibdev); + if (ret) + goto bail1; + + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + if (ret) { + goto bail2; + } + } + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + return 0; +bail2: + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + ib_unregister_device(&dev->ibdev); +bail1: + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ret; +} + +void iwch_unregister_device(struct iwch_dev *dev) +{ + int i; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) + class_device_remove_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + ib_unregister_device(&dev->ibdev); + return; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h new file mode 100644 index 0000000..3ceed66 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -0,0 +1,358 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_PROVIDER_H__ +#define __IWCH_PROVIDER_H__ + +#include +#include +#include +#include +#include "t3cdev.h" +#include "iwch.h" +#include "cxio_wr.h" +#include "cxio_hal.h" + + +struct iwch_pd { + struct ib_pd ibpd; + u32 pdid; + struct iwch_dev *rhp; +}; + +static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct iwch_pd, ibpd); +} + +struct tpt_attributes { + u32 stag; + u32 state:1; + u32 type:2; + u32 rsvd:1; + enum tpt_mem_perm perms; + u32 remote_invaliate_disable:1; + u32 zbva:1; + u32 mw_bind_enable:1; + u32 page_size:5; + + u32 pdid; + u32 qpid; + u32 pbl_addr; + u32 len; + u64 va_fbo; + u32 pbl_size; +}; + +struct iwch_mr { + struct ib_mr ibmr; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +typedef struct iwch_mw iwch_mw_handle; + +static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct iwch_mr, ibmr); +} + +struct iwch_mw { + struct ib_mw ibmw; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw) +{ + return container_of(ibmw, struct iwch_mw, ibmw); +} + +struct iwch_cq { + struct ib_cq ibcq; + struct iwch_dev *rhp; + u64 cqh; + struct t3_cq cq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; +}; + +static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct iwch_cq, ibcq); +} + +enum IWCH_QP_FLAGS { + QP_QUIESCED = 0x01 +}; + +struct iwch_mpa_attributes { + u8 recv_marker_enabled; + u8 xmit_marker_enabled; /* iWARP: enable inbound Read Resp. */ + u8 crc_enabled; + u8 version; /* 0 or 1 */ +}; + +struct iwch_qp_attributes { + u64 scq; + u64 rcq; + u32 sq_num_entries; + u32 rq_num_entries; + u32 sq_max_sges; + u32 sq_max_sges_rdma_write; + u32 rq_max_sges; + u32 state; + u8 enable_rdma_read; + u8 enable_rdma_write; /* enable inbound Read Resp. 
*/ + u8 enable_bind; + u8 enable_stag0_fastreg; /* Enable STAG0 + Fast-register */ + /* + * Next QP state. If specify the current state, only the + * QP attributes will be modified. + */ + u32 max_ord; + u32 max_ird; + u64 pd; /* IN */ + u32 next_state; + char terminate_buffer[52]; + u32 terminate_msg_len; + u8 is_terminate_local; + struct iwch_mpa_attributes mpa_attr; /* IN-OUT */ + struct iwch_ep *llp_stream_handle; + char *stream_msg_buf; /* Last stream msg. before Idle -> RTS */ + u32 stream_msg_buf_len; /* Only on Idle -> RTS */ +}; + +struct iwch_qp { + struct ib_qp ibqp; + struct iwch_dev *rhp; + struct iwch_ep *ep; + struct iwch_qp_attributes attr; + struct t3_wq wq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; + enum IWCH_QP_FLAGS flags; + struct timer_list timer; +}; + +static inline int qp_quiesced(struct iwch_qp *qhp) +{ + return (qhp->flags & QP_QUIESCED); +} + +static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct iwch_qp, ibqp); +} + +void iwch_qp_add_ref(struct ib_qp *qp); +void iwch_qp_rem_ref(struct ib_qp *qp); +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn); + +/* + * I'm anticipating we'll need something per user... 
+ */ +struct iwch_ucontext { + struct ib_ucontext ibucontext; +}; + +static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c) +{ + return container_of(c, struct iwch_ucontext, ibucontext); +} + +enum iwch_qp_attr_mask { + IWCH_QP_ATTR_NEXT_STATE = 1 << 0, + IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7, + IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8, + IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9, + IWCH_QP_ATTR_MAX_ORD = 1 << 11, + IWCH_QP_ATTR_MAX_IRD = 1 << 12, + IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22, + IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23, + IWCH_QP_ATTR_MPA_ATTR = 1 << 24, + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25, + IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_MAX_ORD | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_STREAM_MSG_BUFFER | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE) +}; + +int iwch_modify_qp(struct iwch_dev *rhp, + struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal); + +enum iwch_qp_state { + IWCH_QP_STATE_IDLE, + IWCH_QP_STATE_RTS, + IWCH_QP_STATE_ERROR, + IWCH_QP_STATE_TERMINATE, + IWCH_QP_STATE_CLOSING, + IWCH_QP_STATE_TOT +}; + +static inline int iwch_convert_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + case IB_QPS_INIT: + return IWCH_QP_STATE_IDLE; + case IB_QPS_RTS: + return IWCH_QP_STATE_RTS; + case IB_QPS_SQD: + return IWCH_QP_STATE_CLOSING; + case IB_QPS_SQE: + return IWCH_QP_STATE_TERMINATE; + case IB_QPS_ERR: + return IWCH_QP_STATE_ERROR; + default: + return -1; + } +} + +enum iwch_mem_perms { + IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0, + IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1, + IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2, + IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3, + IWCH_MEM_ACCESS_ATOMICS = 1 << 4, + IWCH_MEM_ACCESS_BINDING = 1 << 5, + IWCH_MEM_ACCESS_LOCAL = + (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE), + 
IWCH_MEM_ACCESS_REMOTE = + (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ) + /* cannot go beyond 1 << 31 */ +} __attribute__ ((packed)); + +static inline u32 iwch_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0) + | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) | + (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) | + IWCH_MEM_ACCESS_LOCAL_READ; +} + +enum iwch_stag_state { + IWCH_STAG_STATE_VALID, + IWCH_STAG_STATE_INVALID +}; + +enum iwch_qp_query_flags { + IWCH_QP_QUERY_CONTEXT_NONE = 0x0, /* No ctx; Only attrs */ + IWCH_QP_QUERY_CONTEXT_GET = 0x1, /* Get ctx + attrs */ + IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2, /* Not Supported */ + + /* + * Quiesce QP context; Consumer + * will NOT replay outstanding WR + */ + IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4, + IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8, + IWCH_QP_QUERY_TEST_USERWRITE = 0x32 /* Test special */ +}; + +static inline struct iwch_pd *get_php(struct iwch_dev *rhp, u64 pd_h) +{ + if (pd_h >= T3_MAX_NUM_PD) + return NULL; + return rhp->pdid2hlp[pd_h]; +} + +static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u64 cq_h) +{ + if (cq_h >= T3_MAX_NUM_CQ) + return NULL; + return rhp->cqid2hlp[cq_h]; +} + +static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u64 qp_h) +{ + if (qp_h >= T3_MAX_NUM_QP) + return NULL; + return rhp->qpid2hlp[qp_h]; +} + +static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, + u64 mem_h) +{ + if (mem_h >= T3_MAX_NUM_STAG) + return NULL; + return rhp->stag2hlp[mem_h]; +} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc); +int 
iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg);
+int iwch_register_device(struct iwch_dev *dev);
+void iwch_unregister_device(struct iwch_dev *dev);
+int iwch_quiesce_qps(struct iwch_cq *chp);
+int iwch_resume_qps(struct iwch_cq *chp);
+void stop_read_rep_timer(struct iwch_qp *qhp);
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+                      struct iwch_mr *mhp,
+                      int shift,
+                      u64 *page_list);
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+                        struct iwch_mr *mhp,
+                        int shift,
+                        u64 *page_list);
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+                         int num_phys_buf,
+                         u64 *iova_start,
+                         u64 *total_size,
+                         int *npages,
+                         int *shift,
+                         u64 **page_list);
+
+#define MOD "iw_cxgb3:"
+#define PDBG(fmt, args...) pr_debug(MOD fmt, ##args)
+
+#define IWCH_NODE_DESC "cxgb3 Chelsio Communications"
+
+#endif

From swise at opengridcomputing.com Fri Jun 23 07:29:34 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 23 Jun 2006 09:29:34 -0500
Subject: [openib-general] [PATCH v2 02/14] CXGB3 QP and CQ.
In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop>
References: <20060623142924.32410.7623.stgit@stevo-desktop>
Message-ID: <20060623142934.32410.33916.stgit@stevo-desktop>

This patch contains qp and cq manipulation code.

ISSUE: CQs can overflow with the T3A hardware. There is no way around
this for now. The next spin of the T3 hardware will resolve this issue,
and the driver will be updated.

ISSUE: QP termination/WR flushing is not handled correctly; firmware
support is needed to finalize this.
--- drivers/infiniband/hw/cxgb3/iwch_cq.c | 228 +++++++ drivers/infiniband/hw/cxgb3/iwch_qp.c | 1006 +++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_user.h | 62 ++ 3 files changed, 1296 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c new file mode 100644 index 0000000..303b7f2 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c @@ -0,0 +1,228 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include "iwch_provider.h" +#include "iwch.h" + +/* + * Get one cq entry from cxio and map it to openib. + * + * Returns: + * 0 EMPTY; + * 1 cqe returned + * -EAGAIN caller must try again + * any other -errno fatal error + */ +int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp, + struct ib_wc *wc) +{ + struct iwch_qp *qhp = NULL; + struct t3_cqe cqe, *rd_cqe; + struct t3_wq *wq; + u32 credit = 0; + u8 cqe_flushed; + u64 cookie; + int ret = 1; + + rd_cqe = cxio_next_cqe(&chp->cq); + + if (!rd_cqe) + return 0; + + qhp = get_qhp(rhp, CQE_QPID(*rd_cqe)); + if (!qhp) + wq = NULL; + else { + spin_lock(&qhp->lock); + wq = &(qhp->wq); + } + ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie, + &credit); + if (credit) { + PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, + credit, chp->cq.cqid); + cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit); + } + + if (ret) { + ret = -EAGAIN; + goto out; + } + ret = 1; + + BUG_ON(!qhp); + + wc->wr_id = cookie; + wc->qp_num = qhp->wq.qpid; + + PDBG("%s qpid 0x%x type %d opcode %d status 0x%d wrid hi 0x%x " + "lo %x cookie %llx\n", __FUNCTION__, CQE_QPID(cqe), CQE_TYPE(cqe), + CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe), + CQE_WRID_LOW(cqe), cookie); + + if (CQE_TYPE(cqe) == 0) { + if (!CQE_STATUS(cqe)) + wc->byte_len = CQE_LEN(cqe); + else + wc->byte_len = 0; + wc->opcode = IB_WC_RECV; + } else { + switch (CQE_OPCODE(cqe)) { + case T3_RDMA_WRITE: + wc->opcode = IB_WC_RDMA_WRITE; + break; + case T3_READ_REQ: + wc->opcode = IB_WC_RDMA_READ; + wc->byte_len = CQE_LEN(cqe); + break; + case T3_SEND: + case T3_SEND_WITH_SE: + wc->opcode = IB_WC_SEND; + break; + case T3_BIND_MW: + wc->opcode = IB_WC_BIND_MW; + break; + + /* these aren't supported yet */ + case T3_SEND_WITH_INV: + case T3_SEND_WITH_SE_INV: + case T3_LOCAL_INV: + case T3_FAST_REGISTER: + default: + PDBG("unexpected opcode(0x%0x) in the CQE received " + "for QPID=0x%0x\n", CQE_OPCODE(cqe), + CQE_QPID(cqe)); + ret = 
-EINVAL; + goto out; + } + } + + if (cqe_flushed) { + wc->status = IB_WC_WR_FLUSH_ERR; + } else { + + switch (CQE_STATUS(cqe)) { + case TPT_ERR_SUCCESS: + wc->status = IB_WC_SUCCESS; + break; + case TPT_ERR_STAG: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_PDID: + wc->status = IB_WC_LOC_PROT_ERR; + break; + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_WRAP: + wc->status = IB_WC_GENERAL_ERR; + break; + case TPT_ERR_BOUND: + wc->status = IB_WC_LOC_LEN_ERR; + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + wc->status = IB_WC_MW_BIND_ERR; + break; + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + case TPT_ERR_OPCODE: + wc->status = IB_WC_FATAL_ERR; + break; + default: + PDBG("unexpected cqe_status(0x%0x) for QPID=0x(%0x)\n", + CQE_STATUS(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + } + } +out: + if (wq) + spin_unlock(&qhp->lock); + return ret; +} + +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + unsigned long flags; + int npolled; + int err = 0; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + + spin_lock_irqsave(&chp->lock, flags); + for (npolled = 0; npolled < num_entries; ++npolled) { +#ifdef DEBUG + int i=0; +#endif + + /* + * Because T3 can post CQEs that are _not_ associated + * with a WR, we might have to poll again after removing + * one of these. 
+ */ + do { + err = iwch_poll_cq_one(rhp, chp, wc + npolled); +#ifdef DEBUG + BUG_ON(++i > 1000); +#endif + } while (err == -EAGAIN); + if (err <= 0) + break; + } + spin_unlock_irqrestore(&chp->lock, flags); + + if (err < 0) + return err; + else { + return npolled; + } +} + +int iwch_modify_cq(struct ib_cq *cq, int cqe) +{ + PDBG("iwch_modify_cq: TBD\n"); + return 0; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c new file mode 100644 index 0000000..f1136c1 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -0,0 +1,1006 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" + +#define NO_SUPPORT -1 + +static inline int iwch_build_rdma_send(union t3_wr *wqe, + struct ib_send_wr *wr, + u8 * flit_cnt) +{ + int i; + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = T3_SEND_WITH_SE; + else + wqe->send.rdmaop = T3_SEND; + wqe->send.rem_stag = 0; + break; +#if 0 /* Not currently supported */ + case TYPE_SEND_INVALIDATE: + case TYPE_SEND_INVALIDATE_IMMEDIATE: + wqe->send.rdmaop = T3_SEND_WITH_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; + case TYPE_SEND_SE_INVALIDATE: + wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; +#endif + default: + break; + } + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->send.reserved = 0; + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + wqe->send.plen = 4; + wqe->send.sgl[0].stag = wr->imm_data; + wqe->send.sgl[0].len = 0; + wqe->send.num_sgle = 0; + *flit_cnt = 5; + } else { + wqe->send.plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((wqe->send.plen + wr->sg_list[i].length) < + wqe->send.plen) { + return -EMSGSIZE; + } + wqe->send.plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + } + wqe->send.plen = cpu_to_be32(wqe->send.plen); + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); + } + return 0; +} + +static inline int iwch_build_rdma_write(union t3_wr *wqe, + struct ib_send_wr *wr, + u8 *flit_cnt) +{ + int 
i;
+        if (wr->num_sge > T3_MAX_SGE)
+                return -EINVAL;
+        wqe->write.rdmaop = T3_RDMA_WRITE;
+        wqe->write.reserved = 0;
+        wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey);
+        wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr);
+
+        wqe->write.num_sgle = wr->num_sge;
+
+        if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) {
+                wqe->write.plen = cpu_to_be32(4);
+                wqe->write.sgl[0].stag = cpu_to_be32(wr->imm_data);
+                wqe->write.sgl[0].len = 0;
+                wqe->write.num_sgle = 0;
+                *flit_cnt = 6;
+        } else {
+                wqe->write.plen = 0;
+                for (i = 0; i < wr->num_sge; i++) {
+                        if ((wqe->write.plen + wr->sg_list[i].length) <
+                            wqe->write.plen) {
+                                return -EMSGSIZE;
+                        }
+                        wqe->write.plen += wr->sg_list[i].length;
+                        wqe->write.sgl[i].stag =
+                                cpu_to_be32(wr->sg_list[i].lkey);
+                        wqe->write.sgl[i].len =
+                                cpu_to_be32(wr->sg_list[i].length);
+                        wqe->write.sgl[i].to =
+                                cpu_to_be64(wr->sg_list[i].addr);
+                }
+                wqe->write.plen = cpu_to_be32(wqe->write.plen);
+                wqe->write.num_sgle = cpu_to_be32(wr->num_sge);
+                *flit_cnt = 5 + ((wr->num_sge) << 1);
+        }
+        return 0;
+}
+
+static inline int iwch_build_rdma_read(union t3_wr *wqe,
+                                       struct ib_send_wr *wr,
+                                       u8 *flit_cnt)
+{
+        if (wr->num_sge > 1)
+                return -EINVAL;
+        wqe->read.rdmaop = T3_READ_REQ;
+        wqe->read.reserved = 0;
+        wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+        wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr);
+        wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey);
+        wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length);
+        wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr);
+        *flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3;
+        return 0;
+}
+
+/*
+ * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now.
+ */ +static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp, + struct ib_sge *sg_list, u32 num_sgle, + u32 * pbl_addr, u8 * page_size) +{ + int i; + struct iwch_mr *mhp; + u32 offset; + for (i = 0; i < num_sgle; i++) { + mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8); + if (!mhp) + return -EIO; + if (!mhp->attr.state) + return -EIO; + if (mhp->attr.zbva) + return -EIO; + if (sg_list[i].addr < mhp->attr.va_fbo) + return -EINVAL; + if (sg_list[i].addr + ((u64) sg_list[i].length) < + sg_list[i].addr) + return -EINVAL; + if (sg_list[i].addr + ((u64) sg_list[i].length) > + mhp->attr.va_fbo + ((u64) mhp->attr.len)) + return -EINVAL; + offset = sg_list[i].addr - mhp->attr.va_fbo; + offset += ((u32) mhp->attr.va_fbo) % + (1UL << (12 + mhp->attr.page_size)); + pbl_addr[i] = mhp->attr.pbl_addr + + (offset >> (12 + mhp->attr.page_size)); + page_size[i] = mhp->attr.page_size; + } + return 0; +} + +static inline int iwch_build_rdma_recv(struct iwch_dev *rhp, + union t3_wr *wqe, + struct ib_recv_wr *wr) +{ + int i, err = 0; + u32 pbl_addr[4]; + u8 page_size[4]; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, + page_size); + if (err) + return err; + wqe->recv.pagesz[0] = page_size[0]; + wqe->recv.pagesz[1] = page_size[1]; + wqe->recv.pagesz[2] = page_size[2]; + wqe->recv.pagesz[3] = page_size[3]; + wqe->recv.num_sgle = cpu_to_be32(wr->num_sge); + for (i = 0; i < wr->num_sge; i++) { + wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + + /* to in the WQE == the offset into the page */ + wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) % + (1UL << (12 + page_size[i]))); + + /* pbl_addr is the adapters address in the PBL */ + wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]); + } + for (; i < T3_MAX_SGE; i++) { + wqe->recv.sgl[i].stag = 0; + wqe->recv.sgl[i].len = 0; + wqe->recv.sgl[i].to = 0; + wqe->recv.pbl_addr[i] = 0; + } + return 0; 
+} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + int err = 0; + u8 t3_wr_flit_cnt; + enum t3_wr_opcode t3_wr_opcode = 0; + enum t3_wr_flags t3_wr_flags; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + int flag; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if (num_wrs <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + while (wr) { + if (num_wrs == 0) { + err = -ENOMEM; + *bad_wr = wr; + break; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + t3_wr_flags = 0; + if (wr->send_flags & IB_SEND_SOLICITED) + t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; + if (wr->send_flags & IB_SEND_SIGNALED) + t3_wr_flags |= T3_COMPLETION_FLAG; + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + t3_wr_opcode = T3_WR_SEND; + err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + t3_wr_opcode = T3_WR_WRITE; + err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_READ: + PDBG("%s %d - read sq_wptr %u wptr %u cookie %llx\n", + __FUNCTION__, __LINE__, qhp->wq.sq_wptr, + qhp->wq.wptr, wr->wr_id); + t3_wr_opcode = T3_WR_READ; + t3_wr_flags = 0; /* XXX */ + err = iwch_build_rdma_read(wqe, wr, &t3_wr_flit_cnt); + break; + default: + PDBG("iwch_post_sendq: post of type=0x%0x TBD!\n", + wr->opcode); + err = -EINVAL; + } + if (err) { + *bad_wr = wr; + break; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + wqe->send.wrid.id0.low = qhp->wq.wptr; + wqe->flit[T3_SQ_COOKIE_FLIT] = wr->wr_id; + build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, + 
Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, t3_wr_flit_cnt); + PDBG("%s %d cookie %llx idx 0x%x sq_wptr %x sw_rptr %x wqe %p opcode %d\n", + __FUNCTION__, __LINE__, wr->wr_id, idx, + qhp->wq.sq_wptr, qhp->wq.sq_rptr, wqe, t3_wr_opcode); + if (!qhp->wq.sq_oldest_wr && + ((wr->send_flags & IB_SEND_SIGNALED) || + (wr->opcode == IB_WR_RDMA_READ))) { + qhp->wq.sq_oldest_wr = wqe; + PDBG("%s %d sq_oldest_wr %p\n", __FUNCTION__, __LINE__, + qhp->wq.sq_oldest_wr); + } + wr = wr->next; + num_wrs--; + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + } + spin_unlock_irqrestore(&qhp->lock, flag); + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + int err = 0; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + int flag; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, + qhp->wq.rq_size_log2) - 1; + if (!wr) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + while (wr) { + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + if (num_wrs) + err = iwch_build_rdma_recv(qhp->rhp, wqe, wr); + else + err = -ENOMEM; + if (err) { + *bad_wr = wr; + break; + } + qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] = + wr->wr_id; + build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, sizeof(struct t3_receive_wr) >> 3); + PDBG("%s %d cookie %llx idx 0x%x rq_wptr %x rw_rptr %x " + "wqe %p \n", __FUNCTION__, __LINE__, wr->wr_id, idx, + qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); + ++(qhp->wq.rq_wptr); + ++(qhp->wq.wptr); + wr = wr->next; + num_wrs--; + } + spin_unlock_irqrestore(&qhp->lock, flag); + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + return 
err; +} + +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_qp *qhp; + union t3_wr *wqe; + u32 pbl_addr; + u8 page_size; + u32 num_wrs; + int flag; + struct ib_sge sgl; + int err=0; + enum t3_wr_flags t3_wr_flags; + u32 idx; + + qhp = to_iwch_qp(qp); + mhp = to_iwch_mw(mw); + rhp = qhp->rhp; + + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if ((num_wrs) <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + PDBG("%s: idx=0x%0x, mw=0x%p, mw_bind=0x%p\n", __FUNCTION__, idx, + mw, mw_bind); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + wqe->send.wrid.id0.low = qhp->wq.wptr; + + t3_wr_flags = 0; + if (mw_bind->send_flags & IB_SEND_SIGNALED) + t3_wr_flags = T3_COMPLETION_FLAG; + + sgl.addr = mw_bind->addr; + sgl.lkey = mw_bind->mr->lkey; + sgl.length = mw_bind->length; + wqe->bind.reserved = 0; + wqe->bind.type = T3_VA_BASED_TO; + + /* TBD: check perms */ + wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags); + wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey); + wqe->bind.mw_stag = cpu_to_be32(mw->rkey); + wqe->bind.mw_len = cpu_to_be32(mw_bind->length); + wqe->bind.mw_va = cpu_to_be64(mw_bind->addr); + err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size); + if (err) { + spin_unlock_irqrestore(&qhp->lock, flag); + return err; + } + wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr); + wqe->bind.mr_pagesz = page_size; + wqe->bind.reserved2 = 0; + wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; + build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, + sizeof(struct t3_bind_mw_wr) >> 3); + + if 
(!qhp->wq.sq_oldest_wr) { + qhp->wq.sq_oldest_wr = wqe; + PDBG("%s %d sq_oldest_wr %p\n", __FUNCTION__, __LINE__, + qhp->wq.sq_oldest_wr); + } + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + spin_unlock_irqrestore(&qhp->lock, flag); + + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + + return err; +} + +int iwch_query_qp(u64 rh, u64 qp_h, enum iwch_qp_query_flags flags, + struct iwch_qp_attributes *attrs) +{ + return 0; +} + + +static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode, + int tagged) +{ + switch (t3err) { + case TPT_ERR_STAG: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_STAG; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_INV_STAG; + } + break; + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_STAG_NOT_ASSOC; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_STAG_NOT_ASSOC; + } + break; + case TPT_ERR_WRAP: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_TO_WRAP; + break; + case TPT_ERR_BOUND: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_BASE_BOUNDS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + } + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_CANT_INV_STAG; + break; + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + *layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_OUT_OF_RQE: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_NOBUF; + break; + case TPT_ERR_PBL_ADDR_BOUND: + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + break; + 
case TPT_ERR_CRC: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_CRC_ERR; + break; + case TPT_ERR_MARKER: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_MARKER_ERR; + break; + case TPT_ERR_PDU_LEN_ERR: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + break; + case TPT_ERR_DDP_VERSION: + if (tagged) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; /* XXX */ + *ecode = DDPT_INV_VERS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; /* XXX */ + *ecode = DDPU_INV_VERS; + } + break; + case TPT_ERR_RDMA_VERSION: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_VERS; + break; + case TPT_ERR_OPCODE: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_OPCODE; + break; + case TPT_ERR_DDP_QUEUE_NUM: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_QN; + break; + case TPT_ERR_MSN: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_RANGE; + break; + case TPT_ERR_TBIT: + *layer_type = LAYER_DDP|DDP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_MO: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MO; + break; + default: + *layer_type = LAYER_RDMAP|DDP_LOCAL_CATA; + *ecode = 0; + break; + } +} + +/* + * This posts a TERMINATE with layer=RDMA, type=catastrophic. 
+ */ +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg) +{ + int err = 0; + u32 idx; + union t3_wr *wqe; + int num_wrs; + int flag; + struct terminate_message *term; + int status; + int tagged = 0; + + PDBG("%s %d\n", __FUNCTION__, __LINE__); + spin_lock_irqsave(&qhp->lock, flag); + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if (num_wrs <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EIO; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + if (!qhp->wq.sq_oldest_wr) { + qhp->wq.sq_oldest_wr = wqe; + PDBG("%s %d sq_oldest_wr %p\n", __FUNCTION__, __LINE__, + qhp->wq.sq_oldest_wr); + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + wqe->send.wrid.id0.low = qhp->wq.wptr; + wqe->send.rdmaop = T3_TERMINATE; + wqe->send.rem_stag = 0; + wqe->send.reserved = 0; + + /* indicate data is immediate. */ + wqe->send.num_sgle = 0; + + /* immediate data length */ + wqe->send.plen = htonl(4); + + /* immediate data starts here. */ + term = (struct terminate_message *)wqe->send.sgl; + status = rsp_msg ? CQE_STATUS(rsp_msg->cqe) : TPT_ERR_INTERNAL_ERR; + if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE) + tagged = 1; + if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) || + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) + tagged = 2; + build_term_codes(status, &term->layer_etype, &term->ecode, tagged); + term->hdrct_rsvd = 0; /* no header info */ + + wqe->flit[T3_SQ_COOKIE_FLIT] = ~0; + build_fw_riwrh((void *)wqe, T3_WR_SEND, + T3_COMPLETION_FLAG|T3_NOTIFY_FLAG, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, 5); + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + spin_unlock_irqrestore(&qhp->lock, flag); + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +/* + * Assumes qhp lock is held. 
+ */
+static void flush_qp(struct iwch_qp *qhp, int *flag)
+{
+        struct iwch_cq *rchp, *schp;
+
+        rchp = qhp->rhp->cqid2hlp[qhp->attr.rcq];
+        schp = qhp->rhp->cqid2hlp[qhp->attr.scq];
+
+        /* take a ref on the qhp since we must release the lock */
+        atomic_inc(&qhp->refcnt);
+        spin_unlock_irqrestore(&qhp->lock, *flag);
+
+        /* locking hierarchy: cq lock first, then qp lock. */
+        spin_lock_irqsave(&rchp->lock, *flag);
+        spin_lock(&qhp->lock);
+        cxio_flush_rq(&qhp->rhp->rdev, &qhp->wq, &rchp->cq);
+        spin_unlock(&qhp->lock);
+        spin_unlock_irqrestore(&rchp->lock, *flag);
+
+        /* locking hierarchy: cq lock first, then qp lock. */
+        spin_lock_irqsave(&schp->lock, *flag);
+        spin_lock(&qhp->lock);
+        cxio_flush_sq(&qhp->rhp->rdev, &qhp->wq, &schp->cq);
+        spin_unlock(&qhp->lock);
+        spin_unlock_irqrestore(&schp->lock, *flag);
+
+        /* deref */
+        if (atomic_dec_and_test(&qhp->refcnt))
+                wake_up(&qhp->wait);
+
+        spin_lock_irqsave(&qhp->lock, *flag);
+}
+
+static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp,
+                     enum iwch_qp_attr_mask mask,
+                     struct iwch_qp_attributes *attrs)
+{
+        struct t3_rdma_init_attr init_attr;
+        int ret;
+
+        init_attr.tid = qhp->ep->hwtid;
+        init_attr.qpid = qhp->wq.qpid;
+        init_attr.pdid = qhp->attr.pd;
+        init_attr.scqid = qhp->attr.scq;
+        init_attr.rcqid = qhp->attr.rcq;
+
+        /* TBD!!! rq table slot allocation needs
+         * to be implemented in the core driver.
+         * For now, allocate 1Kx64B for each rq
+         */
+        init_attr.rq_addr = (qhp->ep->hwtid) << 16;
+        init_attr.rq_size = 1 << qhp->wq.rq_size_log2;
+
+        PDBG("%s init_attr.rq_size = %d\n", __FUNCTION__, init_attr.rq_size);
+        init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE |
+                             qhp->attr.mpa_attr.recv_marker_enabled |
+                             (qhp->attr.mpa_attr.xmit_marker_enabled << 1) |
+                             (qhp->attr.mpa_attr.crc_enabled << 2);
+
+        /*
+         * XXX - The IWCM doesn't quite handle getting these
+         * attrs set before going into RTS. For now, just turn
+         * them on always...
+ */ +#if 0 + init_attr.qpcaps = qhp->attr.enableRdmaRead | + (qhp->attr.enableRdmaWrite << 1) | + (qhp->attr.enableBind << 2) | + (qhp->attr.enable_stag0_fastreg << 3) | + (qhp->attr.enable_stag0_fastreg << 4); +#else + init_attr.qpcaps = 0x1f; +#endif + init_attr.tcp_emss = qhp->ep->emss; + init_attr.ord = qhp->attr.max_ord; + init_attr.ird = qhp->attr.max_ird; + init_attr.qp_dma_addr = qhp->wq.dma_addr; + init_attr.qp_dma_size = (1UL << qhp->wq.size_log2); + init_attr.rqes_posted = Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr) ? + 0 : 1; + ret = cxio_rdma_init(&rhp->rdev, &init_attr); + PDBG("%s ret %d\n", __FUNCTION__, ret); + return ret; +} + +int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal) +{ + int ret = 0; + struct iwch_qp_attributes newattr = qhp->attr; + int flag; + int disconnect = 0; + int terminate = 0; + int abort = 0; + int free = 0; + struct iwch_ep *ep = NULL; + + PDBG("%s %d qhp %p qpid %d ep %p state %d -> %d\n", __FUNCTION__, + __LINE__, qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, + (mask & IWCH_QP_ATTR_NEXT_STATE) ? 
attrs->next_state : -1); + + spin_lock_irqsave(&qhp->lock, flag); + + /* Process attr changes if in IDLE */ + if (mask & IWCH_QP_ATTR_VALID_MODIFY) { + if (qhp->attr.state != IWCH_QP_STATE_IDLE) { + ret = -EIO; + goto out; + } + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ) + newattr.enable_rdma_read = attrs->enable_rdma_read; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE) + newattr.enable_rdma_write = attrs->enable_rdma_write; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND) + newattr.enable_bind = attrs->enable_bind; + if (mask & IWCH_QP_ATTR_MAX_ORD) { + if (attrs->max_ord > + rhp->attr.max_rdma_read_qp_depth) { + ret = -EINVAL; + goto out; + } + newattr.max_ord = attrs->max_ord; + } + if (mask & IWCH_QP_ATTR_MAX_IRD) { + if (attrs->max_ird > + rhp->attr.max_rdma_reads_per_qp) { + ret = -EINVAL; + goto out; + } + newattr.max_ird = attrs->max_ird; + } + qhp->attr = newattr; + } + + if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) + goto out; + if (qhp->attr.state == attrs->next_state) + goto out; + + switch (qhp->attr.state) { + case IWCH_QP_STATE_IDLE: + switch (attrs->next_state) { + case IWCH_QP_STATE_RTS: + if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) { + ret = -EINVAL; + goto out; + } + if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) { + ret = -EINVAL; + goto out; + } + qhp->attr.mpa_attr = attrs->mpa_attr; + qhp->attr.llp_stream_handle = attrs->llp_stream_handle; + qhp->ep = qhp->attr.llp_stream_handle; + qhp->attr.state = IWCH_QP_STATE_RTS; + + /* + * Ref the endpoint here and deref when we + * disassociate the endpoint from the QP. This + * happens in CLOSING->IDLE transition or *->ERROR + * transition. 
+ */ + atomic_inc(&qhp->ep->com.refcnt); + spin_unlock_irqrestore(&qhp->lock, flag); + ret = rdma_init(rhp, qhp, mask, attrs); + spin_lock_irqsave(&qhp->lock, flag); + if (ret) + goto err; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + flush_qp(qhp, &flag); + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_RTS: + switch (attrs->next_state) { + case IWCH_QP_STATE_CLOSING: + BUG_ON(atomic_read(&qhp->ep->com.refcnt) < 2); + qhp->attr.state = IWCH_QP_STATE_CLOSING; + if (Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr)) { + if (!internal) { + abort=0; + disconnect = 1; + ep = qhp->ep; + } + } else { + if (!internal) { + abort=1; + disconnect = 1; + ep = qhp->ep; + } + ret = -EINVAL; + goto err; + } + break; + case IWCH_QP_STATE_TERMINATE: + qhp->attr.state = IWCH_QP_STATE_TERMINATE; + if (!internal) + terminate = 1; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + if (!internal) { + abort=1; + disconnect = 1; + ep = qhp->ep; + } + goto err; + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_CLOSING: + if (!internal) { + ret = -EINVAL; + goto out; + } + switch (attrs->next_state) { + case IWCH_QP_STATE_IDLE: + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.llp_stream_handle = NULL; + free_ep(&qhp->ep->com); + qhp->ep = NULL; + wake_up(&qhp->wait); + break; + case IWCH_QP_STATE_ERROR: + goto err; + default: + ret = -EINVAL; + goto err; + } + break; + case IWCH_QP_STATE_ERROR: + if (attrs->next_state != IWCH_QP_STATE_IDLE) { + ret = -EINVAL; + goto out; + } + + if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || + !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) { + ret = -EINVAL; + goto out; + } + qhp->attr.state = IWCH_QP_STATE_IDLE; + memset(&qhp->attr, 0, sizeof(qhp->attr)); + break; + case IWCH_QP_STATE_TERMINATE: + if (!internal) { + ret = -EINVAL; + goto out; + } + goto err; + break; + default: + printk(KERN_ERR "%s in a bad state %d\n", + 
__FUNCTION__, qhp->attr.state); + ret = -EINVAL; + goto err; + break; + } + goto out; +err: + PDBG("%s disassociating LLP EP %p qpid %d\n", __FUNCTION__, qhp->ep, + qhp->wq.qpid); + + /* disassociate the LLP connection */ + qhp->attr.llp_stream_handle = NULL; + ep = qhp->ep; + qhp->ep = NULL; + qhp->attr.state = IWCH_QP_STATE_ERROR; + free=1; + wake_up(&qhp->wait); + BUG_ON(!ep); +#ifdef notyet + flush_qp(qhp, flag); +#endif +out: + spin_unlock_irqrestore(&qhp->lock, flag); + + if (terminate) + iwch_post_terminate(qhp, NULL); + + /* + * If disconnect is 1, then we need to initiate a disconnect + * on the EP. This can be a normal close (RTS->CLOSING) or + * an abnormal close (RTS/CLOSING->ERROR). + */ + if (disconnect) + iwch_ep_disconnect(ep, abort, GFP_KERNEL); + + /* + * If free is 1, then we've disassociated the EP from the QP + * and we need to dereference the EP. + */ + if (free) + free_ep(&ep->com); + + PDBG("%s %d state -> %d\n", __FUNCTION__, __LINE__, qhp->attr.state); + return ret; +} + +static int quiesce_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_quiesce_tid(qhp->ep); + qhp->flags |= QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +static int resume_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_resume_tid(qhp->ep); + qhp->flags &= ~QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +int iwch_quiesce_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = chp->rhp->qpid2hlp[i]; + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) { + quiesce_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) + quiesce_qp(qhp); + } + return 0; +} + +int iwch_resume_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = chp->rhp->qpid2hlp[i]; + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) { + resume_qp(qhp); + 
continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp)) + resume_qp(qhp); + } + return 0; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h new file mode 100644 index 0000000..ab87f72 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_user.h @@ -0,0 +1,62 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_USER_H__ +#define __IWCH_USER_H__ + +#define IWCH_UVERBS_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. + */ + +struct iwch_create_cq_resp { + __u32 cqid; + __u32 entries; /* actual number of entries after creation */ + __u64 physaddr; /* library mmaps this to get addressability */ + __u64 queue; +}; + +struct iwch_create_qp_resp { + __u32 qpid; + __u32 entries; /* actual number of entries after creation */ + __u64 physaddr; /* library mmaps this to get addressability */ + __u64 physsize; /* library mmaps this to get addressability */ + __u64 queue; + __u64 sq_db_page; + __u64 rq_db_page; +}; +#endif From swise at opengridcomputing.com Fri Jun 23 07:29:39 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:39 -0500 Subject: [openib-general] [PATCH v2 03/14] CXGB3 Memory Registration In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623142939.32410.29905.stgit@stevo-desktop> This patch contains the code to register memory regions and windows. --- drivers/infiniband/hw/cxgb3/iwch_mem.c | 171 ++++++++++++++++++++++++++++++++ 1 files changed, 171 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c new file mode 100644 index 0000000..68ed76a --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c @@ -0,0 +1,171 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include + +#include +#include + +#include "cxio_hal.h" +#include "iwch.h" +#include "iwch_provider.h" + +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + u64 *page_list) +{ + u32 stag; + u64 mem_h; + + + if (cxio_register_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) { + return -ENOMEM; + } + mhp->attr.state = 1; + mhp->attr.stag = stag; + mem_h = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + rhp->stag2hlp[mem_h] = mhp; + PDBG("iwch_register_mem: mem_h(0x%0llx) mhp(%p)\n", mem_h, mhp); + return 0; +} + +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + u64 *page_list) +{ + u32 stag; + u64 mem_h; + + + stag = mhp->attr.stag; + if (cxio_reregister_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) { + return -ENOMEM; + } + mhp->attr.state = 1; + mhp->attr.stag = stag; + mem_h = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + rhp->stag2hlp[mem_h] = mhp; + PDBG("iwch_reregister_mem: mem_h(0x%0llx) mhp(%p)\n", mem_h, mhp); + return 0; +} + +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + u64 **page_list) +{ + u64 mask; + int i, j, n; + + mask = 0; + *total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) + return -EINVAL; + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return -EINVAL; + *total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + if (*total_size > 0xFFFFFFFFULL) + return -ENOMEM; + + /* Find largest page shift we can use to cover buffers */ 
+ for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift)) + if (num_phys_buf > 1) { + if ((1ULL << *shift) & mask) + break; + } else { + if (1ULL << *shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << *shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1); + buffer_list[0].addr &= ~0ull << *shift; + + *npages = 0; + for (i = 0; i < num_phys_buf; ++i) + *npages += (buffer_list[i].size + + (1ULL << *shift) - 1) >> *shift; + + if (!*npages) { + return -EINVAL; + } + + *page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL); + if (!*page_list) { + return -ENOMEM; + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift; + ++j) + (*page_list)[n++] = cpu_to_be64(buffer_list[i].addr + + ((u64) j << *shift)); + + PDBG("%s va %llx mask %llx shift %d len %lld pbl_size %d\n", + __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages); + PDBG("pa0 %llx\n", (*page_list)[0]); + + return 0; + +} From swise at opengridcomputing.com Fri Jun 23 07:29:44 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:44 -0500 Subject: [openib-general] [PATCH v2 04/14] CXGB3 Async Events In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623142944.32410.95234.stgit@stevo-desktop> This patch contains code to handle async and completion events. --- drivers/infiniband/hw/cxgb3/iwch_ev.c | 209 +++++++++++++++++++++++++++++++++ 1 files changed, 209 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c new file mode 100644 index 0000000..36837b1 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp, + struct respQ_msg_t *rsp_msg, + enum ib_event_type ib_event, + int send_term) +{ + struct ib_event event; + struct iwch_qp_attributes attrs; + struct iwch_qp *qhp; + + printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + + spin_lock(&rnicp->lock); + qhp = rnicp->qpid2hlp[CQE_QPID(rsp_msg->cqe)]; + if (!qhp) { + printk(KERN_ERR "%s unaffiliated error %d\n", + __FUNCTION__, CQE_STATUS(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + BUG_ON(1); + return; + } + atomic_inc(&qhp->refcnt); + spin_unlock(&rnicp->lock); + + event.event = ib_event; + event.device = chp->ibcq.device; + if (ib_event == IB_EVENT_CQ_ERR) + event.element.cq = &chp->ibcq; + else + event.element.qp = &qhp->ibqp; + + if (qhp->ibqp.event_handler) + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); + attrs.next_state = IWCH_QP_STATE_TERMINATE; + if ((qhp->attr.state == IWCH_QP_STATE_RTS) && + !iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1) && send_term) + iwch_post_terminate(qhp, rsp_msg); + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); +} + +void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb) +{ + struct iwch_dev *rnicp; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data; + struct iwch_cq *chp; + struct iwch_qp *qhp; + + u64 cq_h = be16_to_cpu(rsp_msg->cq_id); + rnicp = (struct iwch_dev *) rdev_p->ulp; + + spin_lock(&rnicp->lock); + chp = rnicp->cqid2hlp[cq_h]; + qhp = rnicp->qpid2hlp[CQE_QPID(rsp_msg->cqe)]; + if (!chp || !qhp) { + printk(KERN_ERR MOD "Event for deleted cq or qp - " + 
"cqid %d qpid %d\n", (u32)cq_h, + (u32)CQE_QPID(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + goto out; + } + iwch_qp_add_ref(&qhp->ibqp); + atomic_inc(&chp->refcnt); + spin_unlock(&rnicp->lock); + + PDBG("%s - cq_h %lld\n", __FUNCTION__, cq_h); + + BUG_ON(!chp->ibcq.comp_handler); + + /* + * 1) incoming TERMINATE message. + * 2) completion of our sending a TERMINATE. + */ + if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && + (CQE_STATUS(rsp_msg->cqe) == 0)) { + if (SQ_TYPE(rsp_msg->cqe)) { + PDBG("%s %d disconnecting\n", __FUNCTION__, __LINE__); + BUG_ON(!qhp->ep); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } else { + PDBG("%s %d post REQ_ERR AE\n", __FUNCTION__, __LINE__); + post_qp_event(rnicp, chp, rsp_msg, + IB_EVENT_QP_REQ_ERR, 0); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } + goto done; + } + + /* Bad incoming Read request */ + if (SQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + /* Bad incoming write */ + if (RQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + switch (CQE_STATUS(rsp_msg->cqe)) { + + /* Completion Events */ + case TPT_ERR_SUCCESS: + + /* + * Confirm the destination entry if this is a RECV completion. 
+ */ + if (qhp->ep && SQ_TYPE(rsp_msg->cqe)) + dst_confirm(qhp->ep->dst); + + case TPT_ERR_STAG: + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + case TPT_ERR_WRAP: + case TPT_ERR_BOUND: + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + break; + + /* Device Fatal Errors */ + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1); + break; + + /* QP Fatal Errors */ + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_PBL_ADDR_BOUND: + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_OPCODE: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_RQE_ADDR_BOUND: + case TPT_ERR_IRD_OVERFLOW: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + + default: + PDBG("%s unknown T3 status 0x%x\n", __FUNCTION__, + CQE_STATUS(rsp_msg->cqe)); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + } +done: + if (atomic_dec_and_test(&chp->refcnt)) + wake_up(&chp->wait); + iwch_qp_rem_ref(&qhp->ibqp); +out: + dev_kfree_skb_irq(skb); +} From swise at opengridcomputing.com Fri Jun 23 07:30:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:31 -0500 Subject: [openib-general] [PATCH v2 13/14] CXGB3 Makefiles/Kconfig In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143031.32410.45614.stgit@stevo-desktop> The cxgb3 rdma support is broken into 2 modules: iw_cxgb3.ko - the openib provider module. cxgb3c.ko - the cxgb3 "core" services module. 
--- drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/cxgb3/Kconfig | 14 ++++++++++++++ drivers/infiniband/hw/cxgb3/Makefile | 21 +++++++++++++++++++++ drivers/infiniband/hw/cxgb3/locking.txt | 25 +++++++++++++++++++++++++ 5 files changed, 62 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 04e6d4f..7dcf976 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -37,6 +37,7 @@ config INFINIBAND_ADDR_TRANS source "drivers/infiniband/hw/mthca/Kconfig" source "drivers/infiniband/hw/ipath/Kconfig" source "drivers/infiniband/hw/amso1100/Kconfig" +source "drivers/infiniband/hw/cxgb3/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index e2b93f9..1a73af0 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -2,5 +2,6 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ obj-$(CONFIG_IPATH_CORE) += hw/ipath/ obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ +obj-$(CONFIG_INFINIBAND_IWCH) += hw/cxgb3/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig new file mode 100644 index 0000000..156df63 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Kconfig @@ -0,0 +1,14 @@ +config INFINIBAND_IWCH + tristate "Chelsio OpenIB module" + depends on CHELSIO_T3 && INFINIBAND + ---help--- + This is the Chelsio OpenIB provider module. + +config INFINIBAND_IWCH_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_IWCH + default n + ---help--- + This option causes the Chelsio OpenIB provider module to produce + a bunch of debug messages. Select this if you are developing the + driver or trying to diagnose a problem. 
diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile new file mode 100644 index 0000000..ed72caa --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Makefile @@ -0,0 +1,21 @@ +EXTRA_CFLAGS += \ + -DCONFIG_CHELSIO_T3_OFFLOAD \ + -I$(TOPDIR)/drivers/infiniband/include \ + -I$(TOPDIR)/drivers/net/cxgb3 \ + -I$(TOPDIR)/drivers/infiniband/hw/cxgb3/t3c \ + -I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core + +obj-$(CONFIG_INFINIBAND_IWCH) += iw_cxgb3.o cxgb3c.o + +iw_cxgb3-y := iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \ + iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o + +ifdef CONFIG_INFINIBAND_IWCH_DEBUG +EXTRA_CFLAGS += -O1 -g -DDEBUG +iw_cxgb3-y += core/cxio_dbg.o +endif + +cxgb3c-y := \ + t3c/t3c.o \ + t3c/l2t.o \ + t3c/t3cdev.o diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt new file mode 100644 index 0000000..e5e9991 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/locking.txt @@ -0,0 +1,25 @@ +cq lock: + - spin lock + - used to synchronize the t3_cq + +qp lock: + - spin lock + - used to synchronize updates to the qp state, attrs, and the t3_wq. + - touched on interrupt and process context + +rnicp lock: + - spin lock + - touched on interrupt and process context + - used around lookup tables mapping CQID and QPID to a structure. + - used also to bump the refcnt atomically with the lookup. + +poll: + lock+disable on cq lock + lock qp lock for each cqe that is polled around the call + to cxio_poll_cq(). + +post: + lock+disable qp lock + +global mutex iwch_mutex: + used to maintain global device list. 
From swise at opengridcomputing.com Fri Jun 23 07:30:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:36 -0500 Subject: [openib-general] [PATCH v2 14/14] CXGB3 Low Level Driver ULP Interface In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143036.32410.98171.stgit@stevo-desktop> This is all I'm submitting from the LLD/NETDEV driver. These headers define the interface used by the other modules to discover devices and communicate with the device. The entire LLD driver can be found in gen2/branches/iwarp/src/linux-kernel/net/cxgb3 --- drivers/net/cxgb3/t3_core.h | 45 ++++++++++++++++++++++++++++ drivers/net/cxgb3/t3cdev.h | 69 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 114 insertions(+), 0 deletions(-) diff --git a/drivers/net/cxgb3/t3_core.h b/drivers/net/cxgb3/t3_core.h new file mode 100644 index 0000000..1ce076a --- /dev/null +++ b/drivers/net/cxgb3/t3_core.h @@ -0,0 +1,45 @@ +/* + * Copyright (C) 2003-2006 Chelsio Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _T3_CORE_H_ +#define _T3_CORE_H_ +#include + +struct t3cdev; +struct t3_core { + void (*add) (struct t3cdev *); + void (*remove) (struct t3cdev *); +}; + +extern struct t3_core *t3_core; +void t3_register_core(struct t3_core *core); +void t3_unregister_core(struct t3_core *core); +#endif diff --git a/drivers/net/cxgb3/t3cdev.h b/drivers/net/cxgb3/t3cdev.h new file mode 100644 index 0000000..7bc2df6 --- /dev/null +++ b/drivers/net/cxgb3/t3cdev.h @@ -0,0 +1,69 @@ +/* + * Copyright (C) 2003-2006 Chelsio Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _T3CDEV_H_ +#define _T3CDEV_H_ + +#include +#include +#include +#include +#include +#include +#include +#include + +#define T3CNAMSIZ 16 + +#define NETIF_F_TCPIP_OFFLOAD (1 << 16) + +/* Get the t3cdev associated with a net_device */ +#define T3CDEV(netdev) (*(struct t3cdev **)&(netdev)->ec_ptr) + +struct t3cdev { + char name[T3CNAMSIZ]; /* T3C device name */ + struct list_head t3c_list; /* for list linking */ + struct net_device *lldev; /* LL dev associated with T3C messages */ + struct proc_dir_entry *proc_dir; /* root of proc dir for this T3C */ + int (*open)(struct t3cdev *dev); + int (*close)(struct t3cdev *dev); + int (*send)(struct t3cdev *dev, struct sk_buff *skb); + int (*recv)(struct t3cdev *dev, struct sk_buff **skb, int n); + int (*ctl)(struct t3cdev *dev, unsigned int req, void *data); + void (*neigh_update)(struct t3cdev *dev, struct neighbour *neigh, + int fl, struct net_device *lldev); + void *priv; /* driver private data */ + void *l2opt; /* optional layer 2 data */ + void *l3opt; /* optional layer 3 data */ + void *l4opt; /* optional layer 4 data */ + void *ulp; /* ulp stuff */ +}; +#endif /* _T3CDEV_H_ */ From swise at opengridcomputing.com Fri Jun 23 07:29:55 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:55 -0500 Subject: [openib-general] [PATCH v2 06/14] CXGB3 RDMA Core Debug Code In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> 
Message-ID: <20060623142955.32410.44090.stgit@stevo-desktop> This patch implements debug code for the RDMA Core. --- drivers/infiniband/hw/cxgb3/core/cxio_dbg.c | 209 +++++++++++++++++++++++++++ 1 files changed, 209 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c new file mode 100644 index 0000000..4cc3e96 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifdef DEBUG +#include +#include "common.h" +#include "cxgb3_ioctl.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size = 32; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + DBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base; + m->len = size; + DBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + DBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + DBG("TPT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size, npages; + + shift += 12; + npages = (len + (1ULL << shift) - 1) >> shift; + size = npages * sizeof(u64); + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + DBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = (pbl_addr<<3) + rdev->rnic_info.pbl_base; + m->len = size; + DBG("%s PBL addr 0x%x len %d depth %d\n", + __FUNCTION__, m->addr, m->len, npages); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + DBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + DBG("PBL %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_wqe(union t3_wr *wqe) +{ + u64 *data = (u64 *)wqe; + uint size = (uint)(be64_to_cpu(*data) & 0xff); + + while (size > 0) { + DBG("WQE %p: %016llx\n", data, be64_to_cpu(*data)); + size--; + data++; + } +} + +void cxio_dump_wce(struct t3_cqe *wce) +{ + u64 
*data = (u64 *)wce; + int size = sizeof(*wce); + + while (size > 0) { + DBG("WCE %p: %016llx\n", data, be64_to_cpu(*data)); + size -= 8; + data++; + } +} + +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents) +{ + struct ch_mem_range *m; + int size = nents * 64; + u64 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + DBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base; + m->len = size; + DBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + DBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + DBG("RQT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid) +{ + struct ch_mem_range *m; + int size = TCB_SIZE; + u32 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + DBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_CM; + m->addr = hwtid * size; + m->len = size; + DBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + DBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u32 *)m->buf; + while (size > 0) { + printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", + m->addr, + *(data+2), *(data+3), *(data),*(data+1), + *(data+6), *(data+7), *(data+4), *(data+5)); + size -= 32; + data += 8; + m->addr += 32; + } + kfree(m); +} +EXPORT_SYMBOL(cxio_dump_tpt); +EXPORT_SYMBOL(cxio_dump_pbl); +EXPORT_SYMBOL(cxio_dump_wqe); +EXPORT_SYMBOL(cxio_dump_wce); +EXPORT_SYMBOL(cxio_dump_rqt); +EXPORT_SYMBOL(cxio_dump_tcb); +#endif From swise at opengridcomputing.com Fri Jun 23 07:30:00 2006 From: swise at 
opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:00 -0500 Subject: [openib-general] [PATCH v2 07/14] CXGB3 RDMA Core HAL Code. In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143000.32410.17526.stgit@stevo-desktop> This code implements a HAL interface to the T3 hardware. --- drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1152 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_hal.h | 166 ++++ 2 files changed, 1318 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c new file mode 100644 index 0000000..e142e5f --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c @@ -0,0 +1,1152 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include "cxio_hal.h" +#include "sge_defs.h" +#include + +static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC]; +static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL; + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (!strcmp(rdev_tbl[i]->dev_name, dev_name)) + return rdev_tbl[i]; + return NULL; +} + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev + *tdev) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (rdev_tbl[i]->t3cdev_p == tdev) + return rdev_tbl[i]; + return NULL; +} + +static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) { + if (!rdev_tbl[i]) { + rdev_tbl[i] = rdev_p; + break; + } + } + return (i == T3_MAX_NUM_RNIC); +} + +static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i] == rdev_p) { + rdev_tbl[i] = NULL; + break; + } +} + +extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl); +extern void cxio_hal_destroy_rhdl_resource(void); +extern int cxio_hal_init_resource(struct cxio_hal_resource **rscpp, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, + u32 nr_pdid); +extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag); +extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid); +extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp); +extern void 
cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid); +extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp); + +int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit) +{ + int ret; + struct t3_cqe *cqe; + u32 rptr; + + struct rdma_cq_op setup; + setup.id = cq->cqid; + setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0; + setup.op = op; + ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup); + + if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) + return ret; + + /* + * If the rearm returned an index other than our current index, + * then there might be CQE's in flight (being DMA'd). We must wait + * here for them to complete or the consumer can miss a notification. + */ + if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) { + int i=0; + + rptr = cq->rptr; + + /* + * Keep the generation correct by bumping rptr until it + * matches the index returned by the rearm - 1. + */ + while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret) + rptr++; + + /* + * Now rptr is the index for the (last) cqe that was + * in-flight at the time the HW rearmed the CQ. We + * spin until that CQE is valid. 
+ */ + cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2); + while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) { + udelay(1); + if (i++ > 1000000) { + BUG_ON(1); + printk(KERN_ERR "%s: stalled rnic\n", + rdev_p->dev_name); + return -EIO; + } + } + } + return 0; +} + +static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid) +{ + struct rdma_cq_setup setup; + setup.id = cqid; + setup.base_addr = 0; /* NULL address */ + setup.size = 0; /* disable the CQ */ + setup.credits = 0; + setup.credit_thres = 0; + setup.ovfl_mode = 0; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) +{ + u64 sge_cmd; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + if (!skb) { + DBG("failed in alloc_skb in cxio_hal_clear_qp_ctx\n"); + return -ENOMEM; + } + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0x3, 1, qpid, + 0x4); + sge_cmd = qpid << 8 | 3; + wqe->wrid.id1 = cpu_to_be64(sge_cmd); + wqe->ctx1 = 0ULL; + wqe->ctx0 = 0ULL; + skb->priority = CPL_PRIORITY_CONTROL; + return (t3c_send(rdev_p->t3cdev_p, skb)); +} + +int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe); + + cq->cqid = cxio_hal_get_cqid(rdev_p->rscp); + if (!cq->cqid) + return -ENOMEM; + cq->sw_queue = kzalloc(size, GFP_KERNEL); + if (!cq->sw_queue) + return -ENOMEM; + cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) * + sizeof(struct t3_cqe), + &(cq->dma_addr), GFP_KERNEL); + if (!cq->queue) { + kfree(cq->sw_queue); + return -ENOMEM; + } + pci_unmap_addr_set(cq, mapping, cq->dma_addr); + memset(cq->queue, 0, size); + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = 65535; + setup.credit_thres = 1; + 
setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = setup.size; + setup.credit_thres = setup.size; /* TBD: overflow recovery */ + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain, + struct t3_wq *wq) +{ + int depth = 1UL << wq->size_log2; + wq->qpid = cxio_hal_get_qpid(rdev_p->rscp); + if (!wq->qpid) + return -ENOMEM; + + wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL); + if (!wq->rq) { + cxio_hal_put_qpid(rdev_p->rscp, wq->qpid); + return -ENOMEM; + } + + wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + depth * sizeof(union t3_wr), + &(wq->dma_addr), GFP_KERNEL); + if (!wq->queue) { + kfree(wq->rq); + cxio_hal_put_qpid(rdev_p->rscp, wq->qpid); + return -ENOMEM; + } + + pci_unmap_addr_set(wq, mapping, wq->dma_addr); +#ifdef USER_DOORBELL + if (kernel_domain) +#endif + wq->doorbell = rdev_p->rnic_info.kdb_addr; +#ifdef USER_DOORBELL + else + wq->doorbell = (void *)rdev_p->rnic_info.udbell_physbase + + (wq->qpid << PAGE_SHIFT); +#endif + return 0; +} + +int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + int err; + err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid); + kfree(cq->sw_queue); + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) + * sizeof(struct t3_cqe), cq->queue, + pci_unmap_addr(cq, mapping)); + cxio_hal_put_cqid(rdev_p->rscp, cq->cqid); + return err; +} + +int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq) +{ + int err; + err = cxio_hal_clear_qp_ctx(rdev_p, wq->qpid); + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (wq->size_log2)) + * sizeof(union t3_wr), wq->queue, + pci_unmap_addr(wq, 
mapping)); + kfree(wq->rq); + cxio_hal_put_qpid(rdev_p->rscp, wq->qpid); + return err; +} + +static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_cqe cqe; + + DBG("%s %d wq %p cq %p sw_rptr %x sw_wptr %x\n", __FUNCTION__, + __LINE__, wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = V_CQE_STATUS(1) | + V_CQE_OPCODE(T3_SEND) | + V_CQE_TYPE(0) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, cq->size_log2)); + cqe.header = cpu_to_be32(cqe.header); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_rq(struct cxio_rdev *rdev_p, struct t3_wq *wq, + struct t3_cq *cq) +{ + u32 ptr; + + DBG("%s %d wq %p cq %p\n", __FUNCTION__, __LINE__, wq, cq); + + /* mark the wq in error so all CQEs will be completed as flushed */ + wq->error = 1; + + /* flush RQ */ + ptr = wq->rq_rptr; + while (ptr++ != wq->rq_wptr) { + insert_recv_cqe(wq, cq); + } +} + +static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, union t3_wr *wr) +{ + struct t3_cqe cqe; + enum t3_rdma_opcode op; + + DBG("%s %d wq %p cq %p sw_rptr %x sw_wptr %x\n", __FUNCTION__, + __LINE__, wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + op = wr2opcode(G_FW_RIWR_OP(be32_to_cpu(wr->send.wrh.op_seop_flags))); + cqe.header = V_CQE_STATUS(1) | + V_CQE_OPCODE(op) | + V_CQE_TYPE(1) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, cq->size_log2)); + cqe.header = cpu_to_be32(cqe.header); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + CQE_WRID_SQ_WPTR(cqe) = wr->send.wrid.id0.hi; + CQE_WRID_WPTR(cqe) = wr->send.wrid.id0.low; + cq->sw_wptr++; +} + +void cxio_flush_sq(struct cxio_rdev *rdev_p, struct t3_wq *wq, + struct t3_cq *cq) +{ + u32 ptr; + union t3_wr *wr = wq->sq_oldest_wr; + + DBG("%s %d wq %p cq %p\n", __FUNCTION__, __LINE__, wq, cq); + + /* mark the wq in error so all CQEs will be completed as flushed 
*/ + wq->error = 1; + + /* flush SQ */ + ptr = wq->sq_rptr; + while (ptr++ != wq->sq_wptr) { + BUG_ON(!wr); + insert_sq_cqe(wq, cq, wr); + wr = next_sq_wr(wq); + + } +} + +static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p) +{ + struct rdma_cq_setup setup; + setup.id = 0; + setup.base_addr = 0; /* NULL address */ + setup.size = 1; /* enable the CQ */ + setup.credits = 0; + + /* force SGE to redirect to RspQ and interrupt */ + setup.credit_thres = 0; + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) +{ + int err; + u64 sge_cmd, ctx0, ctx1; + u64 base_addr; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + if (!skb) { + DBG("failed in alloc_skb in init_ctrl_qp\n"); + return -ENOMEM; + } + err = cxio_hal_init_ctrl_cq(rdev_p); + if (err) { + DBG("err initializing ctrl_cq, err status = %d\n", err); + kfree_skb(skb); + return err; + } + rdev_p->ctrl_qp.workq = dma_alloc_coherent( + &(rdev_p->rnic_info.pdev->dev), + (1 << T3_CTRL_QP_SIZE_LOG2) * + sizeof(union t3_wr), + &(rdev_p->ctrl_qp.dma_addr), + GFP_KERNEL); + if (!rdev_p->ctrl_qp.workq) { + DBG("failed to allocate memory for ctrl QP\n"); + kfree_skb(skb); + return -ENOMEM; + } + pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping, + rdev_p->ctrl_qp.dma_addr); + rdev_p->ctrl_qp.doorbell = rdev_p->rnic_info.kdb_addr; + memset(rdev_p->ctrl_qp.workq, 0, + (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr)); + + init_MUTEX(&rdev_p->ctrl_qp.sem); + init_waitqueue_head(&rdev_p->ctrl_qp.waitq); + + /* update HW Ctrl QP context */ + base_addr = rdev_p->ctrl_qp.dma_addr; + base_addr >>= 12; + ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) | + V_EC_BASE_LO((u32) base_addr & 0xffff)); + ctx0 <<= 32; + ctx0 |= V_EC_CREDITS(FW_WR_NUM); + base_addr >>= 16; + ctx1 = (u32) base_addr; + base_addr >>= 32; + ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) | + V_EC_TYPE(0) | V_EC_GEN(1) | + 
V_EC_UP_TOKEN(FW_RI_TID_START) | F_EC_VALID)) << 32; + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0x3, 1, + T3_CTRL_QP_ID, 0x4); + sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; + wqe->wrid.id1 = cpu_to_be64(sge_cmd); + wqe->ctx1 = cpu_to_be64(ctx1); + wqe->ctx0 = cpu_to_be64(ctx0); + DBG("CtrlQP dma_addr=0x%llx kaddr=%p size=%d\n", + (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq, + 1 << T3_CTRL_QP_SIZE_LOG2); + skb->priority = CPL_PRIORITY_CONTROL; + return (t3c_send(rdev_p->t3cdev_p, skb)); +} + +static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << T3_CTRL_QP_SIZE_LOG2) + * sizeof(union t3_wr), rdev_p->ctrl_qp.workq, + pci_unmap_addr(&rdev_p->ctrl_qp, mapping)); + return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID); +} + +/* write len bytes of data into addr (32B aligned address) + * If data is NULL, clear len bytes of memory to zero. + * caller acquires the sem before the call + */ +static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, + u32 len, void *data, int completion) +{ + u32 i, nr_wqe, copy_len; + u8 *copy_data; + u8 wr_len, utx_len; /* length in 8 byte flits */ + enum t3_wr_flags flag; + u64 *wqe; + u64 utx_cmd; + addr &= 0x7FFFFFF; + nr_wqe = len % 96 ? len / 96 + 1 : len / 96; /* 96B max per WQE */ + DBG("wptr=%d rptr=%d len=%d, nr_wqe=%d data=%p addr=0x%0x\n", + rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len, nr_wqe, data, + addr); + utx_len = 3; /* in 32B unit */ + for (i = 0; i < nr_wqe; i++) { + if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2)) { + DBG("ctrl_qp full wptr=0x%0x rptr=0x%0x, " + "wait for more space i=%d\n", rdev_p->ctrl_qp.wptr, + rdev_p->ctrl_qp.rptr, i); + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + !Q_FULL(rdev_p->ctrl_qp. + rptr, + rdev_p->ctrl_qp. 
+ wptr, + T3_CTRL_QP_SIZE_LOG2))) { + DBG("ctrl_qp workq wakeup due to interrupt\n"); + return -ERESTARTSYS; + } + DBG("ctrl_qp wakeup, continue posting work request " + "i=%d\n", i); + } + wqe = (u64 *) (rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + flag = 0; + if (i == (nr_wqe - 1)) { + /* last WQE */ + flag = completion ? T3_COMPLETION_FLAG : 0; + if (len % 32) + utx_len = len / 32 + 1; + else + utx_len = len / 32; + } + + /* + * Force a CQE to return the credit to the workq in case + * we posted more than half the max QP size of WRs + */ + if ((i != 0) && + (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) { + flag = T3_COMPLETION_FLAG; + DBG("force a completion at i=%d\n", i); + } + + /* build the utx mem command */ + wqe += (sizeof(struct t3_bypass_wr) >> 3); + utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3); + utx_cmd <<= 32; + utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1); + *wqe = cpu_to_be64(utx_cmd); + wqe++; + copy_data = (u8 *) data + i * 96; + copy_len = len > 96 ? 
96 : len; + + /* clear memory content if data is NULL */ + if (data) + memcpy(wqe, copy_data, copy_len); + else + memset(wqe, 0, copy_len); + if (copy_len % 32) + memset(((u8 *) wqe) + copy_len, 0, + 32 - (copy_len % 32)); + wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 + + (utx_len << 2); + wqe = (u64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + + /* wptr in the WRID[31:0] */ + *(wqe + 1) = cpu_to_be64((u64) rdev_p->ctrl_qp.wptr); + + /* + * This must be the last write with a memory barrier + * for the genbit + */ + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, + Q_GENBIT(rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, + wr_len); + if (flag == T3_COMPLETION_FLAG) + RING_DOORBELL(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); + len -= 96; + rdev_p->ctrl_qp.wptr++; + } + return 0; +} + +/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size + * OUT: stag index, actual pbl_size, pbl_addr allocated. + * TBD: shared memory region support + */ +static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, + u32 * stag, u8 stag_state, u32 pdid, + enum tpt_mem_type type, enum tpt_mem_perm perm, + u32 zbva, u64 to, u32 len, u8 page_size, u64 * pbl, + u32 * pbl_size, u32 * pbl_addr) +{ + int err; + struct tpt_entry tpt; + u32 stag_idx; + u32 wptr; + u32 pbl_size_save; + stag_state = stag_state > 0; + stag_idx = (*stag) >> 8; + pbl_size_save = reset_tpt_entry ? 0 : *pbl_size; + if ((!reset_tpt_entry) && !(*stag != T3_STAG_UNSET)) { + stag_idx = cxio_hal_get_stag(rdev_p->rscp); + if (!stag_idx) + return -ENOMEM; + *stag = (stag_idx << 8) | ((*stag) & 0xFF); + } + DBG("stag_state=%0x type=%0x pdid=%0x, stag_idx = 0x%x`\n", + stag_state, type, pdid, stag_idx); + + + /* allocate pbl entries if requested size >0 */ + if (pbl_size_save) { + + /* + * TBD: pbl resource management. + * For now, give each stag a 2KB pbl region, i.e. 
256 pages + */ + if ((*pbl_size) > 256) { + DBG("TBD: PBL allocation failure: fixed 256 entries " + "for now\n"); + return -ENOMEM; + } + *pbl_addr = (stag_idx << 8); + + /* update the actual pbl_size allocated */ + *pbl_size = 256; + } + down_interruptible(&rdev_p->ctrl_qp.sem); + + /* write PBL first if any - update pbl only if pbl list exists */ + if (pbl) { + + DBG("*pbl_addr %x, pbl_base %x, pbl_size_save %d\n", + *pbl_addr, rdev_p->rnic_info.pbl_base, pbl_size_save); + err = cxio_hal_ctrl_qp_write_mem(rdev_p, ((*pbl_addr) >> 2) + + (rdev_p->rnic_info.pbl_base >> 5), + (pbl_size_save << 3), pbl, 0); + if (err) + goto ret; + } + + /* write TPT entry */ + if (reset_tpt_entry) { + memset(&tpt, 0, sizeof(tpt)); + } else { + tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID | + V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) | + V_TPT_STAG_STATE(stag_state) | + V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid)); + BUG_ON(page_size >= 28); + tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) | + F_TPT_MW_BIND_ENABLE | + V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) | + V_TPT_PAGE_SIZE(page_size)); + tpt.rsvd_pbl_addr = pbl_size_save ? + cpu_to_be32(V_TPT_PBL_ADDR(*pbl_addr)) : 0; + tpt.len = cpu_to_be32(len); + tpt.va_hi = cpu_to_be32((u32) (to >> 32)); + tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL)); + tpt.rsvd_bind_cnt_or_pstag = 0; + tpt.rsvd_pbl_size = pbl_size_save ? + cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2)) : 0; + } + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + stag_idx + + (rdev_p->rnic_info.tpt_base >> 5), + sizeof(tpt), &tpt, 1); + + /* release the stag index to free pool */ + if (reset_tpt_entry) + cxio_hal_put_stag(rdev_p->rscp, stag_idx); +ret: + wptr = rdev_p->ctrl_qp.wptr; + up(&rdev_p->ctrl_qp.sem); + if (!err) { + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + SEQ32_GE(rdev_p->ctrl_qp.rptr, + wptr))) + return -ERESTARTSYS; + } + return err; +} + +/* IN : stag key, pdid, pbl_size + * Out: stag index, actual pbl_size, and pbl_addr allocated.
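A note for reviewers on the stag layout used throughout __cxio_tpt_op: the HAL-allocated index lives in the upper 24 bits of the stag and the caller's 8-bit key in the low byte, and for now each index also gets a fixed 256-entry (2KB) PBL region at stag_idx << 8. A stand-alone sketch of that arithmetic (helper names are illustrative, not from the patch):

```c
#include <stdint.h>

/* Illustrative helpers mirroring __cxio_tpt_op's stag handling:
 * *stag = (stag_idx << 8) | ((*stag) & 0xFF) packs index and key. */
static uint32_t make_stag(uint32_t stag_idx, uint8_t key)
{
	return (stag_idx << 8) | key;
}

/* Fixed 256-entry PBL region per index: pbl_addr = stag_idx << 8. */
static uint32_t stag_to_pbl_addr(uint32_t stag)
{
	return (stag >> 8) << 8;
}

/* One u64 page address per PBL entry, so 256 entries = 2KB. */
static uint32_t pbl_bytes(uint32_t pbl_entries)
{
	return pbl_entries * 8;
}
```

The same stag >> 8 recovery of the index is what cxio_dump_tpt in the debug patch uses to locate a TPT entry.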
+ */ +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr) +{ + *stag = T3_STAG_UNSET; + return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr)); +} + +int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, u64 * pbl, u32 * pbl_size, + u32 * pbl_addr) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, u64 * pbl, u32 * pbl_size, + u32 * pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag) +{ + /* TBD: check if there is any MW bound to the MR */ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + NULL, NULL); +} + +int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + u32 pbl_size = 0; + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0, + NULL, &pbl_size, NULL); +} + +int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + NULL, NULL); +} + +int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) +{ + struct t3_rdma_init_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC); + if (!skb) + return -ENOMEM; + DBG("%s %d\n", __FUNCTION__, __LINE__); + wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe)); + wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT)); + wqe->wrh.gen_tid_len = 
cpu_to_be32(V_FW_RIWR_TID(attr->tid) | + V_FW_RIWR_LEN(sizeof(*wqe) >> 3)); + wqe->wrid.id1 = 0; + wqe->qpid = cpu_to_be32(attr->qpid); + wqe->pdid = cpu_to_be32(attr->pdid); + wqe->scqid = cpu_to_be32(attr->scqid); + wqe->rcqid = cpu_to_be32(attr->rcqid); + wqe->rq_addr = cpu_to_be32(attr->rq_addr); + wqe->rq_size = cpu_to_be32(attr->rq_size); + wqe->mpaattrs = attr->mpaattrs; + wqe->qpcaps = attr->qpcaps; + wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss); + wqe->rqes_posted = cpu_to_be32(attr->rqes_posted); + wqe->ord = cpu_to_be32(attr->ord); + wqe->ird = cpu_to_be32(attr->ird); + wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr); + wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size); + wqe->rsvd = 0; + skb->priority = 0; /* 0=>ToeQ; 1=>CtrlQ */ + return (t3c_send(rdev_p->t3cdev_p, skb)); +} + +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = ev_cb; +} + +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = NULL; +} + +static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb) +{ + static int cnt; + struct cxio_rdev *rdev_p = NULL; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data; + DBG("%d: cxio_hal_ev_handler being called for CQ_ID(%d), " + "overflow=%0x, notify=%0x with CQE:\n", cnt, + be16_to_cpu(rsp_msg->cq_id), rsp_msg->cq_overflow, + rsp_msg->cq_notify); + DBG("QPID=%0x genbit=%0x type=%0x Status=%0x opcode=%0x " + "len=%0x wrid_hi_stag=%x wrid_low_msn=%x\n", + CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + rdev_p = (struct cxio_rdev *)t3cdev_p->ulp; + if (!rdev_p) { + DBG("cxio_hal_ev_handler called by t3cdev (%p) with null!\n", + t3cdev_p); + return 0; + } + if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) { + rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1; + 
wake_up_interruptible(&rdev_p->ctrl_qp.waitq); + dev_kfree_skb_irq(skb); + } else if (cxio_ev_cb) { + (*cxio_ev_cb) (rdev_p, skb); + } else { + dev_kfree_skb_irq(skb); + } + DBG("ev call back wptr=%d rptr=%d\n", rdev_p->ctrl_qp.wptr, + rdev_p->ctrl_qp.rptr); + cnt++; + return 0; +} + +/* Caller takes care of locking if needed */ +int cxio_rdev_open(struct cxio_rdev *rdev_p) +{ + struct net_device *netdev_p = NULL; + int err = 0; + if (strlen(rdev_p->dev_name)) { + if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) { + return -EBUSY; + } + netdev_p = dev_get_by_name(rdev_p->dev_name); + if (!netdev_p) { + DBG("dev_get_by_name(%s) failed\n", rdev_p->dev_name); + return -EINVAL; + } + dev_put(netdev_p); + } else if (rdev_p->t3cdev_p) { + if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) { + return -EBUSY; + } + netdev_p = rdev_p->t3cdev_p->lldev; + strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name, + T3_MAX_DEV_NAME_LEN); + } else { + DBG("t3cdev_p or dev_name must be set\n"); + return -EINVAL; + } + + if (cxio_hal_add_rdev(rdev_p)) { + DBG("max number of RNIC supported exceeded\n"); + return -ENOMEM; + } + + DBG("opening rnic dev %s\n", rdev_p->dev_name); + memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp)); + if (!rdev_p->t3cdev_p) + rdev_p->t3cdev_p = T3CDEV(netdev_p); + rdev_p->t3cdev_p->ulp = (void *) rdev_p; + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS, + &(rdev_p->rnic_info)); + if (err) { + printk("%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + DBG("rnic %s info: tpt_base=0x%0x tpt_top=0x%0x pbl_base=0x%0x " + "pbl_top=0x%0x rqt_base=0x%0x, rqt_top=0x%0x\n", + rdev_p->dev_name, rdev_p->rnic_info.tpt_base, + rdev_p->rnic_info.tpt_top, rdev_p->rnic_info.pbl_base, + rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base, + rdev_p->rnic_info.rqt_top); + DBG("udbell_len=0x%0x udbell_physbase=0x%lx " + "kdb_addr=%p\n", rdev_p->rnic_info.udbell_len, + rdev_p->rnic_info.udbell_physbase, 
rdev_p->rnic_info.kdb_addr); + + err = cxio_hal_init_ctrl_qp(rdev_p); + if (err) { + printk("%s error %d initializing ctrl_qp.\n", + __FUNCTION__, err); + goto err1; + } + err = cxio_hal_init_resource(&rdev_p->rscp, T3_MAX_NUM_STAG, 0, + 0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ, + T3_MAX_NUM_PD); + if (err) { + printk(KERN_ERR "%s error %d initializing hal resources.\n", + __FUNCTION__, err); + goto err2; + } + return 0; +err2: + cxio_hal_destroy_ctrl_qp(rdev_p); +err1: + cxio_hal_delete_rdev(rdev_p); + return err; +} + +void cxio_rdev_close(struct cxio_rdev *rdev_p) +{ + if (rdev_p) { + cxio_hal_delete_rdev(rdev_p); + rdev_p->t3cdev_p->ulp = NULL; + cxio_hal_destroy_ctrl_qp(rdev_p); + cxio_hal_destroy_resource(rdev_p->rscp); + } +} + +int __init cxio_hal_init(void) +{ + if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI)) + return -ENOMEM; + memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *)); + t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler); + return 0; +} + +void __exit cxio_hal_exit(void) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) { + cxio_rdev_close(rdev_tbl[i]); + } + cxio_hal_destroy_rhdl_resource(); +} + +int cxio_peek_cq(struct t3_wq *wq, struct t3_cq *cq, int cqe_opcode) +{ + struct t3_cqe *peek_cqe; + u32 peekptr; + + peekptr = cq->rptr; + peek_cqe = cq->queue + Q_PTR2IDX(peekptr, cq->size_log2); + + /* + * see if the cqe with the requested opcode is here already. 
+ */ + while (CQ_VLD_ENTRY(peekptr, cq->size_log2, peek_cqe)) { + if ((RQ_TYPE(*peek_cqe)) && + (CQE_OPCODE(*peek_cqe) == cqe_opcode) && + (CQE_QPID(*peek_cqe) == wq->qpid)) { + return 0; + } else { + ++(peekptr); + peek_cqe = cq->queue + + Q_PTR2IDX(peekptr, cq->size_log2); + } + if (peekptr == cq->rptr) { /* CQ full */ + /* Don't handle error here */ + /* Don't reset timer */ + return 0; + } + } + + /* + * The opcode was not found + */ + return -EAGAIN; +} + +static inline void create_read_req_cqe(struct t3_rdma_read_wr *wr, + struct t3_cqe *response_cqe, + struct t3_cqe *read_cqe) +{ + DBG("%s %d enter\n", __FUNCTION__, __LINE__); + + /* + * Now that we found the read response cqe, + * we build a proper read request sq cqe to + * return to the user, using the read request WR + * and bits of the read response cqe. + */ + read_cqe->header = + V_CQE_STATUS(CQE_STATUS(*response_cqe)) | + V_CQE_OPCODE(T3_READ_REQ) | + V_CQE_TYPE(1) | + V_CQE_QPID(CQE_QPID(*response_cqe)); + read_cqe->header = cpu_to_be32(read_cqe->header); + CQE_WRID_SQ_WPTR(*read_cqe) = wr->wrid.id0.hi; + CQE_WRID_WPTR(*read_cqe) = wr->wrid.id0.low; + read_cqe->len = wr->local_len; /* XXX Violates RDMAC but matches IB */ +} + +/* + * Slow path poll code. + */ +int __cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, + struct t3_cqe *cqe, u8 * cqe_flushed, + u64 * cookie, u32 * credit) +{ + int ret = 0; + struct t3_cqe *rd_cqe, *peek_cqe, read_cqe; + u32 peekptr; + + rd_cqe = cxio_next_cqe(cq); + + BUG_ON(!rd_cqe); + + /* + * skip cqe's not affiliated with a QP. + */ + if (wq == NULL) { + ret = -1; + goto skip_cqe; + } + + /* + * If this CQE was already returned (out of order completion) + * then silently toss it. 
+ */ + if (CQE_OPCODE(*rd_cqe) == T3_READ_RESP && + (!wq->sq_oldest_wr || + (wq->sq_oldest_wr->send.rdmaop != T3_READ_REQ))) { + DBG("%s %d dropping old read response cqe\n", + __FUNCTION__, __LINE__); + ret = -1; + goto skip_cqe; + } + + if (CQE_OPCODE(*rd_cqe) == T3_TERMINATE) { + ret = -1; + wq->error = 1; + goto skip_cqe; + } + + if (CQE_STATUS(*rd_cqe) || wq->error) { + ret = 0; + *cqe_flushed = wq->error; + wq->error = 1; + + /* + * T3A inserts errors into the CQE. We cannot return + * these as work completions. + */ + /* incoming write failures */ + if ((CQE_OPCODE(*rd_cqe) == T3_RDMA_WRITE) + && RQ_TYPE(*rd_cqe)) { + ret = -1; + goto skip_cqe; + } + /* incoming read request failures */ + if ((CQE_OPCODE(*rd_cqe) == T3_READ_RESP) && SQ_TYPE(*rd_cqe)) { + ret = -1; + goto skip_cqe; + } + + /* incoming SEND with no receive posted failures */ + if ((CQE_OPCODE(*rd_cqe) == T3_SEND) && RQ_TYPE(*rd_cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) { + ret = -1; + goto skip_cqe; + } + goto proc_cqe; + } + + /* + * If this WQ's oldest pending SQ WR is a read request, then we + * must try and find the RQ Read Response which might not + * be the next CQE for that WQ on the CQ (reads can complete + * out of order). If its not in the CQ yet, then we must return + * "empty". This ensures we don't complete a subsequent WR + * out of order... + */ + + /* + * XXX This stalls the CQ for all QPs. Need to redesign this later + * to only stall the WQ in question. + */ + if (wq->sq_oldest_wr && + (wq->sq_oldest_wr->send.rdmaop == T3_READ_REQ)) { + DBG("%s %d oldest wr is read!\n", __FUNCTION__, __LINE__); + peekptr = cq->rptr; + peek_cqe = cq->queue + Q_PTR2IDX(peekptr, cq->size_log2); + + /* + * see if the read response is here already. 
+ */ + while (CQ_VLD_ENTRY(peekptr, cq->size_log2, peek_cqe)) { + if ((RQ_TYPE(*peek_cqe)) && + (CQE_OPCODE(*peek_cqe) == T3_READ_RESP) && + (CQE_QPID(*peek_cqe) == wq->qpid)) { + create_read_req_cqe(&wq->sq_oldest_wr->read, + peek_cqe, &read_cqe); + rd_cqe = &read_cqe; + ret = 0; + goto proc_cqe; + } else { + ++peekptr; + peek_cqe = cq->queue + + Q_PTR2IDX(peekptr, cq->size_log2); + } + if (peekptr == cq->rptr) { /* CQ full */ + wq->error = 1; + *cqe_flushed = 1; + ret = 0; + goto proc_cqe; + } + } + + /* + * The read response hasn't happened, so we cannot return + * any other completion event for this WQ. + */ + ret = -1; + goto ret_cqe; + } + + /* + * HW only validates 4 bits of MSN. So we must validate that + * the MSN in the SEND is the next expected MSN. If its not, + * then we complete this with TPT_ERR_MSN and mark the wq in error. + */ + if (RQ_TYPE(*rd_cqe) && (CQE_WRID_MSN(*rd_cqe) != (wq->rq_rptr + 1))) { + ret = 0; + wq->error = 1; + (*rd_cqe).header = cpu_to_be32(cpu_to_be32((*rd_cqe).header) | + V_CQE_STATUS(TPT_ERR_MSN)); + goto proc_cqe; + } + +proc_cqe: + *cqe = *rd_cqe; + + /* + * Reap the associated WR(s) that are freed up with this + * completion. + */ + if (SQ_TYPE(*rd_cqe)) { + BUG_ON(!wq->sq_oldest_wr); + wq->sq_rptr = CQE_WRID_SQ_WPTR(*rd_cqe) + 1; + BUG_ON((wq->sq_oldest_wr-wq->queue) != + Q_PTR2IDX(CQE_WRID_WPTR(*rd_cqe), wq->size_log2)); + *cookie = wq->queue[Q_PTR2IDX(CQE_WRID_WPTR(*rd_cqe), + wq->size_log2) + ].flit[T3_SQ_COOKIE_FLIT]; + wq->sq_oldest_wr = next_sq_wr(wq); + } else { + *cookie = wq->rq[Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)]; + ++(wq->rq_rptr); + } + + /* If we created a READ_REQ CQE, don't skip this one */ + if (rd_cqe == &read_cqe) + goto ret_cqe; +skip_cqe: + if (SW_CQE(*rd_cqe)) { + DBG("skip sw cqe sw_rptr %x\n", cq->sw_rptr); + ++cq->sw_rptr; + } else { + DBG("cq %p cqid %d skip hw cqe rptr %x\n", cq, cq->cqid, + cq->rptr); + ++cq->rptr; + + /* + * compute credits. 
+ */ + if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1))) + || ((cq->rptr - cq->wptr) >= 128)) { + *credit = cq->rptr - cq->wptr; + cq->wptr = cq->rptr; + } + } + +ret_cqe: + return ret; +} + +EXPORT_SYMBOL(__cxio_poll_cq); +EXPORT_SYMBOL(cxio_peek_cq); +EXPORT_SYMBOL(cxio_hal_cq_op); +EXPORT_SYMBOL(cxio_hal_clear_qp_ctx); +EXPORT_SYMBOL(cxio_create_cq); +EXPORT_SYMBOL(cxio_destroy_cq); +EXPORT_SYMBOL(cxio_resize_cq); +EXPORT_SYMBOL(cxio_create_qp); +EXPORT_SYMBOL(cxio_destroy_qp); +EXPORT_SYMBOL(cxio_allocate_stag); +EXPORT_SYMBOL(cxio_register_phys_mem); +EXPORT_SYMBOL(cxio_reregister_phys_mem); +EXPORT_SYMBOL(cxio_dereg_mem); +EXPORT_SYMBOL(cxio_allocate_window); +EXPORT_SYMBOL(cxio_deallocate_window); +EXPORT_SYMBOL(cxio_rdma_init); +EXPORT_SYMBOL(cxio_hal_get_rhdl); +EXPORT_SYMBOL(cxio_hal_put_rhdl); +EXPORT_SYMBOL(cxio_hal_get_pdid); +EXPORT_SYMBOL(cxio_hal_put_pdid); +EXPORT_SYMBOL(cxio_register_ev_cb); +EXPORT_SYMBOL(cxio_unregister_ev_cb); +EXPORT_SYMBOL(cxio_rdev_open); +EXPORT_SYMBOL(cxio_rdev_close); diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h new file mode 100644 index 0000000..37db2b5 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h @@ -0,0 +1,166 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_HAL_H__ +#define __CXIO_HAL_H__ + +#include "t3_cpl.h" +#include "defs.h" +#include "t3cdev.h" +#include "cxgb3_ctl_defs.h" +#include "cxio_wr.h" + +#define T3_CTRL_QP_ID FW_RI_SGEEC_START +#define T3_CTL_QP_TID FW_RI_TID_START +#define T3_CTRL_QP_SIZE_LOG2 10 +#define T3_CTRL_CQ_ID 0 + +/* TBD */ +#define T3_MAX_NUM_RNIC 8 +#define T3_MAX_NUM_RI (1<<15) +#define T3_MAX_NUM_QP (1<<15) +#define T3_MAX_NUM_CQ (1<<15) +#define T3_MAX_NUM_PD (1<<15) +#define T3_MAX_NUM_STAG (1<<13) +#define T3_MAX_PBL_SIZE 256 +#define T3_MAX_RQ_SIZE 1024 + +#define T3_STAG_UNSET 0xffffffff + +#define T3_MAX_DEV_NAME_LEN 32 + +struct cxio_hal_ctrl_qp { + u32 wptr; + u32 rptr; + struct semaphore sem; /* for the wtpr, can sleep */ + wait_queue_head_t waitq; /* wait for RspQ/CQE msg */ + union t3_wr *workq; /* the work request queue */ + dma_addr_t dma_addr; /* pci bus address of the workq */ + DECLARE_PCI_UNMAP_ADDR(mapping) + void __iomem *doorbell; +}; + +struct cxio_hal_resource { + struct kfifo *tpt_fifo; + spinlock_t tpt_fifo_lock; + struct kfifo *qpid_fifo; + spinlock_t qpid_fifo_lock; + struct kfifo *cqid_fifo; + spinlock_t cqid_fifo_lock; + struct kfifo *pdid_fifo; + spinlock_t pdid_fifo_lock; +}; + +struct cxio_rdev { + char dev_name[T3_MAX_DEV_NAME_LEN]; + struct t3cdev *t3cdev_p; + struct rdma_info rnic_info; + struct cxio_hal_resource *rscp; + struct cxio_hal_ctrl_qp ctrl_qp; + void *ulp; +}; + +typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p, + struct sk_buff * skb); + +struct respQ_msg_t { + u32 opaque0:32; + u32 opaque1:8; + u32 cq_overflow:1; /* bit 16 */ + u32 opaque2:7; + u32 opaque3:16; + + u32 opaque4:2; + u32 cq_notify:1; /* bit 58 */ + u32 opaque5:5; + u32 opaque6:24; + u32 opaque7:16; + u32 cq_id:16; /* bit [15:0] */ + + struct t3_cqe cqe; +}; + +enum t3_cq_opcode { + CQ_ARM_AN = 0x2, + CQ_ARM_SE = 0x6, + CQ_FORCE_AN = 0x3, + CQ_CREDIT_UPDATE = 0x7 +}; + +int cxio_rdev_open(struct cxio_rdev *rdev); +void 
cxio_rdev_close(struct cxio_rdev *rdev); +int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit); +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid); +int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq); +int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq); +int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr); +int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, u64 * pbl, u32 * pbl_size, + u32 * pbl_addr); +int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, u64 * pbl, u32 * pbl_size, + u32 * pbl_addr); +int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag); +int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); +int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +u32 cxio_hal_get_rhdl(void); +void cxio_hal_put_rhdl(u32 rhdl); +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); +int __init cxio_hal_init(void); +void __exit cxio_hal_exit(void); +void cxio_flush_rq(struct cxio_rdev *dev, struct t3_wq *wq, struct t3_cq *cq); +void cxio_flush_sq(struct cxio_rdev *dev, struct t3_wq *wq, struct t3_cq *cq); + +#define DBG(fmt, args...) 
pr_debug("iw_cxgb3: " fmt, ## args) + +#ifdef DEBUG +void cxio_dump_tpt(struct cxio_rdev *rev, u32 stag); +void cxio_dump_pbl(struct cxio_rdev *rev, u32 pbl_addr, uint len, u8 shift); +void cxio_dump_wqe(union t3_wr *wqe); +void cxio_dump_wce(struct t3_cqe *wce); +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents); +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid); +#endif + +#endif From swise at opengridcomputing.com Fri Jun 23 07:30:05 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:05 -0500 Subject: [openib-general] [PATCH v2 08/14] CXGB3 RDMA Core Resource Allocation In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143005.32410.4680.stgit@stevo-desktop> This patch implements resource allocation services for assigning unique IDs to the various objects. ISSUE: - this uses kfifos to manage what is basically a list of numbers to dish out as QPIDs, CQIDs, STAGs. A bitmap would be more efficient memory-wise, but there is an issue with STAG indices: They are supposed to be random. This code randomizes the stag kfifo. --- drivers/infiniband/hw/cxgb3/core/cxio_resource.c | 255 ++++++++++++++++++++++ 1 files changed, 255 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c new file mode 100644 index 0000000..8c8bfb5 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +/* Crude resource management */ +#include +#include +#include +#include +#include +#include +#include "cxio_hal.h" + +static struct kfifo *rhdl_fifo; +static spinlock_t rhdl_fifo_lock; + +#define RANDOM_SIZE 16 + + +/* Loosely based on the Mersenne twister algorithm */ +static u32 next_random(u32 rand) +{ + u32 y, ylast; + + y = rand; + ylast = y; + y = (y * 69069) & 0xffffffff; + y = (y & 0x80000000) + (ylast & 0x7fffffff); + if ((y & 1)) + y = ylast ^ (y > 1) ^ (2567483615UL); + else + y = ylast ^ (y > 1); + y = y ^ (y >> 11); + y = y ^ ((y >> 7) & 2636928640UL); + y = y ^ ((y >> 15) & 4022730752UL); + y = y ^ (y << 18); + return y; +} +static int __cxio_init_resource_fifo(struct kfifo **fifo, + spinlock_t *fifo_lock, + u32 nr, u32 skip_low, + u32 skip_high, + int random) +{ + u32 i, j, entry = 0, idx; + u32 random_bytes; + u32 rarray[16]; + spin_lock_init(fifo_lock); + + *fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock); + if (IS_ERR(*fifo)) + return -ENOMEM; + + for (i = 0; i < skip_low + skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32)); + if (random) { + j = 0; + get_random_bytes(&random_bytes,sizeof(random_bytes)); + for (i = 0; i < RANDOM_SIZE; i++) + rarray[i] = i + skip_low; + for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) { + if (j >= RANDOM_SIZE) { + j = 0; + random_bytes = next_random(random_bytes); + } + idx = (random_bytes >> (j * 2)) & 0xF; + __kfifo_put(*fifo, + (unsigned char *) &rarray[idx], + sizeof(u32)); + rarray[idx] = i; + j++; + } + for (i = 0; i < RANDOM_SIZE; i++) + __kfifo_put(*fifo, + (unsigned char *) &rarray[i], + sizeof(u32)); + } else + for (i = skip_low; i < nr - skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32)); + + for (i = 0; i < skip_low + skip_high; i++) + kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32)); + return 0; +} + +static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + 
return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 0)); +} + +static int cxio_init_resource_fifo_random(struct kfifo **fifo, + spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 1)); +} + +int cxio_hal_init_rhdl_resource(u32 nr_rhdl) +{ + return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1, + 0); +} + +void cxio_hal_destroy_rhdl_resource(void) +{ + kfifo_free(rhdl_fifo); +} + +/* nr_* must be power of 2 */ +int cxio_hal_init_resource(struct cxio_hal_resource **rscpp, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid) +{ + int err = 0; + struct cxio_hal_resource *rscp; + rscp = kmalloc(sizeof(*rscp), GFP_KERNEL); + if (!rscp) { + return -ENOMEM; + } + *rscpp = rscp; + err = cxio_init_resource_fifo_random(&rscp->tpt_fifo, + &rscp->tpt_fifo_lock, + nr_tpt, 1, 0); + if (err) + goto tpt_err; + err = cxio_init_resource_fifo(&rscp->qpid_fifo, &rscp->qpid_fifo_lock, + nr_qpid, 16, 16); + if (err) + goto qpid_err; + err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, + nr_cqid, 1, 0); + if (err) + goto cqid_err; + err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, + nr_pdid, 1, 0); + if (err) + goto pdid_err; + return 0; +pdid_err: + kfifo_free(rscp->cqid_fifo); +cqid_err: + kfifo_free(rscp->qpid_fifo); +qpid_err: + kfifo_free(rscp->tpt_fifo); +tpt_err: + return -ENOMEM; +} + +/* + * returns 0 if no resource available + */ +static inline u32 cxio_hal_get_resource(struct kfifo *fifo) +{ + u32 entry; + if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32))) + return entry; + else + return 0; /* fifo empty */ +} + +static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry) +{ + BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0); +} + +u32 cxio_hal_get_rhdl(void) +{ + return cxio_hal_get_resource(rhdl_fifo); +} + +void
cxio_hal_put_rhdl(u32 rhdl) +{ + cxio_hal_put_resource(rhdl_fifo, rhdl); +} + +u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->tpt_fifo); +} + +void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag) +{ + cxio_hal_put_resource(rscp->tpt_fifo, stag); +} + +u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->qpid_fifo); +} + +void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid) +{ + cxio_hal_put_resource(rscp->qpid_fifo, qpid); +} + +u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->cqid_fifo); +} + +void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid) +{ + cxio_hal_put_resource(rscp->cqid_fifo, cqid); +} + +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->pdid_fifo); +} + +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid) +{ + cxio_hal_put_resource(rscp->pdid_fifo, pdid); +} + +void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp) +{ + kfifo_free(rscp->tpt_fifo); + kfifo_free(rscp->cqid_fifo); + kfifo_free(rscp->qpid_fifo); + kfifo_free(rscp->pdid_fifo); + kfree(rscp); +} From swise at opengridcomputing.com Fri Jun 23 07:30:10 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:10 -0500 Subject: [openib-general] [PATCH v2 09/14] CXGB3 RDMA Core Types. In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143010.32410.83385.stgit@stevo-desktop> This patch contains all the HW-specific types. Also included is an inline fastpath cq_poll() function.
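The queue types in this patch (t3_wq/t3_cq) rely on free-running ring-pointer arithmetic via the Q_EMPTY/Q_FULL/Q_PTR2IDX/Q_GENBIT macros: the read and write pointers increment without bound, the array index is the pointer masked by (size - 1), and a generation bit flips on each wrap so valid entries can be told apart from stale ones. A minimal standalone sketch of that arithmetic (helper names are illustrative, not the driver's macros):

```c
#include <stdint.h>

/* rptr/wptr are free-running u32 counters; size is 1 << size_log2. */
static inline int q_empty(uint32_t rptr, uint32_t wptr)
{
	return rptr == wptr;
}

static inline int q_full(uint32_t rptr, uint32_t wptr, unsigned size_log2)
{
	/* full when the pointers are a whole queue-length apart */
	return ((wptr - rptr) >> size_log2) && (rptr != wptr);
}

static inline uint32_t q_count(uint32_t rptr, uint32_t wptr)
{
	return wptr - rptr;	/* correct even across u32 wraparound */
}

static inline uint32_t q_ptr2idx(uint32_t ptr, unsigned size_log2)
{
	/* index into the queue array: mask off everything above size */
	return ptr & ((1u << size_log2) - 1);
}

static inline unsigned q_genbit(uint32_t ptr, unsigned size_log2)
{
	/* flips each time ptr crosses a multiple of the queue size */
	return !((ptr >> size_log2) & 0x1);
}
```

Because the pointers are never masked until indexing, `wptr - rptr` gives the occupancy directly, and the bit just above the index field serves as the generation bit that hardware compares against the CQE genbit to validate entries.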
--- drivers/infiniband/hw/cxgb3/core/cxio_wr.h | 722 ++++++++++++++++++++++++++++ 1 files changed, 722 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h new file mode 100644 index 0000000..7c78dee --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h @@ -0,0 +1,722 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_WR_H__ +#define __CXIO_WR_H__ + +#include +#include +#include +#include "firmware_exports.h" + +#define T3_MAX_SGE 4 + +#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) +#define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ + ((rptr)!=(wptr)) ) +#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>size_log2)&0x1)) +#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<> S_FW_RIWR_OP)) & M_FW_RIWR_OP) + +#define S_FW_RIWR_SOPEOP 22 +#define M_FW_RIWR_SOPEOP 0x3 +#define V_FW_RIWR_SOPEOP(x) ((x) << S_FW_RIWR_SOPEOP) + +#define S_FW_RIWR_FLAGS 8 +#define M_FW_RIWR_FLAGS 0x3fffff +#define V_FW_RIWR_FLAGS(x) ((x) << S_FW_RIWR_FLAGS) +#define G_FW_RIWR_FLAGS(x) ((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS) + +#define S_FW_RIWR_TID 8 +#define V_FW_RIWR_TID(x) ((x) << S_FW_RIWR_TID) + +#define S_FW_RIWR_LEN 0 +#define V_FW_RIWR_LEN(x) ((x) << S_FW_RIWR_LEN) + +#define S_FW_RIWR_GEN 31 +#define V_FW_RIWR_GEN(x) ((x) << S_FW_RIWR_GEN) + +struct t3_sge { + u32 stag; + u32 len; + u64 to; +}; + +/* If num_sgle is zero, flit 5+ contains immediate data.*/ +struct t3_send_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + + enum t3_rdma_opcode rdmaop:8; + u32 reserved:24; /* 2 */ + u32 rem_stag; /* 2 */ + u32 plen; /* 3 */ + u32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ +}; + +struct t3_local_inv_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 stag; /* 2 */ + u32 reserved3; +}; + +struct t3_rdma_write_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + enum t3_rdma_opcode rdmaop:8; /* 2 */ + u32 reserved:24; /* 2 */ + u32 stag_sink; + u64 to_sink; /* 3 */ + u32 plen; /* 4 */ + u32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 5+ */ +}; + +struct t3_rdma_read_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + enum t3_rdma_opcode rdmaop:8; /* 2 */ + u32 reserved:24; + u32 rem_stag; + u64 rem_to; /* 3 */ + u32 local_stag; /* 4 */ + u32 local_len; + u64 local_to; /* 5 */ +}; + +enum 
t3_addr_type { + T3_VA_BASED_TO = 0x0, + T3_ZERO_BASED_TO = 0x1 +} __attribute__ ((packed)); + +enum t3_mem_perms { + T3_MEM_ACCESS_LOCAL_READ = 0x1, + T3_MEM_ACCESS_LOCAL_WRITE = 0x2, + T3_MEM_ACCESS_REM_READ = 0x4, + T3_MEM_ACCESS_REM_WRITE = 0x8 +} __attribute__ ((packed)); + +struct t3_bind_mw_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 reserved:16; + enum t3_addr_type type:8; + enum t3_mem_perms perms:8; /* 2 */ + u32 mr_stag; + u32 mw_stag; /* 3 */ + u32 mw_len; + u64 mw_va; /* 4 */ + u32 mr_pbl_addr; /* 5 */ + u32 reserved2:24; + u32 mr_pagesz:8; +}; + +struct t3_receive_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 pagesz[T3_MAX_SGE]; + u32 num_sgle; /* 2 */ + struct t3_sge sgl[T3_MAX_SGE]; /* 3+ */ + u32 pbl_addr[T3_MAX_SGE]; +}; + +struct t3_bypass_wr { + struct fw_riwrh wrh; + union t3_wrid wrid; /* 1 */ +}; + +struct t3_modify_qp_wr { + struct fw_riwrh wrh; + union t3_wrid wrid; + u64 ctx1; + u64 ctx0; +}; + +enum t3_mpa_attrs { + uP_RI_MPA_RX_MARKER_ENABLE = 0x1, + uP_RI_MPA_TX_MARKER_ENABLE = 0x2, + uP_RI_MPA_CRC_ENABLE = 0x4, + uP_RI_MPA_IETF_ENABLE = 0x8 +} __attribute__ ((packed)); + +enum t3_qp_caps { + uP_RI_QP_RDMA_READ_ENABLE = 0x01, + uP_RI_QP_RDMA_WRITE_ENABLE = 0x02, + uP_RI_QP_BIND_ENABLE = 0x04, + uP_RI_QP_FAST_REGISTER_ENABLE = 0x08, + uP_RI_QP_STAG0_ENABLE = 0x10 +} __attribute__ ((packed)); + +struct t3_rdma_init_attr { + u32 tid; + u32 qpid; + u32 pdid; + u32 scqid; + u32 rcqid; + u32 rq_addr; + u32 rq_size; + enum t3_mpa_attrs mpaattrs; + enum t3_qp_caps qpcaps; + u16 tcp_emss; + u32 ord; + u32 ird; + u64 qp_dma_addr; + u32 qp_dma_size; + u8 rqes_posted; +}; + +struct t3_rdma_init_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 qpid; /* 2 */ + u32 pdid; + u32 scqid; /* 3 */ + u32 rcqid; + u32 rq_addr; /* 4 */ + u32 rq_size; + enum t3_mpa_attrs mpaattrs:8; /* 5 */ + enum t3_qp_caps qpcaps:8; + u32 ulpdu_size:16; + u32 rqes_posted; /* bits 31-1 - reservered */ 
+ /* bit 0 - set if RECV posted */ + u32 ord; /* 6 */ + u32 ird; + u64 qp_dma_addr; /* 7 */ + u32 qp_dma_size; /* 8 */ + u32 rsvd; +}; + +union t3_wr { + struct t3_send_wr send; + struct t3_rdma_write_wr write; + struct t3_rdma_read_wr read; + struct t3_receive_wr recv; + struct t3_local_inv_wr local_inv; + struct t3_bind_mw_wr bind; + struct t3_bypass_wr bypass; + struct t3_rdma_init_wr init; + struct t3_modify_qp_wr qp_mod; + u64 flit[16]; +}; + +#define T3_SQ_CQE_FLIT 13 +#define T3_SQ_COOKIE_FLIT 14 + +#define T3_RQ_COOKIE_FLIT 13 +#define T3_RQ_CQE_FLIT 14 + +static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, + enum t3_wr_flags flags, u8 genbit, u32 tid, + u8 len) +{ + wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | + V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_FLAGS(flags)); + wmb(); + wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | + V_FW_RIWR_TID(tid) | + V_FW_RIWR_LEN(len)); + /* 2nd gen bit... */ + ((union t3_wr *)wqe)->flit[15] = cpu_to_be64(genbit); +} + +/* + * T3 ULP2_TX commands + */ +enum t3_utx_mem_op { + T3_UTX_MEM_READ = 2, + T3_UTX_MEM_WRITE = 3 +}; + +/* T3 MC7 RDMA TPT entry format */ + +enum tpt_mem_type { + TPT_NON_SHARED_MR = 0x0, + TPT_SHARED_MR = 0x1, + TPT_MW = 0x2, + TPT_MW_RELAXED_PROTECTION = 0x3 +}; + +enum tpt_addr_type { + TPT_ZBTO = 0, + TPT_VATO = 1 +}; + +enum tpt_mem_perm { + TPT_LOCAL_READ = 0x8, + TPT_LOCAL_WRITE = 0x4, + TPT_REMOTE_READ = 0x2, + TPT_REMOTE_WRITE = 0x1 +}; + +struct tpt_entry { + u32 valid_stag_pdid; + u32 flags_pagesize_qpid; + + u32 rsvd_pbl_addr; + u32 len; + u32 va_hi; + u32 va_low_or_fbo; + + u32 rsvd_bind_cnt_or_pstag; + u32 rsvd_pbl_size; +}; +#define S_TPT_VALID 31 +#define V_TPT_VALID(x) ((x) << S_TPT_VALID) +#define F_TPT_VALID V_TPT_VALID(1U) + +#define S_TPT_STAG_KEY 23 +#define M_TPT_STAG_KEY 0xFF +#define V_TPT_STAG_KEY(x) ((x) << S_TPT_STAG_KEY) +#define G_TPT_STAG_KEY(x) (((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY) + +#define S_TPT_STAG_STATE 22 
+#define V_TPT_STAG_STATE(x) ((x) << S_TPT_STAG_STATE) +#define F_TPT_STAG_STATE V_TPT_STAG_STATE(1U) + +#define S_TPT_STAG_TYPE 20 +#define M_TPT_STAG_TYPE 0x3 +#define V_TPT_STAG_TYPE(x) ((x) << S_TPT_STAG_TYPE) +#define G_TPT_STAG_TYPE(x) (((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE) + +#define S_TPT_PDID 0 +#define M_TPT_PDID 0xFFFFF +#define V_TPT_PDID(x) ((x) << S_TPT_PDID) +#define G_TPT_PDID(x) (((x) >> S_TPT_PDID) & M_TPT_PDID) + +#define S_TPT_PERM 28 +#define M_TPT_PERM 0xF +#define V_TPT_PERM(x) ((x) << S_TPT_PERM) +#define G_TPT_PERM(x) (((x) >> S_TPT_PERM) & M_TPT_PERM) + +#define S_TPT_REM_INV_DIS 27 +#define V_TPT_REM_INV_DIS(x) ((x) << S_TPT_REM_INV_DIS) +#define F_TPT_REM_INV_DIS V_TPT_REM_INV_DIS(1U) + +#define S_TPT_ADDR_TYPE 26 +#define V_TPT_ADDR_TYPE(x) ((x) << S_TPT_ADDR_TYPE) +#define F_TPT_ADDR_TYPE V_TPT_ADDR_TYPE(1U) + +#define S_TPT_MW_BIND_ENABLE 25 +#define V_TPT_MW_BIND_ENABLE(x) ((x) << S_TPT_MW_BIND_ENABLE) +#define F_TPT_MW_BIND_ENABLE V_TPT_MW_BIND_ENABLE(1U) + +#define S_TPT_PAGE_SIZE 20 +#define M_TPT_PAGE_SIZE 0x1F +#define V_TPT_PAGE_SIZE(x) ((x) << S_TPT_PAGE_SIZE) +#define G_TPT_PAGE_SIZE(x) (((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE) + +#define S_TPT_PBL_ADDR 0 +#define M_TPT_PBL_ADDR 0x1FFFFFFF +#define V_TPT_PBL_ADDR(x) ((x) << S_TPT_PBL_ADDR) +#define G_TPT_PBL_ADDR(x) (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR) + +#define S_TPT_QPID 0 +#define M_TPT_QPID 0xFFFFF +#define V_TPT_QPID(x) ((x) << S_TPT_QPID) +#define G_TPT_QPID(x) (((x) >> S_TPT_QPID) & M_TPT_QPID) + +#define S_TPT_PSTAG 0 +#define M_TPT_PSTAG 0xFFFFFF +#define V_TPT_PSTAG(x) ((x) << S_TPT_PSTAG) +#define G_TPT_PSTAG(x) (((x) >> S_TPT_PSTAG) & M_TPT_PSTAG) + +#define S_TPT_PBL_SIZE 0 +#define M_TPT_PBL_SIZE 0xFFFFF +#define V_TPT_PBL_SIZE(x) ((x) << S_TPT_PBL_SIZE) +#define G_TPT_PBL_SIZE(x) (((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE) + +/* + * CQE defs + */ +struct t3_cqe { + u32 header:32; + u32 len:32; + u32 wrid_hi_stag:32; + u32 wrid_low_msn:32; +}; + 
+#define S_CQE_QPID 12 +#define M_CQE_QPID 0xFFFFF +#define G_CQE_QPID(x) ((((x) >> S_CQE_QPID)) & M_CQE_QPID) +#define V_CQE_QPID(x) ((x)<> S_CQE_SWCQE)) & M_CQE_SWCQE) +#define V_CQE_SWCQE(x) ((x)<> S_CQE_GENBIT) & M_CQE_GENBIT) +#define V_CQE_GENBIT(x) ((x)<> S_CQE_STATUS)) & M_CQE_STATUS) +#define V_CQE_STATUS(x) ((x)<> S_CQE_TYPE)) & M_CQE_TYPE) +#define V_CQE_TYPE(x) ((x)<> S_CQE_OPCODE)) & M_CQE_OPCODE) +#define V_CQE_OPCODE(x) ((x)<sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +/* + * Return a ptr to the next signaled wr in the SQ or NULL. + */ +static inline union t3_wr *next_sq_wr(struct t3_wq *wq) +{ + union t3_wr *wr = wq->sq_oldest_wr; + int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr); + u32 wptr = wr - wq->queue + 1; + + BUG_ON(!wr); + while (count) { + u32 opflags; + wr = (union t3_wr *)(wq->queue+Q_PTR2IDX(wptr, wq->size_log2)); + + opflags = be32_to_cpu(wr->recv.wrh.op_seop_flags); + + /* XXX Reads always generate a completion. */ + if (G_FW_RIWR_OP(opflags) == T3_WR_READ) + return wr; + + /* Skip (and don't count) receives */ + if (G_FW_RIWR_OP(opflags) == T3_WR_RCV) { + wptr++; + continue; + } + + /* If this WR is signaled, return it. */ + if (G_FW_RIWR_FLAGS(opflags) & T3_COMPLETION_FLAG) + return wr; + wptr++; + count--; + } + return NULL; +} + +int __cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, + struct t3_cqe *cqe, u8 * cqe_flushed, + u64 * cookie, u32 * credit); + +#define FASTPATH_POLL + +/* + * Fastpath poll. + * + * Caller must: + * check the validity of the first CQE, + * supply the wq assicated with the qpid. + * credit: cq credit to return to sge. + * cqe_flushed: 1 iff the CQE is flushed. + * cqe: copy of the polled CQE. + * + * return value: + * 0 CQE returned, + * -1 CQE skipped, try again. 
+ */ +static inline int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, + struct t3_cqe *cqe, u8 * cqe_flushed, + u64 * cookie, u32 * credit) +{ +#ifdef FASTPATH_POLL + struct t3_cqe *rd_cqe; + + rd_cqe = cxio_next_cqe(cq); + + /* fastpath: + * wq is valid + * wq not in error + * cqe status not error + * opcode not TERMINATE + * opcode not read response + * wq->sq_oldest_wr is not a read request + * either its a SQ CQE -or- the MSN is correct in the RQ CQE + */ + if (likely(wq && !wq->error && + !CQE_STATUS(*rd_cqe) && + (CQE_OPCODE(*rd_cqe) != T3_TERMINATE) && + (CQE_OPCODE(*rd_cqe) != T3_READ_RESP) && + (!wq->sq_oldest_wr || + ( wq->sq_oldest_wr->send.rdmaop != T3_READ_REQ)) && + (SQ_TYPE(*rd_cqe) || (RQ_TYPE(*rd_cqe) && + (CQE_WRID_MSN(*rd_cqe) == + (wq->rq_rptr + 1)))))) { + *cqe = *rd_cqe; + *cqe_flushed = 0; + *credit = 0; + + /* + * Reap the associated WR(s) that are freed up with this + * completion. + */ + if (SQ_TYPE(*rd_cqe)) { + wq->sq_rptr = CQE_WRID_SQ_WPTR(*rd_cqe) + 1; + BUG_ON(!wq->sq_oldest_wr); + *cookie = wq->queue[Q_PTR2IDX(CQE_WRID_WPTR(*rd_cqe), + wq->size_log2) + ].flit[T3_SQ_COOKIE_FLIT]; + wq->sq_oldest_wr = next_sq_wr(wq); + } else { + *cookie = wq->rq[Q_PTR2IDX(wq->rq_rptr, + wq->rq_size_log2)]; + ++(wq->rq_rptr); + } + + if (SW_CQE(*rd_cqe)) { + ++cq->sw_rptr; + } else { + ++cq->rptr; + + /* + * compute credits. 
+ */ + if (((cq->rptr-cq->wptr) > (1 << (cq->size_log2 - 1))) + || ((cq->rptr - cq->wptr) >= 128)) { + *credit = cq->rptr - cq->wptr; + cq->wptr = cq->rptr; + } + } + return 0; + } +#endif + *cqe_flushed = 0; + *credit = 0; + return __cxio_poll_cq(wq, cq, cqe, cqe_flushed, cookie, credit); +} +#endif From swise at opengridcomputing.com Fri Jun 23 07:29:50 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:50 -0500 Subject: [openib-general] [PATCH v2 05/14] CXGB3 Connection Manager In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623142950.32410.76113.stgit@stevo-desktop> This patch contains the code to manage TCP connections and do MPA negotiation. It implements the IWCM device-specific methods. ISSUES: - IWCM should pass down a dst entry or at least the next hop ipaddr/macaddr. Currently this code looks up this info based on the source and destination ipaddr. - port management isn't correct. This should be moved into the core IWCM or CMA. It's not trivial to support native stack TCP port allocation/reservation. --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 2135 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_cm.h | 232 ++++ 2 files changed, 2367 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c new file mode 100644 index 0000000..897cb5e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -0,0 +1,2135 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" + +#ifdef DEBUG +char *states[] = { + "idle", + "listen", + "connecting", + "mpa_wait_req", + "mpa_req_sent", + "mpa_req_rcvd", + "mpa_rep_sent", + "fpdu_mode", + "aborting", + "closing", + "moribund", + "dead", + NULL, +}; +#endif + +static int ep_timeout_secs = 10; +module_param(ep_timeout_secs, int, 0444); +MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " + "in seconds (default=10)"); + +static int mpa_rev = 1; +module_param(mpa_rev, int, 0444); +MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, " + "1 is spec compliant. 
(default=1)"); + +static int markers_enabled = 0; +module_param(markers_enabled, int, 0444); +MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)"); + +static int crc_enabled = 1; +module_param(crc_enabled, int, 0444); +MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)"); + +static u16 port_start = 32768; +module_param(port_start, ushort, 0444); +MODULE_PARM_DESC(port_start, + "Starting port for ephemeral ports. (default=32768)"); + +static u16 port_end = 65535; +module_param(port_end, ushort, 0444); +MODULE_PARM_DESC(port_end, + "Ending port for ephemeral ports. (default=65535)"); + +static int rcv_win = 512 * 1024; +module_param(rcv_win, int, 0444); +MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)"); + +static void process_work(void *ctx); +static struct workqueue_struct *workq; +DECLARE_WORK(skb_work, process_work, NULL); + +static struct sk_buff_head rxq; +static t3c_cpl_handler_func work_handlers[NUM_CPL_CMDS]; + +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp); +static void ep_timeout(unsigned long arg); +static void connect_reply_upcall(struct iwch_ep *ep, int status); + +static void start_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s enter (%s line %u) ep %p\n", + __FUNCTION__, __FILE__, __LINE__, ep); + if (timer_pending(&ep->timer)) { + PDBG("%s stopped and restarted timer (%s line %u) ep %p\n", + __FUNCTION__, __FILE__, __LINE__, ep); + del_timer_sync(&ep->timer); + } else + ep_atomic_inc(&ep->com.refcnt); + ep->timer.expires = jiffies + ep_timeout_secs * HZ; + ep->timer.data = (unsigned long)ep; + ep->timer.function = ep_timeout; + add_timer(&ep->timer); +} + +static void stop_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + del_timer_sync(&ep->timer); + free_ep(&ep->com); +} + +/* + * Port bitmap to track which ports are in use. This should be + * global to all openib rnic devices... 
+ */ +static DECLARE_BITMAP(portbits, 65536); +static DEFINE_SPINLOCK(portlock); + +static int get_port(u16 *portp) +{ + u32 port = (u32)ntohs(*portp); + int ret = 0; + PDBG("%s enter (%s line %u) inp port %d\n", __FUNCTION__, __FILE__, __LINE__, port); + spin_lock(&portlock); + if (port == 0) { + port = find_next_zero_bit(portbits, 65536, port_start); + if (port > port_end) + ret = 1; + else + set_bit(port, portbits); + } else + if (test_and_set_bit(port, portbits)) + ret = 1; + spin_unlock(&portlock); + if (!ret) { + *portp = htons(port); + PDBG("%s alloc port %d\n", __FUNCTION__, port); + } + return ret; +} + +static void free_port(u16 port) +{ + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock(&portlock); + PDBG("%s free port %d\n", __FUNCTION__, ntohs(port)); + clear_bit((u32)ntohs(port), portbits); + spin_unlock(&portlock); +} + +int iwch_quiesce_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) { + return -ENOMEM; + } + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE); + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +int iwch_resume_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) { + return -ENOMEM; + } + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + 
req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = 0; + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static void set_emss(struct iwch_ep *ep, u16 opt) +{ + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40; + if (G_TCPOPT_TSTAMP(opt)) { + ep->emss -= 12; + } + if (ep->emss < 128) + ep->emss = 128; + PDBG("emss=%d\n", ep->emss); +} + +#if 0 +static int state_exch(struct iwch_ep_common *epc, enum iwch_ep_state exch) +{ + unsigned long flags; + int old; + + spin_lock_irqsave(&epc->lock, flags); + old = epc->state; + epc->state = exch; + spin_unlock_irqrestore(&epc->lock, flags); + return old; +} +#endif + +static int state_comp_exch(struct iwch_ep_common *epc, + enum iwch_ep_state comp, + enum iwch_ep_state exch) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&epc->lock, flags); + ret = (epc->state == comp); + if (ret) + epc->state = exch; + spin_unlock_irqrestore(&epc->lock, flags); + return ret; +} + +static enum iwch_ep_state state_read(struct iwch_ep_common *epc) +{ + unsigned long flags; + enum iwch_ep_state state; + + spin_lock_irqsave(&epc->lock, flags); + state = epc->state; + spin_unlock_irqrestore(&epc->lock, flags); + return state; +} + +static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new) +{ + unsigned long flags; + + spin_lock_irqsave(&epc->lock, flags); + PDBG(" %s - %s -> %s\n", __FUNCTION__, states[epc->state], + states[new]); + epc->state = new; + spin_unlock_irqrestore(&epc->lock, flags); + return; +} + +static void *alloc_ep(int size, gfp_t gfp) +{ + struct iwch_ep_common *epc; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + epc = kmalloc(size, gfp); + if (epc) { + memset(epc, 0, size); + atomic_set(&epc->refcnt, 1); + spin_lock_init(&epc->lock); + 
init_waitqueue_head(&epc->waitq); + } + PDBG("alloc ep %p\n", epc); + return (void *) epc; +} + +void __free_ep(struct iwch_ep_common *epc) +{ + PDBG("%s enter (%s line %u) ep %p, &refcnt %p state %s, refcnt %d\n", + __FUNCTION__, __FILE__, + __LINE__, epc, &epc->refcnt, + states[state_read(epc)], + atomic_read(&epc->refcnt)); + if (atomic_read(&epc->refcnt) == 1) { + goto out; + } + if (!atomic_dec_and_test(&epc->refcnt)) { + return; + } +out: + PDBG("free ep %p\n", epc); + free_port(epc->local_addr.sin_port); + kfree(epc); +} + +static void process_work(void *ctx) +{ + struct sk_buff *skb = NULL; + void *ep; + struct t3cdev *tdev; + int ret; + + while ((skb = skb_dequeue(&rxq))) { + ep = *((void **) (skb->cb)); + tdev = *((struct t3cdev **) (skb->cb + sizeof(void *))); + ret = work_handlers[G_OPCODE(ntohl(skb->csum))] + (tdev, skb, ep); + if (ret & CPL_RET_BUF_DONE) + kfree_skb(skb); + + /* + * ep was referenced in sched(), and is freed here. + */ + free_ep(ep); + } +} + +static int status2errno(int status) +{ + switch (status) { + case CPL_ERR_NONE: + return 0; + case CPL_ERR_CONN_RESET: + return -ECONNRESET; + case CPL_ERR_ARP_MISS: + return -EHOSTUNREACH; + case CPL_ERR_CONN_TIMEDOUT: + return -ETIMEDOUT; + case CPL_ERR_TCAM_FULL: + return -ENOMEM; + case CPL_ERR_CONN_EXIST: + return -EADDRINUSE; + default: + return -EIO; + } +} + +/* + * Try and reuse skbs already allocated... 
+ */ +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp) +{ + if (skb) { + BUG_ON(skb_cloned(skb)); + skb_trim(skb, 0); + skb_get(skb); + } else { + skb = alloc_skb(len, gfp); + } + return skb; +} + +static struct rtable *find_route(struct t3cdev *dev, + u32 local_ip, u32 peer_ip, u16 local_port, + u16 peer_port, u8 tos) +{ + struct rtable *rt; + struct flowi fl = { + .oif = 0, + .nl_u = { + .ip4_u = { + .daddr = peer_ip, + .saddr = local_ip, + .tos = tos} + }, + .proto = IPPROTO_TCP, + .uli_u = { + .ports = { + .sport = local_port, + .dport = peer_port} + } + }; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + if (ip_route_output_flow(&rt, &fl, NULL, 0)) { + return NULL; + } + return rt; +} + +static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu) +{ + int i = 0; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu) + ++i; + return i; +} + +/* + * XXX need to upcall the connection setup failure somehow! + */ +static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb) +{ + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for an active open. + */ +static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + printk(KERN_ERR MOD "ARP failure during connect\n"); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for a CPL_ABORT_REQ. Change it into a no RST variant + * and send it along. 
+ */ +static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_abort_req *req = cplhdr(skb); + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + req->cmd = CPL_ABORT_NO_RST; + t3c_send(dev, skb); +} + +static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) +{ + struct cpl_close_con_req *req; + struct sk_buff *skb; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(NULL, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_CLOSE_CON)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) +{ + struct cpl_abort_req *req; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(skb, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, abort_arp_failure); + req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_HOST_ABORT_CON_REQ)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); + req->cmd = CPL_ABORT_SEND_RST; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_connect(struct iwch_ep *ep) +{ + struct cpl_act_open_req *req; + struct sk_buff *skb; + u32 opt0h, opt0l, opt2; + unsigned int mtu_idx; + int wscale; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + skb = get_skb(NULL, sizeof(*req), 
GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + return -ENOMEM; + } + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + skb->priority = CPL_PRIORITY_SETUP; + set_arp_failure_handler(skb, act_open_req_arp_failure); + + req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid)); + req->local_port = ep->com.local_addr.sin_port; + req->peer_port = ep->com.remote_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_ip = ep->com.remote_addr.sin_addr.s_addr; + req->opt0h = htonl(opt0h); + req->opt0l = htonl(opt0l); + req->params = 0; + req->opt2 = htonl(opt2); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + + PDBG("%s (%s line %u pd_len %d)\n", __FUNCTION__, __FILE__, __LINE__, ep->plen); + + BUG_ON(skb_cloned(skb)); + + mpalen = sizeof(*mpa) + ep->plen; + if (skb->data + mpalen + sizeof(*req) > skb->end) { + kfree_skb(skb); + skb=alloc_skb(mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + connect_reply_upcall(ep, -ENOMEM); + return; + } + } + skb_trim(skb, 0); + skb_reserve(skb, sizeof(*req)); + skb_put(skb, mpalen); + skb->priority = CPL_PRIORITY_DATA; + mpa = (struct mpa_message *) skb->data; + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key)); + mpa->flags = (crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? 
MPA_MARKERS : 0); + mpa->private_data_size = htons(ep->plen); + mpa->revision = mpa_rev; + + if (ep->plen) { + memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen); + } + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. + */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx)); + req->flags = htonl(F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + start_ep_timer(ep); + state_set(&ep->com, MPA_REQ_SENT); + return; +} + +static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + struct sk_buff *skb; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = MPA_REJECT; + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) { + memcpy(mpa->private_data, pdata, plen); + } + + /* + * Reference the mpa skb again. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(mpalen); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx)); + req->flags = htonl(F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + struct sk_buff *skb; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? MPA_MARKERS : 0); + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) { + memcpy(mpa->private_data, pdata, plen); + } + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx)); + req->flags = htonl(F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + ep->mpa_skb = skb; + state_set(&ep->com, MPA_REP_SENT); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_establish *req = cplhdr(skb); + unsigned int tid = GET_TID(req); + + PDBG("%s (%s line %u) hwtid %d\n", __FUNCTION__, __FILE__, __LINE__, + tid); + + dst_confirm(ep->dst); + + /* setup the hwtid for this connection */ + ep->hwtid = tid; + t3c_insert_tid(ep->com.tdev, &t3c_client, ep, tid); + + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + /* dealloc the atid */ + t3c_free_atid(ep->com.tdev, ep->atid); + + /* start MPA negotiation */ + send_mpa_req(ep, skb); + + return 0; +} + +static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + state_set(&ep->com, ABORTING); + send_abort(ep, skb, GFP_KERNEL); +} + +static void close_complete_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + if (ep->com.cm_id) { + PDBG("close complete delivered ep %p cm_id %p hwtid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void peer_close_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", 
__FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_DISCONNECT; + if (ep->com.cm_id) { + PDBG("peer close delivered ep %p cm_id %p hwtid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static void peer_abort_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + event.status = -ECONNRESET; + if (ep->com.cm_id) { + PDBG("abort delivered ep %p cm_id %p hwtid %d\n", ep, + ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_reply_upcall(struct iwch_ep *ep, int status) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REPLY; + event.status = status; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + + if ((status == 0) || (status == -ECONNREFUSED)) { + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + } + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, ep->hwtid, status); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } + if (status < 0) { + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_request_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REQUEST; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + event.provider_data = ep; + PDBG("%s ep %p tid 
%d\n", __FUNCTION__, ep, ep->hwtid); + if (state_read(&ep->parent_ep->com) != DEAD) + ep->parent_ep->com.cm_id->event_handler( + ep->parent_ep->com.cm_id, + &event); + free_ep(&ep->parent_ep->com); + ep->parent_ep = NULL; +} + +static void established_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_ESTABLISHED; + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static int update_rx_credits(struct iwch_ep *ep, u32 credits) +{ + struct cpl_rx_data_ack *req; + struct sk_buff *skb; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n"); + return 0; + } + + req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid)); + req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1)); + skb->priority = CPL_PRIORITY_ACK; + ep->com.tdev->send(ep->com.tdev, skb); + return credits; +} + +static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + int err; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) { + return; + } + state_set(&ep->com, FPDU_MODE); + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. 
+ */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + err = -EINVAL; + goto err; + } + + /* + * Copy the new data into our accumulation buffer. + */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * If we don't even have the mpa message, then bail. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) { + return; + } + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* Validate MPA header. */ + if (mpa->revision != mpa_rev) { + err = -EPROTO; + goto err; + } + if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) { + err = -EPROTO; + goto err; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + err = -EPROTO; + goto err; + } + + /* + * Fail if plen does not account for the pkt size. + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + err = -EPROTO; + goto err; + } + + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + * We'll continue processing when more data arrives. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) { + return; + } + + if (mpa->flags & MPA_REJECT) { + err = -ECONNREFUSED; + goto err; + } + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. And + * the MPA header is valid. + */ + + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + /* + * Quiesce the TID here. The uP unquiesces the TID as + * part of the rdma_init operation. 
+ */ + err = iwch_quiesce_tid(ep); + if (err) { + goto err; + } + + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD; + + /* bind QP and TID with INIT_WR */ + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + if (!err) { + goto out; + } +err: + abort_connection(ep, skb); +out: + connect_reply_upcall(ep, err); + return; +} + +static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) { + return; + } + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + abort_connection(ep, skb); + return; + } + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Copy the new data into our accumulation buffer. + */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * If we don't even have the mpa message, then bail. + * We'll continue processing when more data arrives. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) { + return; + } + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* + * Validate MPA header. 
+ */ + if (mpa->revision != mpa_rev) { + abort_connection(ep, skb); + return; + } + + if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) { + abort_connection(ep, skb); + return; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + abort_connection(ep, skb); + return; + } + + /* + * Fail if plen does not account for the pkt size. + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + abort_connection(ep, skb); + return; + } + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) { + return; + } + + /* + * If we get here we have accumulated the entire mpa + * start request message including private data. + */ + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + state_set(&ep->com, MPA_REQ_RCVD); + + /* drive upcall */ + connect_request_upcall(ep); + return; +} + +static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_rx_data *hdr = cplhdr(skb); + unsigned int dlen = ntohs(hdr->len); + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + skb_pull(skb, sizeof(*hdr)); + skb_trim(skb, dlen); + + switch (state_read(&ep->com)) { + case MPA_REQ_SENT: + process_mpa_reply(ep, skb); + break; + case MPA_REQ_WAIT: + process_mpa_request(ep, skb); + break; + case MPA_REP_SENT: + break; + default: + printk(KERN_ERR MOD "%s - unexpected streaming data." 
+ " ep %p state %d hwtid %d\n", + __FUNCTION__, ep, state_read(&ep->com), ep->hwtid); + + /* generate some kind of upcall if needed */ + BUG_ON(1); + abort_connection(ep, skb); + break; + } + + /* update RX credits */ + update_rx_credits(ep, dlen); + + return CPL_RET_BUF_DONE; +} + +/* + * Upcall from the adapter indicating data has been transmitted. + * For us it's just the single MPA request or reply. We can now free + * the skb holding the mpa message. + */ +static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_wr_ack *hdr = cplhdr(skb); + unsigned int credits = ntohs(hdr->credits); + enum iwch_qp_attr_mask mask; + + PDBG("%s (%s line %u) credits %d\n", __FUNCTION__, __FILE__, + __LINE__, credits); + + /* XXX remove this once Felix fixes the FW. */ + if (credits == 0) { + return CPL_RET_BUF_DONE; + } + BUG_ON(credits != 1); + BUG_ON(ep->mpa_skb == NULL); + kfree_skb(ep->mpa_skb); + ep->mpa_skb = NULL; + dst_confirm(ep->dst); + if (state_read(&ep->com) == MPA_REP_SENT) { + struct iwch_qp_attributes attrs; + int err; + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + if (err) { + abort_connection(ep, skb); + return 0; + } + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + ep->com.rpl_done = 1; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + } + return CPL_RET_BUF_DONE; +} + +static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + state_set(&ep->com, DEAD); + 
close_complete_upcall(ep); + t3c_remove_tid(ep->com.tdev, ctx, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + free_ep(&ep->com); + return CPL_RET_BUF_DONE; +} + +static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_open_rpl *rpl = cplhdr(skb); + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + t3c_free_atid(ep->com.tdev, ep->atid); + connect_reply_upcall(ep, status2errno(rpl->status)); + free_ep(&ep->com); + return CPL_RET_BUF_DONE; +} + +static int listen_start(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + return -ENOMEM; + } + + req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); + req->local_port = ep->com.local_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_port = 0; + req->peer_ip = 0; + req->peer_netmask = 0; + req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS); + req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10)); + req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); + + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_pass_open_rpl *rpl = cplhdr(skb); + + PDBG("%s (%s line %u) errno %d\n", __FUNCTION__, __FILE__, __LINE__, + status2errno(rpl->status)); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + + return CPL_RET_BUF_DONE; +} + +static int listen_stop(struct 
iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_close_listserv_req *req; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, + void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_close_listserv_rpl *rpl = cplhdr(skb); + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + return CPL_RET_BUF_DONE; +} + +static void accept_cr(struct iwch_ep *ep, u32 peer_ip, struct sk_buff *skb) +{ + struct cpl_pass_accept_rpl *rpl; + unsigned int mtu_idx; + u32 opt0h, opt0l, opt2; + int wscale; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(*rpl)); + skb_get(skb); + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + + rpl = cplhdr(skb); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(opt0h); + rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT); + rpl->opt2 = htonl(opt2); + rpl->rsvd = rpl->opt2; /* 
workaround for HW bug */ + skb->priority = CPL_PRIORITY_SETUP; + l2t_send(ep->com.tdev, skb, ep->l2t); + + return; +} + +static void reject_cr(struct t3cdev *tdev, u32 hwtid, u32 peer_ip, + struct sk_buff *skb) +{ + struct cpl_pass_accept_rpl *rpl; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(*rpl)); + skb_get(skb); + rpl = cplhdr(skb); + skb->priority = CPL_PRIORITY_SETUP; + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(F_TCAM_BYPASS); + rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); + rpl->opt2 = 0; + rpl->rsvd = rpl->opt2; + tdev->send(tdev, skb); +} + +static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *child_ep, *parent_ep = ctx; + struct cpl_pass_accept_req *req = cplhdr(skb); + unsigned int hwtid = GET_TID(req); + struct dst_entry *dst; + struct l2t_entry *l2t; + struct rtable *rt; + struct iff_mac tim; + + PDBG("%s (%s line %u) - hwtid %u\n", __FUNCTION__, __FILE__, __LINE__, + hwtid); + + if (state_read(&parent_ep->com) != LISTEN) { + printk(KERN_ERR "%s - listening ep not in LISTEN\n", + __FUNCTION__); + goto reject; + } + + /* + * Find the netdev for this connection request. 
+ */ + tim.mac_addr = req->dst_mac; + tim.vlan_tag = ntohs(req->vlan_tag); + if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) { + printk(KERN_ERR + "%s bad dst mac %02x %02x %02x %02x %02x %02x\n", + __FUNCTION__, + req->dst_mac[0], + req->dst_mac[1], + req->dst_mac[2], + req->dst_mac[3], + req->dst_mac[4], + req->dst_mac[5]); + goto reject; + } + +#if 0 + if (ip_route_input(skb, req->peer_ip, req->local_ip, + G_PASS_OPEN_TOS(ntohl(req->tos_tid)), tim.dev)) { + + printk(KERN_ERR MOD "%s - failed to find input route\n", + __FUNCTION__); + goto reject; + } + PDBG("%s (%s line %u) - hwtid %u\n", + __FUNCTION__, __FILE__, __LINE__, hwtid); + BUG_TRAP(!skb->dst); + dst_release(skb->dst); + skb->dst = NULL; +#endif + + /* Find output route */ + rt = find_route(tdev, + req->local_ip, + req->peer_ip, + req->local_port, + req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid))); + if (!rt) { + printk(KERN_ERR MOD "%s - failed to find dst entry!\n", + __FUNCTION__); + goto reject; + } + dst = &rt->u.dst; + l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev->if_port); + if (!l2t) { + printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n", + __FUNCTION__); + dst_release(dst); + goto reject; + } + child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL); + if (!child_ep) { + printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n", + __FUNCTION__); + l2t_release(L2DATA(tdev), l2t); + dst_release(dst); + goto reject; + } + state_set(&child_ep->com, CONNECTING); + child_ep->com.tdev = tdev; + child_ep->com.cm_id = NULL; + child_ep->com.local_addr.sin_family = PF_INET; + child_ep->com.local_addr.sin_port = req->local_port; + child_ep->com.local_addr.sin_addr.s_addr = req->local_ip; + child_ep->com.remote_addr.sin_family = PF_INET; + child_ep->com.remote_addr.sin_port = req->peer_port; + child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip; + ep_atomic_inc(&parent_ep->com.refcnt); + child_ep->parent_ep = parent_ep; + child_ep->tos = 
G_PASS_OPEN_TOS(ntohl(req->tos_tid)); + child_ep->l2t = l2t; + child_ep->dst = dst; + child_ep->hwtid = hwtid; + init_timer(&child_ep->timer); + t3c_insert_tid(tdev, &t3c_client, child_ep, hwtid); + accept_cr(child_ep, req->peer_ip, skb); + goto out; +reject: + reject_cr(tdev, hwtid, req->peer_ip, skb); +out: + return CPL_RET_BUF_DONE; +} + +static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_pass_establish *req = cplhdr(skb); + + PDBG("%s (%s line %u) ep %p\n", __FUNCTION__, __FILE__, __LINE__, ep); + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + dst_confirm(ep->dst); + state_set(&ep->com, MPA_REQ_WAIT); + start_ep_timer(ep); + + return CPL_RET_BUF_DONE; +} + +static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + int ret; + int abort = 0; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + dst_confirm(ep->dst); + switch (state_read(&ep->com)) { + case MPA_REQ_WAIT: + state_set(&ep->com, CLOSING); + break; + case MPA_REQ_SENT: + state_set(&ep->com, CLOSING); + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. 
+ */ + state_set(&ep->com, CLOSING); + ep_atomic_inc(&ep->com.refcnt); + break; + case MPA_REP_SENT: + state_set(&ep->com, CLOSING); + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case FPDU_MODE: + state_set(&ep->com, CLOSING); + peer_close_upcall(ep); + attrs.next_state = IWCH_QP_STATE_CLOSING; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) { + printk(KERN_ERR MOD "%s - qp <- closing err!\n", + __FUNCTION__); + abort = 1; + } + break; + case ABORTING: + goto out; + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + goto out; + case MORIBUND: + stop_ep_timer(ep); + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + state_set(&ep->com, DEAD); + close_complete_upcall(ep); + t3c_remove_tid(ep->com.tdev, ctx, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + free_ep(&ep->com); + goto out; + case DEAD: + goto out; + default: + BUG_ON(1); + } + iwch_ep_disconnect(ep, abort, GFP_KERNEL); +out: + return CPL_RET_BUF_DONE; +} + +/* + * Returns whether an ABORT_REQ_RSS message is a negative advice. 
+ */ +static inline int is_neg_adv_abort(unsigned int status) +{ + return status == CPL_ERR_RTX_NEG_ADVICE || + status == CPL_ERR_PERSIST_NEG_ADVICE; +} + +static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_abort_req_rss *req = cplhdr(skb); + struct iwch_ep *ep = ctx; + struct cpl_abort_rpl *rpl; + struct sk_buff *rpl_skb; + struct iwch_qp_attributes attrs; + int ret; + int state; + + if (is_neg_adv_abort(req->status)) { + PDBG("%s neg_adv_abort ep %p hwtid %d\n", __FUNCTION__, ep, + ep->hwtid); + t3_l2t_send_event(ep->com.tdev, ep->l2t); + return CPL_RET_BUF_DONE; + } + + state = state_read(&ep->com); + PDBG("%s (%s line %u) ep %p state %u\n", __FUNCTION__, __FILE__, + __LINE__, ep, state); + switch (state) { + case CONNECTING: + break; + case MPA_REQ_WAIT: + break; + case MPA_REQ_SENT: + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REP_SENT: + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. 
+ */ + ep_atomic_inc(&ep->com.refcnt); + break; + case MORIBUND: + stop_ep_timer(ep); + case FPDU_MODE: + case CLOSING: + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) { + printk(KERN_ERR MOD + "%s - qp <- error failed!\n", + __FUNCTION__); + } + } + peer_abort_upcall(ep); + break; + case ABORTING: + break; + case DEAD: + PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__); + return CPL_RET_BUF_DONE; + default: + BUG_ON(1); + break; + } + dst_confirm(ep->dst); + + rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL); + if (!rpl_skb) { + printk(KERN_ERR MOD "%s - cannot allocate skb!\n", + __FUNCTION__); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + free_ep(&ep->com); + return CPL_RET_BUF_DONE; + } + rpl_skb->priority = CPL_PRIORITY_DATA; + rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl)); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_HOST_ABORT_CON_RPL)); + rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid)); + rpl->cmd = CPL_ABORT_NO_RST; + ep->com.tdev->send(ep->com.tdev, rpl_skb); + if (state != ABORTING) { + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + t3c_remove_tid(ep->com.tdev, ctx, ep->hwtid); + state_set(&ep->com, DEAD); + free_ep(&ep->com); + } + return CPL_RET_BUF_DONE; +} + +static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + BUG_ON(!ep); + + /* The cm_id may be null if we failed to connect */ + switch (state_read(&ep->com)) { + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + break; + case MORIBUND: + stop_ep_timer(ep); + if ((ep->com.cm_id) && (ep->com.qp)) { + attrs.next_state = IWCH_QP_STATE_IDLE; + 
iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, + IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + state_set(&ep->com, DEAD); + close_complete_upcall(ep); + t3c_remove_tid(ep->com.tdev, ctx, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + free_ep(&ep->com); + break; + case DEAD: + default: + BUG_ON(1); + break; + } + + return CPL_RET_BUF_DONE; +} + +/* + * T3A does 3 things when a TERM is received: + * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet + * 2) generate an async event on the QP with the TERMINATE opcode + * 3) post a TERMINATE opcode cqe into the associated CQ. + * + * For (1), we save the message in the qp for later consumption by the + * consumer. + * For (2), we move the QP into TERMINATE, post a QP event and disconnect. + * For (3), we toss the CQE in cxio_poll_cq(). + * + * terminate() handles case (1)... + */ +static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb_pull(skb, sizeof(struct cpl_rdma_terminate)); + PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len); + memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len); + ep->com.qp->attr.terminate_msg_len = skb->len; + ep->com.qp->attr.is_terminate_local = 0; + return CPL_RET_BUF_DONE; +} + +static void ep_timeout(unsigned long arg) +{ + struct iwch_ep *ep = (struct iwch_ep *)arg; + struct iwch_qp_attributes attrs; + + PDBG("%s enter (%s line %u) ep %p hwtid %d\n", __FUNCTION__, __FILE__, + __LINE__, ep, ep->hwtid); + if (state_comp_exch(&ep->com, MPA_REQ_SENT, CLOSING)) { + struct sk_buff *skb; + + connect_reply_upcall(ep, -ETIMEDOUT); + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) { + abort_connection(ep, skb); + } + } + if (state_comp_exch(&ep->com, MPA_REQ_WAIT, CLOSING)) { + struct sk_buff *skb; + + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) { + abort_connection(ep, 
skb); + } + } + if (state_comp_exch(&ep->com, MORIBUND, ABORTING)) { + struct sk_buff *skb; + + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) { + abort_connection(ep, skb); + } + } + free_ep(&ep->com); +} + +int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + PDBG("%s enter (%s line %u) ep %p hwtid %d\n", __FUNCTION__, __FILE__, + __LINE__, ep, ep->hwtid); + + if (state_read(&ep->com) == DEAD) { + free_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + state_set(&ep->com, CLOSING); + if (mpa_rev == 0) { + abort_connection(ep, NULL); + } else { + err = send_mpa_reject(ep, pdata, pdata_len); + err = send_halfclose(ep, GFP_KERNEL); + } + return 0; +} + +int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + struct iwch_dev *h = to_iwch_dev(cm_id->device); + + PDBG("%s enter (%s line %u) ep %p hwtid %d\n", __FUNCTION__, __FILE__, + __LINE__, ep, ep->hwtid); + + if (state_read(&ep->com) == DEAD) { + free_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = get_qhp(h, conn_param->qpn); + BUG_ON(!ep->com.qp); + + /* + * Quiesce the TID here. The uP unquiesces the TID as + * part of the rdma_init operation. 
+ */ + err = iwch_quiesce_tid(ep); + if (err) { + abort_connection(ep, NULL); + return err; + } + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + PDBG("%s %d ird %d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord); + ep_atomic_inc(&ep->com.refcnt); + err = send_mpa_reply(ep, conn_param->private_data, + conn_param->private_data_len); + if (err) { + free_ep(&ep->com); + abort_connection(ep, NULL); + return err; + } + + /* wait until the MPA is transmitted. */ + PDBG("sleeping on ep %p\n", ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + PDBG("awakened on ep %p\n", ep); + + err = ep->com.rpl_err; + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + } + free_ep(&ep->com); + return err; +} + +int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_ep *ep; + struct rtable *rt; + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto out; + } + PDBG("%s ep %p\n", __FUNCTION__, ep); + init_timer(&ep->timer); + ep->plen = conn_param->private_data_len; + if (ep->plen) { + memcpy(ep->mpa_pkt + sizeof(struct mpa_message), + conn_param->private_data, ep->plen); + } + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + ep->com.tdev = h->rdev.t3cdev_p; + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = get_qhp(h, conn_param->qpn); + BUG_ON(!ep->com.qp); + + /* + * XXX. + */ + if (get_port(&cm_id->local_addr.sin_port)) { + err = -EADDRINUSE; + goto fail1; + } + + /* + * Allocate an active TID to initiate a TCP connection. + */ + ep->atid = t3c_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->atid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + /* find a route */ + /* XXX Shouldn't need this. 
IWCM should pass down dst entry ptr */ + rt = find_route(h->rdev.t3cdev_p, + cm_id->local_addr.sin_addr.s_addr, + cm_id->remote_addr.sin_addr.s_addr, + cm_id->local_addr.sin_port, + cm_id->remote_addr.sin_port, IPTOS_LOWDELAY); + if (!rt) { + printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__); + err = -EHOSTUNREACH; + goto fail3; + } + ep->dst = &rt->u.dst; + + /* get a l2t entry */ + ep->l2t = t3_l2t_get(ep->com.tdev, + ep->dst->neighbour, + ep->dst->neighbour->dev->if_port); + if (!ep->l2t) { + printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__); + err = -ENOMEM; + goto fail4; + } + + state_set(&ep->com, CONNECTING); + ep->tos = IPTOS_LOWDELAY; /* XXX */ + ep->com.local_addr = cm_id->local_addr; + ep->com.remote_addr = cm_id->remote_addr; + + /* send connect request to rnic */ + err = send_connect(ep); + if (!err) { + goto out; + } + + l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t); +fail4: + dst_release(ep->dst); +fail3: + t3c_free_atid(ep->com.tdev, ep->atid); +fail2: + free_port(cm_id->local_addr.sin_port); +fail1: + free_ep(&ep->com); +out: + return err; +} + +int iwch_create_listen(struct iw_cm_id *cm_id, int backlog) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_listen_ep *ep; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + might_sleep(); + + if (get_port(&cm_id->local_addr.sin_port)) { + err = -EADDRINUSE; + goto out; + } + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto fail1; + } + ep->com.tdev = h->rdev.t3cdev_p; + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->backlog = backlog; + ep->com.local_addr = cm_id->local_addr; + + /* + * Allocate a server TID. 
+ */ + ep->stid = t3c_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->stid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + state_set(&ep->com, LISTEN); + err = listen_start(ep); + if (err) { + goto fail3; + } + + /* wait for pass_open_rpl */ + wait_event(ep->com.waitq, ep->com.rpl_done); + err = ep->com.rpl_err; + if (!err) { + cm_id->provider_data = ep; + goto out; + } +fail3: + t3c_free_stid(ep->com.tdev, ep->stid); +fail2: + free_ep(&ep->com); +fail1: + free_port(cm_id->local_addr.sin_port); +out: + return err; +} + +int iwch_destroy_listen(struct iw_cm_id *cm_id) +{ + int err; + struct iwch_listen_ep *ep = to_listen_ep(cm_id); + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + might_sleep(); + state_set(&ep->com, DEAD); + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + err = listen_stop(ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + t3c_free_stid(ep->com.tdev, ep->stid); + err = ep->com.rpl_err; + cm_id->rem_ref(cm_id); + free_ep(&ep->com); + return err; +} + +int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) +{ + int ret = 0; + int state; + + + state = state_read(&ep->com); + PDBG("%s enter (%s line %u) ep %p state %s, abrupt %d\n", + __FUNCTION__, __FILE__, __LINE__, ep, states[state], abrupt); + if (state == DEAD) { + PDBG("%s already dead ep %p\n", __FUNCTION__, ep); + return 0; + } + if (abrupt) { + if (state != ABORTING) { + state_set(&ep->com, ABORTING); + ret = send_abort(ep, NULL, gfp); + } + } else { + + if (state != CLOSING) { + state_set(&ep->com, CLOSING); + } else { + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + } + + ret = send_halfclose(ep, gfp); + } + return ret; +} + +int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, + struct l2t_entry *l2t) +{ + struct iwch_ep *ep = ctx; + + if (ep->dst != old) + return 0; + + PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new, l2t); + 
dst_hold(new); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + ep->l2t = l2t; + dst_release(old); + ep->dst = new; + return 1; +} + +/* + * All the CM events are handled on a work queue to have a safe context. + */ +static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep_common *epc = ctx; + + ep_atomic_inc(&epc->refcnt); + + /* + * Save ctx and tdev in the skb->cb area. + */ + *((void **) skb->cb) = ctx; + *((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev; + + /* + * Queue the skb and schedule the worker thread. + */ + skb_queue_tail(&rxq, skb); + queue_work(workq, &skb_work); + return 0; +} + +int __init iwch_cm_init(void) +{ + skb_queue_head_init(&rxq); + + workq = create_singlethread_workqueue("iw_cxgb3"); + if (!workq) + return -ENOMEM; + + /* + * All upcalls from the T3 Core go to sched() to + * schedule the processing on a work queue. + */ + t3c_handlers[CPL_ACT_ESTABLISH] = sched; + t3c_handlers[CPL_ACT_OPEN_RPL] = sched; + t3c_handlers[CPL_RX_DATA] = sched; + t3c_handlers[CPL_TX_DMA_ACK] = sched; + t3c_handlers[CPL_ABORT_RPL_RSS] = sched; + t3c_handlers[CPL_ABORT_RPL] = sched; + t3c_handlers[CPL_PASS_OPEN_RPL] = sched; + t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched; + t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched; + t3c_handlers[CPL_PASS_ESTABLISH] = sched; + t3c_handlers[CPL_PEER_CLOSE] = sched; + t3c_handlers[CPL_CLOSE_CON_RPL] = sched; + t3c_handlers[CPL_ABORT_REQ_RSS] = sched; + t3c_handlers[CPL_RDMA_TERMINATE] = sched; + + /* + * These are the real handlers that are called from a + * work queue. 
+ */ + work_handlers[CPL_ACT_ESTABLISH] = act_establish; + work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl; + work_handlers[CPL_RX_DATA] = rx_data; + work_handlers[CPL_TX_DMA_ACK] = tx_ack; + work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl; + work_handlers[CPL_ABORT_RPL] = abort_rpl; + work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl; + work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl; + work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req; + work_handlers[CPL_PASS_ESTABLISH] = pass_establish; + work_handlers[CPL_PEER_CLOSE] = peer_close; + work_handlers[CPL_ABORT_REQ_RSS] = peer_abort; + work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl; + work_handlers[CPL_RDMA_TERMINATE] = terminate; + return 0; +} + +void __exit iwch_cm_term(void) +{ + flush_workqueue(workq); + destroy_workqueue(workq); +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h new file mode 100644 index 0000000..0e26352 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -0,0 +1,232 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _IWCH_CM_H_ +#define _IWCH_CM_H_ + +#include +#include +#include +#include +#include "iwch_provider.h" +#include + +#include + +#define MPA_KEY_REQ "MPA ID Req Frame" +#define MPA_KEY_REP "MPA ID Rep Frame" + +#define MPA_MAX_PRIVATE_DATA 256 +#define MPA_REV 0 /* XXX - amso1100 uses rev 0 ! */ +#define MPA_REJECT 0x20 +#define MPA_CRC 0x40 +#define MPA_MARKERS 0x80 +#define MPA_FLAGS_MASK 0xE0 + +#define free_ep(A) { \ + PDBG("%s %d: Calling __free_ep\n",__FUNCTION__, __LINE__); \ + __free_ep(A); \ +} + +#define ep_atomic_inc(A) { \ + PDBG("%s enter (%s line %u) A %p, refcnt %d\n", \ + __FUNCTION__, __FILE__, \ + __LINE__, A, \ + atomic_read(A)); \ + atomic_inc(A); \ +} + +struct mpa_message { + u8 key[16]; + u8 flags; + u8 revision; + u16 private_data_size; + u8 private_data[0]; +}; + +struct terminate_message { + u8 layer_etype; + u8 ecode; + u16 hdrct_rsvd; + u8 len_hdrs[0]; +}; + +#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28) + +enum iwch_term_layers { + LAYER_RDMAP = 0x00, + LAYER_DDP = 0x10, + LAYER_MPA = 0x20, +}; + +enum iwch_rdma_etypes { + RDMAP_LOCAL_CATA = 0x00, + RDMAP_REMOTE_PROT = 0x01, + RDMAP_REMOTE_OP = 0x02, +}; + +enum iwch_rdma_ecodes { + RDMAP_INV_STAG = 0x00, + RDMAP_BASE_BOUNDS = 0x01, + RDMAP_ACC_VIOL = 0x02, + RDMAP_STAG_NOT_ASSOC = 0x03, + RDMAP_TO_WRAP = 0x04, + RDMAP_INV_VERS = 0x05, + RDMAP_INV_OPCODE = 0x06, + RDMAP_STREAM_CATA = 0x07, + RDMAP_GLOBAL_CATA = 0x08, + RDMAP_CANT_INV_STAG = 0x09, + 
RDMAP_UNSPECIFIED = 0xff +}; + +enum iwch_ddp_etypes { + DDP_LOCAL_CATA = 0x00, + DDP_TAGGED_ERR = 0x01, + DDP_UNTAGGED_ERR = 0x02, + DDP_LLP = 0x03 +}; + +enum iwch_ddp_tagged_ecodes { + DDPT_INV_STAG = 0x00, + DDPT_BASE_BOUNDS = 0x01, + DDPT_STAG_NOT_ASSOC = 0x02, + DDPT_TO_WRAP = 0x03, + DDPT_INV_VERS = 0x04, +}; + +enum iwch_ddp_utagged_ecodes { + DDPU_INV_QN = 0x01, + DDPU_INV_MSN_NOBUF = 0x02, + DDPU_INV_MSN_RANGE = 0x03, + DDPU_INV_MO = 0x04, + DDPU_MSG_TOOBIG = 0x05, + DDPU_INV_VERS = 0x06 +}; + +enum iwch_mpa_ecodes { + MPA_CRC_ERR = 0x02, + MPA_MARKER_ERR = 0x03 +}; + + +enum iwch_ep_state { + IDLE = 0, + LISTEN, + CONNECTING, + MPA_REQ_WAIT, + MPA_REQ_SENT, + MPA_REQ_RCVD, + MPA_REP_SENT, + FPDU_MODE, + ABORTING, + CLOSING, + MORIBUND, + DEAD, +}; + +struct iwch_ep_common { + struct iw_cm_id *cm_id; + struct iwch_qp *qp; + struct t3cdev *tdev; + enum iwch_ep_state state; + atomic_t refcnt; + spinlock_t lock; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + wait_queue_head_t waitq; + int rpl_done; + int rpl_err; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; + unsigned int stid; + int backlog; +}; + +struct iwch_ep { + struct iwch_ep_common com; + struct iwch_ep *parent_ep; + struct timer_list timer; + unsigned int atid; + u32 hwtid; + u32 snd_seq; + struct l2t_entry *l2t; + struct dst_entry *dst; + struct sk_buff *mpa_skb; + struct iwch_mpa_attributes mpa_attr; + unsigned int mpa_pkt_len; + u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA]; + u8 tos; + u16 emss; + u16 plen; + u32 ird; + u32 ord; +}; + +static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_ep *)cm_id->provider_data; +} + +static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_listen_ep *)cm_id->provider_data; +} + +static inline int compute_wscale(int win) +{ + int wscale = 0; + + while (wscale < 14 && (65535 << wscale) < win) + wscale++; + return wscale; +} + +#endif /* _IWCH_CM_H_ */ References: <20060623142924.32410.7623.stgit@stevo-desktop>
Message-ID: <20060623143015.32410.11151.stgit@stevo-desktop> This patch contains device discovery and registration for the cxgb3 "core" module. The cxgb3 core module provides TCP connection management services. This module is needed to support multiple ULPs using the cxgb3 device for managing TCP connections. The OpenIB driver uses it to allocate and set up iWARP LLP connections and to pass data in streaming mode (for MPA negotiation) before going into RDMA mode and associating the LLP stream with an iWARP QP. It is separated from the LLD/NETDEV driver because it is not needed for a dumb-NIC-only installation. There will be other ULPs that use this interface. This patch also contains the first-level event handler functions that process L2/L3 events obtained via the Network Event Notifier mechanism. --- drivers/infiniband/hw/cxgb3/t3c/defs.h | 100 +++++ drivers/infiniband/hw/cxgb3/t3c/t3cdev.c | 570 ++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/t3c/tcb.h | 378 ++++++++++++++++++++ 3 files changed, 1048 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/t3c/defs.h b/drivers/infiniband/hw/cxgb3/t3c/defs.h new file mode 100644 index 0000000..3f9b9d3 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/defs.h @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _CHELSIO_DEFS_H +#define _CHELSIO_DEFS_H + +#include +#include + +#include + +#include "t3c.h" + +#define VALIDATE_TID 1 + +void *t3_alloc_mem(unsigned long size); +void t3_free_mem(void *addr); +void t3c_neigh_update(struct neighbour *neigh, int flags); +void t3c_redirect(struct dst_entry *old, struct dst_entry *new); + +/* + * Map an ATID or STID to their entries in the corresponding TID tables. + */ +static inline union active_open_entry *atid2entry(const struct tid_info *t, + unsigned int atid) +{ + return &t->atid_tab[atid - t->atid_base]; +} + + +static inline union listen_entry *stid2entry(const struct tid_info *t, + unsigned int stid) +{ + return &t->stid_tab[stid - t->stid_base]; +} + +/* + * Find the socket corresponding to a TID. + */ +static inline struct t3c_tid_entry *lookup_tid(const struct tid_info *t, + unsigned int tid) +{ + return tid < t->ntids ? &(t->tid_tab[tid]) : NULL; +} + +/* + * Find the socket corresponding to a server TID. + */ +static inline struct t3c_tid_entry *lookup_stid(const struct tid_info *t, + unsigned int tid) +{ + if (tid < t->stid_base || tid >= t->stid_base + t->nstids) + return NULL; + return &(stid2entry(t, tid)->t3c_tid); +} + +/* + * Find the socket corresponding to an active-open TID. 
+ */ +static inline struct t3c_tid_entry *lookup_atid(const struct tid_info *t, + unsigned int tid) +{ + if (tid < t->atid_base || tid >= t->atid_base + t->natids) + return NULL; + return &(atid2entry(t, tid)->t3c_tid); +} + +int process_rx(struct t3cdev *dev, struct sk_buff **skbs, int n); +int attach_t3cdev(struct t3cdev *dev); +void detach_t3cdev(struct t3cdev *dev); +#endif diff --git a/drivers/infiniband/hw/cxgb3/t3c/t3cdev.c b/drivers/infiniband/hw/cxgb3/t3c/t3cdev.c new file mode 100644 index 0000000..bec4d45 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/t3cdev.c @@ -0,0 +1,570 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "l2t.h" +#include "defs.h" +#include "t3cdev.h" +#include "firmware_exports.h" + +DEFINE_MUTEX(t3cdev_db_lock); +LIST_HEAD(t3cdev_list); + +static const unsigned int MAX_ATIDS = 64 * 1024; +static const unsigned int ATID_BASE = 0x100000; + +#ifdef CONFIG_PROC_FS +#include + +static struct proc_dir_entry *t3cdev_proc_root; + +static int devices_read_proc(char *buf, char **start, off_t offset, + int length, int *eof, void *data) +{ + int len; + struct t3cdev *dev; + struct net_device *ndev; + + len = sprintf(buf, "Device Interfaces\n"); + + mutex_lock(&t3cdev_db_lock); + list_for_each_entry(dev, &t3cdev_list, t3c_list) { + len += sprintf(buf + len, "%-16s", dev->name); + read_lock(&dev_base_lock); + for (ndev = dev_base; ndev; ndev = ndev->next) { + if (T3CDEV(ndev) == dev) + len += sprintf(buf + len, " %s", ndev->name); + } + read_unlock(&dev_base_lock); + len += sprintf(buf + len, "\n"); + if (len >= length) + break; + } + mutex_unlock(&t3cdev_db_lock); + + if (len > length) + len = length; + *eof = 1; + return len; +} + +static void t3c_proc_cleanup(void) +{ + remove_proc_entry("devices", t3cdev_proc_root); + remove_proc_entry("net/cxgb3c", NULL); + t3cdev_proc_root = NULL; +} + +static struct proc_dir_entry *create_t3c_proc_dir(const char *name) +{ + struct proc_dir_entry *d; + + if (!t3cdev_proc_root) + return NULL; + + d = proc_mkdir(name, t3cdev_proc_root); + if (d) + d->owner = THIS_MODULE; + return d; +} + +static void delete_t3c_proc_dir(struct t3cdev *dev) +{ + if (dev->proc_dir) { + remove_proc_entry(dev->name, t3cdev_proc_root); + dev->proc_dir = NULL; + } 
+} + +static int __init t3c_proc_init(void) +{ + struct proc_dir_entry *d; + + t3cdev_proc_root = proc_mkdir("net/cxgb3c", NULL); + if (!t3cdev_proc_root) + return -ENOMEM; + t3cdev_proc_root->owner = THIS_MODULE; + + d = create_proc_read_entry("devices", 0, t3cdev_proc_root, + devices_read_proc, NULL); + if (!d) + goto cleanup; + d->owner = THIS_MODULE; + return 0; + +cleanup: + t3c_proc_cleanup(); + return -ENOMEM; +} +#else +#define t3c_proc_init() 0 +#define create_t3c_proc_dir(name) NULL +#define delete_t3c_proc_dir(dev) +#endif /* CONFIG_PROC_FS */ + +/* + * Unregister a T3C device and remove it from the device list. + */ +void unregister_t3cdev(struct t3cdev *dev) +{ + mutex_lock(&t3cdev_db_lock); + list_del(&dev->t3c_list); + delete_t3c_proc_dir(dev); + mutex_unlock(&t3cdev_db_lock); + return; +} + +/* + * Register a T3C device and try to attach an appropriate TCP offload module + * to it. 'name' is a template that may contain at most one %d format + * specifier. + */ +void register_t3cdev(struct t3cdev *dev, const char *name) +{ + static int unit; + + mutex_lock(&t3cdev_db_lock); + snprintf(dev->name, sizeof(dev->name), name, unit++); + dev->proc_dir = create_t3c_proc_dir(dev->name); + list_add_tail(&dev->t3c_list, &t3cdev_list); + mutex_unlock(&t3cdev_db_lock); + return; +} + +/* + * Sends an sk_buff to a T3C driver after dealing with any active network taps.
+ */ +int t3c_send(struct t3cdev *dev, struct sk_buff *skb) +{ + int r; + + local_bh_disable(); + r = dev->send(dev, skb); + local_bh_enable(); + return r; +} +EXPORT_SYMBOL(t3c_send); + +void t3c_neigh_update(struct neighbour *neigh, int flags) +{ + struct net_device *dev = neigh->dev; + + if (dev && (dev->features & NETIF_F_TCPIP_OFFLOAD)) { + struct t3cdev *tdev = T3CDEV(dev); + + BUG_ON(!tdev); + t3_l2t_update(tdev, neigh, flags, neigh->dev); + } +} + +static void set_l2t_ix(struct t3cdev *tdev, u32 tid, struct l2t_entry *e) +{ + struct sk_buff *skb; + struct cpl_set_tcb_field *req; + + skb = alloc_skb(sizeof(*req), GFP_ATOMIC); + if (!skb) { + printk(KERN_ERR "%s: cannot allocate skb!\n", __FUNCTION__); + return; + } + skb->priority = CPL_PRIORITY_CONTROL; + req = (struct cpl_set_tcb_field *)skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, tid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_L2T_IX); + req->mask = cpu_to_be64(V_TCB_L2T_IX(M_TCB_L2T_IX)); + req->val = cpu_to_be64(V_TCB_L2T_IX(e->idx)); + tdev->send(tdev, skb); +} + +void t3c_redirect(struct dst_entry *old, struct dst_entry *new) +{ + struct net_device *olddev, *newdev; + struct tid_info *ti; + struct t3cdev *tdev; + u32 tid; + int update_tcb; + struct l2t_entry *e; + struct t3c_tid_entry *te; + + olddev = old->neighbour->dev; + newdev = new->neighbour->dev; + if (!(olddev->features & NETIF_F_TCPIP_OFFLOAD)) + return; + if (!(newdev->features & NETIF_F_TCPIP_OFFLOAD)) { + printk(KERN_WARNING "%s: Redirect to non-offload " + "device ignored.\n", __FUNCTION__); + return; + } + tdev = T3CDEV(olddev); + BUG_ON(!tdev); + if (tdev != T3CDEV(newdev)) { + printk(KERN_WARNING "%s: Redirect to different " + "offload device ignored.\n", __FUNCTION__); + return; + } + + /* Add new L2T entry */ + e = t3_l2t_get(tdev, new->neighbour, new->neighbour->dev->if_port); + if (!e) { + printk(KERN_ERR "%s:
couldn't allocate new l2t entry!\n", + __FUNCTION__); + return; + } + + /* Walk tid table and notify clients of dst change. */ + ti = &(T3C_DATA(tdev))->tid_maps; + for (tid=0; tid < ti->ntids; tid++) { + te = lookup_tid(ti, tid); + BUG_ON(!te); + if (te->ctx && te->client && te->client->redirect) { + update_tcb = te->client->redirect(te->ctx, old, new, e); + if (update_tcb) { + l2t_hold(L2DATA(tdev), e); + set_l2t_ix(tdev, tid, e); + } + } + } + l2t_release(L2DATA(tdev), e); +} + +/* + * Allocate a chunk of memory using kmalloc or, if that fails, vmalloc. + * The allocated memory is cleared. + */ +void *t3_alloc_mem(unsigned long size) +{ + void *p = kmalloc(size, GFP_KERNEL); + + if (!p) + p = vmalloc(size); + if (p) + memset(p, 0, size); + return p; +} + +/* + * Free memory allocated through t3_alloc_mem(). + */ +void t3_free_mem(void *addr) +{ + unsigned long p = (unsigned long) addr; + + if (p >= VMALLOC_START && p < VMALLOC_END) + vfree(addr); + else + kfree(addr); +} + +/* + * Allocate and initialize the TID tables. Returns 0 on success. + */ +static int init_tid_tabs(struct tid_info *t, unsigned int ntids, + unsigned int natids, unsigned int nstids, + unsigned int atid_base, unsigned int stid_base) +{ + unsigned long size = ntids * sizeof(*t->tid_tab) + + natids * sizeof(*t->atid_tab) + nstids * sizeof(*t->stid_tab); + + t->tid_tab = t3_alloc_mem(size); + if (!t->tid_tab) + return -ENOMEM; + + t->stid_tab = (union listen_entry *)&t->tid_tab[ntids]; + t->atid_tab = (union active_open_entry *)&t->stid_tab[nstids]; + t->ntids = ntids; + t->nstids = nstids; + t->stid_base = stid_base; + t->sfree = NULL; + t->natids = natids; + t->atid_base = atid_base; + t->afree = NULL; + t->stids_in_use = t->atids_in_use = 0; + atomic_set(&t->tids_in_use, 0); + spin_lock_init(&t->stid_lock); + spin_lock_init(&t->atid_lock); + + /* + * Setup the free lists for stid_tab and atid_tab. 
+ */ + if (nstids) { + while (--nstids) + t->stid_tab[nstids - 1].next = &t->stid_tab[nstids]; + t->sfree = t->stid_tab; + } + if (natids) { + while (--natids) + t->atid_tab[natids - 1].next = &t->atid_tab[natids]; + t->afree = t->atid_tab; + } + return 0; +} + +static void free_tid_maps(struct tid_info *t) +{ + t3_free_mem(t->tid_tab); +} + +/* + * Process a received packet with an unknown/unexpected CPL opcode. + */ +static int do_bad_cpl(struct t3cdev *dev, struct sk_buff *skb) +{ + printk(KERN_ERR "%s: received bad CPL command 0x%x\n", dev->name, + *skb->data); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; +} + +/* + * Handlers for each CPL opcode + */ +static cpl_handler_func cpl_handlers[NUM_CPL_CMDS]; + +/* + * Add a new handler to the CPL dispatch table. A NULL handler may be supplied + * to unregister an existing handler. + */ +void t3_register_cpl_handler(unsigned int opcode, cpl_handler_func h) +{ + if (opcode < NUM_CPL_CMDS) + cpl_handlers[opcode] = h ? h : do_bad_cpl; + else + printk(KERN_ERR "T3C: handler registration for " + "opcode %x failed\n", opcode); +} +EXPORT_SYMBOL(t3_register_cpl_handler); + +/* + * T3CDEV's receive method. 
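The CPL handler table above (every slot initialized to the catch-all `do_bad_cpl`, with `t3_register_cpl_handler()` swapping real handlers in and NULL restoring the catch-all) can be sketched as standalone user-space C. All names below (`bad_op`, `register_handler`, `dispatch`, `echo_op`) are hypothetical stand-ins for the kernel code; sk_buffs and CPL structs are reduced to a bare opcode.

```c
#include <stdio.h>

#define NUM_OPS 16

typedef int (*op_handler)(unsigned int opcode);

/* Catch-all for unregistered opcodes, in the spirit of do_bad_cpl(). */
static int bad_op(unsigned int opcode)
{
	fprintf(stderr, "received bad opcode 0x%x\n", opcode);
	return -1;
}

static op_handler handlers[NUM_OPS];

/* Swap a handler in or out; a NULL handler restores the catch-all,
 * mirroring t3_register_cpl_handler(). */
static void register_handler(unsigned int opcode, op_handler h)
{
	if (opcode < NUM_OPS)
		handlers[opcode] = h ? h : bad_op;
}

/* Dispatch one message; never-registered slots fall back to the catch-all. */
static int dispatch(unsigned int opcode)
{
	op_handler h = (opcode < NUM_OPS && handlers[opcode]) ? handlers[opcode]
							      : bad_op;
	return h(opcode);
}

static int echo_op(unsigned int opcode)
{
	return (int)opcode;	/* trivially "handle" the message */
}
```

The point of the table is that the hot receive path (`process_rx` below) does a single indexed call with no opcode switch, and unknown opcodes are still funneled through one well-defined error path.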
+ */ +int process_rx(struct t3cdev *dev, struct sk_buff **skbs, int n) +{ + while (n--) { + struct sk_buff *skb = *skbs++; + unsigned int opcode = G_OPCODE(ntohl(skb->csum)); + int ret = cpl_handlers[opcode] (dev, skb); + +#if VALIDATE_TID + if (ret & CPL_RET_UNKNOWN_TID) { + union opcode_tid *p = cplhdr(skb); + + printk(KERN_ERR "%s: CPL message (opcode %u) had " + "unknown TID %u\n", dev->name, opcode, + G_TID(ntohl(p->opcode_tid))); + } +#endif + if (ret & CPL_RET_BUF_DONE) + kfree_skb(skb); + } + return 0; +} + +#ifdef CONFIG_PROC_FS +#include + +static int t3cdev_info_read_proc(char *buf, char **start, off_t offset, + int length, int *eof, void *data) +{ + struct t3c_data *d = data; + struct tid_info *t = &d->tid_maps; + int len; + + len = sprintf(buf, "TID range: 0..%d, in use: %u\n" + "STID range: %d..%d, in use: %u\n" + "ATID range: %d..%d, in use: %u\n" + "MSS: %u\n", + t->ntids - 1, atomic_read(&t->tids_in_use), t->stid_base, + t->stid_base + t->nstids - 1, t->stids_in_use, + t->atid_base, t->atid_base + t->natids - 1, + t->atids_in_use, d->tx_max_chunk); + if (len > length) + len = length; + *eof = 1; + return len; +} + +static int t3cdev_info_proc_setup(struct proc_dir_entry *dir, + struct t3c_data *d) +{ + struct proc_dir_entry *p; + + if (!dir) + return -EINVAL; + + p = create_proc_read_entry("info", 0, dir, t3cdev_info_read_proc, d); + if (!p) + return -ENOMEM; + + p->owner = THIS_MODULE; + return 0; +} + +static void t3cdev_proc_init(struct t3cdev *dev) +{ + t3_l2t_proc_setup(dev->proc_dir, L2DATA(dev)); + t3cdev_info_proc_setup(dev->proc_dir, T3C_DATA(dev)); +} + +static void t3cdev_info_proc_free(struct proc_dir_entry *dir) +{ + if (dir) + remove_proc_entry("info", dir); +} + +static void t3cdev_proc_cleanup(struct t3cdev *dev) +{ + t3_l2t_proc_free(dev->proc_dir); + t3cdev_info_proc_free(dev->proc_dir); +} + +#else +#define t3cdev_proc_init(dev) +#define t3cdev_proc_cleanup(dev) +#endif + +void detach_t3cdev(struct t3cdev *dev) +{ + struct 
t3c_data *t = T3C_DATA(dev); + t3cdev_proc_cleanup(dev); + dev->close(dev); + free_tid_maps(&t->tid_maps); + dev->recv = NULL; + dev->neigh_update = NULL; + T3C_DATA(dev) = NULL; + t3_free_l2t(L2DATA(dev)); + L2DATA(dev) = NULL; + kfree(t); +} + +int attach_t3cdev(struct t3cdev *dev) +{ + int natids, err; + struct t3c_data *t; + struct tid_range stid_range, tid_range; + struct ddp_params ddp; + struct mtutab mtutab; + unsigned int l2t_capacity; + + t = kcalloc(1, sizeof(*t), GFP_KERNEL); + if (!t) + return -ENOMEM; + + err = -EOPNOTSUPP; + if (dev->ctl(dev, GET_TX_MAX_CHUNK, &t->tx_max_chunk) < 0 || + dev->ctl(dev, GET_MAX_OUTSTANDING_WR, &t->max_wrs) < 0 || + dev->ctl(dev, GET_L2T_CAPACITY, &l2t_capacity) < 0 || + dev->ctl(dev, GET_MTUS, &mtutab) < 0 || + dev->ctl(dev, GET_DDP_PARAMS, &ddp) < 0 || + dev->ctl(dev, GET_TID_RANGE, &tid_range) < 0 || + dev->ctl(dev, GET_STID_RANGE, &stid_range) < 0) + goto out_free; + + err = -ENOMEM; + L2DATA(dev) = t3_init_l2t(l2t_capacity); + if (!L2DATA(dev)) + goto out_free; + + natids = min(tid_range.num / 2, MAX_ATIDS); + err = init_tid_tabs(&t->tid_maps, tid_range.num, natids, + stid_range.num, ATID_BASE, stid_range.base); + if (err) + goto out_free_l2t; + + t->mtus = mtutab.mtus; + t->nmtus = mtutab.size; + + t->ddp_llimit = ddp.llimit; + t->ddp_ulimit = ddp.ulimit; + t->ddp_tagmask = ddp.tag_mask; + + INIT_LIST_HEAD(&t->list_node); + t->dev = dev; + + T3C_DATA(dev) = t; + dev->recv = process_rx; + dev->neigh_update = t3_l2t_update; + + /* All setup completed, let the driver know. 
*/ + err = dev->open(dev); + if (err) + goto free_all; + + t3cdev_proc_init(dev); + return 0; + +free_all: + dev->recv = NULL; + dev->neigh_update = NULL; + T3C_DATA(dev) = NULL; + free_tid_maps(&t->tid_maps); +out_free_l2t: + t3_free_l2t(L2DATA(dev)); + L2DATA(dev) = NULL; +out_free: + kfree(t); + return err; +} + +void __init t3cdev_init(void) +{ + int i; + + if (t3c_proc_init()) + printk(KERN_WARNING "Unable to create /proc/net/t3c dir\n"); + + for (i = 0; i < NUM_CPL_CMDS; ++i) + cpl_handlers[i] = do_bad_cpl; + return; +} + +void __exit t3cdev_exit(void) +{ + t3c_proc_cleanup(); + return; +} + diff --git a/drivers/infiniband/hw/cxgb3/t3c/tcb.h b/drivers/infiniband/hw/cxgb3/t3c/tcb.h new file mode 100644 index 0000000..64d6e17 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/tcb.h @@ -0,0 +1,378 @@ +/* This file is automatically generated --- do not edit */ + +#ifndef _TCB_DEFS_H +#define _TCB_DEFS_H + +#define W_TCB_T_STATE 0 +#define S_TCB_T_STATE 0 +#define M_TCB_T_STATE 0xfULL +#define V_TCB_T_STATE(x) ((x) << S_TCB_T_STATE) + +#define W_TCB_TIMER 0 +#define S_TCB_TIMER 4 +#define M_TCB_TIMER 0x1ULL +#define V_TCB_TIMER(x) ((x) << S_TCB_TIMER) + +#define W_TCB_DACK_TIMER 0 +#define S_TCB_DACK_TIMER 5 +#define M_TCB_DACK_TIMER 0x1ULL +#define V_TCB_DACK_TIMER(x) ((x) << S_TCB_DACK_TIMER) + +#define W_TCB_DEL_FLAG 0 +#define S_TCB_DEL_FLAG 6 +#define M_TCB_DEL_FLAG 0x1ULL +#define V_TCB_DEL_FLAG(x) ((x) << S_TCB_DEL_FLAG) + +#define W_TCB_L2T_IX 0 +#define S_TCB_L2T_IX 7 +#define M_TCB_L2T_IX 0x7ffULL +#define V_TCB_L2T_IX(x) ((x) << S_TCB_L2T_IX) + +#define W_TCB_SMAC_SEL 0 +#define S_TCB_SMAC_SEL 18 +#define M_TCB_SMAC_SEL 0x3ULL +#define V_TCB_SMAC_SEL(x) ((x) << S_TCB_SMAC_SEL) + +#define W_TCB_TOS 0 +#define S_TCB_TOS 20 +#define M_TCB_TOS 0x3fULL +#define V_TCB_TOS(x) ((x) << S_TCB_TOS) + +#define W_TCB_MAX_RT 0 +#define S_TCB_MAX_RT 26 +#define M_TCB_MAX_RT 0xfULL +#define V_TCB_MAX_RT(x) ((x) << S_TCB_MAX_RT) + +#define W_TCB_T_RXTSHIFT 0 +#define 
S_TCB_T_RXTSHIFT 30 +#define M_TCB_T_RXTSHIFT 0xfULL +#define V_TCB_T_RXTSHIFT(x) ((x) << S_TCB_T_RXTSHIFT) + +#define W_TCB_T_DUPACKS 1 +#define S_TCB_T_DUPACKS 2 +#define M_TCB_T_DUPACKS 0xfULL +#define V_TCB_T_DUPACKS(x) ((x) << S_TCB_T_DUPACKS) + +#define W_TCB_T_MAXSEG 1 +#define S_TCB_T_MAXSEG 6 +#define M_TCB_T_MAXSEG 0xfULL +#define V_TCB_T_MAXSEG(x) ((x) << S_TCB_T_MAXSEG) + +#define W_TCB_T_FLAGS1 1 +#define S_TCB_T_FLAGS1 10 +#define M_TCB_T_FLAGS1 0xffffffffULL +#define V_TCB_T_FLAGS1(x) ((x) << S_TCB_T_FLAGS1) + +#define W_TCB_T_FLAGS2 2 +#define S_TCB_T_FLAGS2 10 +#define M_TCB_T_FLAGS2 0x7fULL +#define V_TCB_T_FLAGS2(x) ((x) << S_TCB_T_FLAGS2) + +#define W_TCB_SND_SCALE 2 +#define S_TCB_SND_SCALE 17 +#define M_TCB_SND_SCALE 0xfULL +#define V_TCB_SND_SCALE(x) ((x) << S_TCB_SND_SCALE) + +#define W_TCB_RCV_SCALE 2 +#define S_TCB_RCV_SCALE 21 +#define M_TCB_RCV_SCALE 0xfULL +#define V_TCB_RCV_SCALE(x) ((x) << S_TCB_RCV_SCALE) + +#define W_TCB_SND_UNA_RAW 2 +#define S_TCB_SND_UNA_RAW 25 +#define M_TCB_SND_UNA_RAW 0x7ffffffULL +#define V_TCB_SND_UNA_RAW(x) ((x) << S_TCB_SND_UNA_RAW) + +#define W_TCB_SND_NXT_RAW 3 +#define S_TCB_SND_NXT_RAW 20 +#define M_TCB_SND_NXT_RAW 0x7ffffffULL +#define V_TCB_SND_NXT_RAW(x) ((x) << S_TCB_SND_NXT_RAW) + +#define W_TCB_RCV_NXT 4 +#define S_TCB_RCV_NXT 15 +#define M_TCB_RCV_NXT 0xffffffffULL +#define V_TCB_RCV_NXT(x) ((x) << S_TCB_RCV_NXT) + +#define W_TCB_RCV_ADV 5 +#define S_TCB_RCV_ADV 15 +#define M_TCB_RCV_ADV 0xffffULL +#define V_TCB_RCV_ADV(x) ((x) << S_TCB_RCV_ADV) + +#define W_TCB_SND_MAX_RAW 5 +#define S_TCB_SND_MAX_RAW 31 +#define M_TCB_SND_MAX_RAW 0x7ffffffULL +#define V_TCB_SND_MAX_RAW(x) ((x) << S_TCB_SND_MAX_RAW) + +#define W_TCB_SND_CWND 6 +#define S_TCB_SND_CWND 26 +#define M_TCB_SND_CWND 0x7ffffffULL +#define V_TCB_SND_CWND(x) ((x) << S_TCB_SND_CWND) + +#define W_TCB_SND_SSTHRESH 7 +#define S_TCB_SND_SSTHRESH 21 +#define M_TCB_SND_SSTHRESH 0x7ffffffULL +#define V_TCB_SND_SSTHRESH(x) ((x) << 
S_TCB_SND_SSTHRESH) + +#define W_TCB_T_RTT_TS_RECENT_AGE 8 +#define S_TCB_T_RTT_TS_RECENT_AGE 16 +#define M_TCB_T_RTT_TS_RECENT_AGE 0xffffffffULL +#define V_TCB_T_RTT_TS_RECENT_AGE(x) ((x) << S_TCB_T_RTT_TS_RECENT_AGE) + +#define W_TCB_T_RTSEQ_RECENT 9 +#define S_TCB_T_RTSEQ_RECENT 16 +#define M_TCB_T_RTSEQ_RECENT 0xffffffffULL +#define V_TCB_T_RTSEQ_RECENT(x) ((x) << S_TCB_T_RTSEQ_RECENT) + +#define W_TCB_T_SRTT 10 +#define S_TCB_T_SRTT 16 +#define M_TCB_T_SRTT 0xffffULL +#define V_TCB_T_SRTT(x) ((x) << S_TCB_T_SRTT) + +#define W_TCB_T_RTTVAR 11 +#define S_TCB_T_RTTVAR 0 +#define M_TCB_T_RTTVAR 0xffffULL +#define V_TCB_T_RTTVAR(x) ((x) << S_TCB_T_RTTVAR) + +#define W_TCB_TS_LAST_ACK_SENT_RAW 11 +#define S_TCB_TS_LAST_ACK_SENT_RAW 16 +#define M_TCB_TS_LAST_ACK_SENT_RAW 0x7ffffffULL +#define V_TCB_TS_LAST_ACK_SENT_RAW(x) ((x) << S_TCB_TS_LAST_ACK_SENT_RAW) + +#define W_TCB_DIP 12 +#define S_TCB_DIP 11 +#define M_TCB_DIP 0xffffffffULL +#define V_TCB_DIP(x) ((x) << S_TCB_DIP) + +#define W_TCB_SIP 13 +#define S_TCB_SIP 11 +#define M_TCB_SIP 0xffffffffULL +#define V_TCB_SIP(x) ((x) << S_TCB_SIP) + +#define W_TCB_DP 14 +#define S_TCB_DP 11 +#define M_TCB_DP 0xffffULL +#define V_TCB_DP(x) ((x) << S_TCB_DP) + +#define W_TCB_SP 14 +#define S_TCB_SP 27 +#define M_TCB_SP 0xffffULL +#define V_TCB_SP(x) ((x) << S_TCB_SP) + +#define W_TCB_TIMESTAMP 15 +#define S_TCB_TIMESTAMP 11 +#define M_TCB_TIMESTAMP 0xffffffffULL +#define V_TCB_TIMESTAMP(x) ((x) << S_TCB_TIMESTAMP) + +#define W_TCB_TIMESTAMP_OFFSET 16 +#define S_TCB_TIMESTAMP_OFFSET 11 +#define M_TCB_TIMESTAMP_OFFSET 0xfULL +#define V_TCB_TIMESTAMP_OFFSET(x) ((x) << S_TCB_TIMESTAMP_OFFSET) + +#define W_TCB_TX_MAX 16 +#define S_TCB_TX_MAX 15 +#define M_TCB_TX_MAX 0xffffffffULL +#define V_TCB_TX_MAX(x) ((x) << S_TCB_TX_MAX) + +#define W_TCB_TX_HDR_PTR_RAW 17 +#define S_TCB_TX_HDR_PTR_RAW 15 +#define M_TCB_TX_HDR_PTR_RAW 0x1ffffULL +#define V_TCB_TX_HDR_PTR_RAW(x) ((x) << S_TCB_TX_HDR_PTR_RAW) + +#define W_TCB_TX_LAST_PTR_RAW 
18 +#define S_TCB_TX_LAST_PTR_RAW 0 +#define M_TCB_TX_LAST_PTR_RAW 0x1ffffULL +#define V_TCB_TX_LAST_PTR_RAW(x) ((x) << S_TCB_TX_LAST_PTR_RAW) + +#define W_TCB_TX_COMPACT 18 +#define S_TCB_TX_COMPACT 17 +#define M_TCB_TX_COMPACT 0x1ULL +#define V_TCB_TX_COMPACT(x) ((x) << S_TCB_TX_COMPACT) + +#define W_TCB_RX_COMPACT 18 +#define S_TCB_RX_COMPACT 18 +#define M_TCB_RX_COMPACT 0x1ULL +#define V_TCB_RX_COMPACT(x) ((x) << S_TCB_RX_COMPACT) + +#define W_TCB_RCV_WND 18 +#define S_TCB_RCV_WND 19 +#define M_TCB_RCV_WND 0x7ffffffULL +#define V_TCB_RCV_WND(x) ((x) << S_TCB_RCV_WND) + +#define W_TCB_RX_HDR_OFFSET 19 +#define S_TCB_RX_HDR_OFFSET 14 +#define M_TCB_RX_HDR_OFFSET 0x7ffffffULL +#define V_TCB_RX_HDR_OFFSET(x) ((x) << S_TCB_RX_HDR_OFFSET) + +#define W_TCB_RX_FRAG0_START_IDX_RAW 20 +#define S_TCB_RX_FRAG0_START_IDX_RAW 9 +#define M_TCB_RX_FRAG0_START_IDX_RAW 0x7ffffffULL +#define V_TCB_RX_FRAG0_START_IDX_RAW(x) ((x) << S_TCB_RX_FRAG0_START_IDX_RAW) + +#define W_TCB_RX_FRAG1_START_IDX_OFFSET 21 +#define S_TCB_RX_FRAG1_START_IDX_OFFSET 4 +#define M_TCB_RX_FRAG1_START_IDX_OFFSET 0x7ffffffULL +#define V_TCB_RX_FRAG1_START_IDX_OFFSET(x) ((x) << S_TCB_RX_FRAG1_START_IDX_OFFSET) + +#define W_TCB_RX_FRAG0_LEN 21 +#define S_TCB_RX_FRAG0_LEN 31 +#define M_TCB_RX_FRAG0_LEN 0x7ffffffULL +#define V_TCB_RX_FRAG0_LEN(x) ((x) << S_TCB_RX_FRAG0_LEN) + +#define W_TCB_RX_FRAG1_LEN 22 +#define S_TCB_RX_FRAG1_LEN 26 +#define M_TCB_RX_FRAG1_LEN 0x7ffffffULL +#define V_TCB_RX_FRAG1_LEN(x) ((x) << S_TCB_RX_FRAG1_LEN) + +#define W_TCB_NEWRENO_RECOVER 23 +#define S_TCB_NEWRENO_RECOVER 21 +#define M_TCB_NEWRENO_RECOVER 0x7ffffffULL +#define V_TCB_NEWRENO_RECOVER(x) ((x) << S_TCB_NEWRENO_RECOVER) + +#define W_TCB_PDU_HAVE_LEN 24 +#define S_TCB_PDU_HAVE_LEN 16 +#define M_TCB_PDU_HAVE_LEN 0x1ULL +#define V_TCB_PDU_HAVE_LEN(x) ((x) << S_TCB_PDU_HAVE_LEN) + +#define W_TCB_PDU_LEN 24 +#define S_TCB_PDU_LEN 17 +#define M_TCB_PDU_LEN 0xffffULL +#define V_TCB_PDU_LEN(x) ((x) << S_TCB_PDU_LEN) + +#define 
W_TCB_RX_QUIESCE 25 +#define S_TCB_RX_QUIESCE 1 +#define M_TCB_RX_QUIESCE 0x1ULL +#define V_TCB_RX_QUIESCE(x) ((x) << S_TCB_RX_QUIESCE) + +#define W_TCB_RX_PTR_RAW 25 +#define S_TCB_RX_PTR_RAW 2 +#define M_TCB_RX_PTR_RAW 0x1ffffULL +#define V_TCB_RX_PTR_RAW(x) ((x) << S_TCB_RX_PTR_RAW) + +#define W_TCB_CPU_NO 25 +#define S_TCB_CPU_NO 19 +#define M_TCB_CPU_NO 0x7fULL +#define V_TCB_CPU_NO(x) ((x) << S_TCB_CPU_NO) + +#define W_TCB_ULP_TYPE 25 +#define S_TCB_ULP_TYPE 26 +#define M_TCB_ULP_TYPE 0xfULL +#define V_TCB_ULP_TYPE(x) ((x) << S_TCB_ULP_TYPE) + +#define S_TF_DACK 10 +#define V_TF_DACK(x) ((x) << S_TF_DACK) + +#define S_TF_NAGLE 11 +#define V_TF_NAGLE(x) ((x) << S_TF_NAGLE) + +#define S_TF_RECV_SCALE 12 +#define V_TF_RECV_SCALE(x) ((x) << S_TF_RECV_SCALE) + +#define S_TF_RECV_TSTMP 13 +#define V_TF_RECV_TSTMP(x) ((x) << S_TF_RECV_TSTMP) + +#define S_TF_RECV_SACK 14 +#define V_TF_RECV_SACK(x) ((x) << S_TF_RECV_SACK) + +#define S_TF_TURBO 15 +#define V_TF_TURBO(x) ((x) << S_TF_TURBO) + +#define S_TF_KEEPALIVE 16 +#define V_TF_KEEPALIVE(x) ((x) << S_TF_KEEPALIVE) + +#define S_TF_TCAM_BYPASS 17 +#define V_TF_TCAM_BYPASS(x) ((x) << S_TF_TCAM_BYPASS) + +#define S_TF_CORE_FIN 18 +#define V_TF_CORE_FIN(x) ((x) << S_TF_CORE_FIN) + +#define S_TF_CORE_MORE 19 +#define V_TF_CORE_MORE(x) ((x) << S_TF_CORE_MORE) + +#define S_TF_MIGRATING 20 +#define V_TF_MIGRATING(x) ((x) << S_TF_MIGRATING) + +#define S_TF_ACTIVE_OPEN 21 +#define V_TF_ACTIVE_OPEN(x) ((x) << S_TF_ACTIVE_OPEN) + +#define S_TF_ASK_MODE 22 +#define V_TF_ASK_MODE(x) ((x) << S_TF_ASK_MODE) + +#define S_TF_NON_OFFLOAD 23 +#define V_TF_NON_OFFLOAD(x) ((x) << S_TF_NON_OFFLOAD) + +#define S_TF_MOD_SCHD 24 +#define V_TF_MOD_SCHD(x) ((x) << S_TF_MOD_SCHD) + +#define S_TF_MOD_SCHD_REASON0 25 +#define V_TF_MOD_SCHD_REASON0(x) ((x) << S_TF_MOD_SCHD_REASON0) + +#define S_TF_MOD_SCHD_REASON1 26 +#define V_TF_MOD_SCHD_REASON1(x) ((x) << S_TF_MOD_SCHD_REASON1) + +#define S_TF_MOD_SCHD_RX 27 +#define V_TF_MOD_SCHD_RX(x) ((x) << 
S_TF_MOD_SCHD_RX) + +#define S_TF_CORE_PUSH 28 +#define V_TF_CORE_PUSH(x) ((x) << S_TF_CORE_PUSH) + +#define S_TF_RCV_COALESCE_ENABLE 29 +#define V_TF_RCV_COALESCE_ENABLE(x) ((x) << S_TF_RCV_COALESCE_ENABLE) + +#define S_TF_RCV_COALESCE_PUSH 30 +#define V_TF_RCV_COALESCE_PUSH(x) ((x) << S_TF_RCV_COALESCE_PUSH) + +#define S_TF_RCV_COALESCE_LAST_PSH 31 +#define V_TF_RCV_COALESCE_LAST_PSH(x) ((x) << S_TF_RCV_COALESCE_LAST_PSH) + +#define S_TF_RCV_COALESCE_HEARTBEAT 32 +#define V_TF_RCV_COALESCE_HEARTBEAT(x) ((x) << S_TF_RCV_COALESCE_HEARTBEAT) + +#define S_TF_HALF_CLOSE 33 +#define V_TF_HALF_CLOSE(x) ((x) << S_TF_HALF_CLOSE) + +#define S_TF_DACK_MSS 34 +#define V_TF_DACK_MSS(x) ((x) << S_TF_DACK_MSS) + +#define S_TF_CCTRL_SEL0 35 +#define V_TF_CCTRL_SEL0(x) ((x) << S_TF_CCTRL_SEL0) + +#define S_TF_CCTRL_SEL1 36 +#define V_TF_CCTRL_SEL1(x) ((x) << S_TF_CCTRL_SEL1) + +#define S_TF_TCP_NEWRENO_FAST_RECOVERY 37 +#define V_TF_TCP_NEWRENO_FAST_RECOVERY(x) ((x) << S_TF_TCP_NEWRENO_FAST_RECOVERY) + +#define S_TF_TX_PACE_AUTO 38 +#define V_TF_TX_PACE_AUTO(x) ((x) << S_TF_TX_PACE_AUTO) + +#define S_TF_PEER_FIN_HELD 39 +#define V_TF_PEER_FIN_HELD(x) ((x) << S_TF_PEER_FIN_HELD) + +#define S_TF_CORE_URG 40 +#define V_TF_CORE_URG(x) ((x) << S_TF_CORE_URG) + +#define S_TF_RDMA_ERROR 41 +#define V_TF_RDMA_ERROR(x) ((x) << S_TF_RDMA_ERROR) + +#define S_TF_SSWS_DISABLED 42 +#define V_TF_SSWS_DISABLED(x) ((x) << S_TF_SSWS_DISABLED) + +#define S_TF_DUPACK_COUNT_ODD 43 +#define V_TF_DUPACK_COUNT_ODD(x) ((x) << S_TF_DUPACK_COUNT_ODD) + +#define S_TF_TX_CHANNEL 44 +#define V_TF_TX_CHANNEL(x) ((x) << S_TF_TX_CHANNEL) + +#define S_TF_RX_CHANNEL 45 +#define V_TF_RX_CHANNEL(x) ((x) << S_TF_RX_CHANNEL) + +#define S_TF_TX_PACE_FIXED 46 +#define V_TF_TX_PACE_FIXED(x) ((x) << S_TF_TX_PACE_FIXED) + +#define S_TF_RDMA_FLM_ERROR 47 +#define V_TF_RDMA_FLM_ERROR(x) ((x) << S_TF_RDMA_FLM_ERROR) + +#define S_TF_RX_FLOW_CONTROL_DISABLE 48 +#define V_TF_RX_FLOW_CONTROL_DISABLE(x) ((x) << 
S_TF_RX_FLOW_CONTROL_DISABLE) + +#endif /* _TCB_DEFS_H */ From swise at opengridcomputing.com Fri Jun 23 07:30:21 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:21 -0500 Subject: [openib-general] [PATCH v2 11/14] CXGB3 Core ULP Demux Code. In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143021.32410.63281.stgit@stevo-desktop> This code demuxes connection data and events from the LLD driver to the various registered ULPs. It also has the cxgb3 core module init logic, which includes registering with the Network Event Notifier to obtain L2/L3 events. --- drivers/infiniband/hw/cxgb3/t3c/t3c.c | 504 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/t3c/t3c.h | 188 ++++++++++++ 2 files changed, 692 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/t3c/t3c.c b/drivers/infiniband/hw/cxgb3/t3c/t3c.c new file mode 100644 index 0000000..53d978a --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/t3c.c @@ -0,0 +1,504 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
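The client/device demux described in the commit message (registered ULP clients are told about every existing device, and a newly attached device is announced to every registered client, as in `t3c_register_client()`/`add_t3cdev()`) can be sketched in plain user-space C. Everything here is a hypothetical simplification: fixed arrays instead of kernel lists, no locking, and a counter in place of real `add` callbacks.

```c
#define MAX_CLIENTS 4
#define MAX_DEVS    4

struct client {
	const char *name;
	int adds_seen;		/* how many device-add events this client saw */
};

static struct client *clients[MAX_CLIENTS];
static const char *devs[MAX_DEVS];
static int nclients, ndevs;

/* Register a client and replay all currently known devices to it,
 * as t3c_register_client() does by walking t3cdev_list. */
static void register_client(struct client *c)
{
	int i;

	clients[nclients++] = c;
	for (i = 0; i < ndevs; i++)
		c->adds_seen++;
}

/* Attach a new device and announce it to every registered client,
 * as add_t3cdev() does by walking client_list. */
static void add_device(const char *name)
{
	int i;

	devs[ndevs++] = name;
	for (i = 0; i < nclients; i++)
		clients[i]->adds_seen++;
}
```

The symmetry matters: whichever side arrives second (client or device) still sees the other, so registration order never loses an add notification.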
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include + +#include "defs.h" +#include "l2t.h" +#include +#include "t3c.h" +#include + +// #define T3C_DEBUG + +#define MOD "t3c: " +#ifdef T3C_DEBUG +#define assert(expr) \ + if(!(expr)) { \ + printk(KERN_ERR MOD "Assertion failed! %s, %s, %s, line %d\n",\ + #expr, __FILE__, __FUNCTION__, __LINE__); \ + } +#define dprintk(fmt, args...) do {printk(KERN_INFO MOD fmt, ##args);} while (0) +#else +#define assert(expr) do {} while (0) +#define dprintk(fmt, args...) 
do {} while (0) +#endif + +MODULE_AUTHOR("Steve Wise "); +MODULE_DESCRIPTION("Chelsio T3 Core Module"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION("1.0"); + +void __init t3cdev_init(void); +void __exit t3cdev_exit(void); +void unregister_t3cdev(struct t3cdev *dev); +void register_t3cdev(struct t3cdev *dev, const char *name); + +static LIST_HEAD(client_list); + +void t3c_register_client(struct t3c_client *client) +{ + struct t3cdev *tdev; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mutex_lock(&t3cdev_db_lock); + list_add_tail(&client->client_list, &client_list); + list_for_each_entry(tdev, &t3cdev_list, t3c_list) { + if (client->add) { + dprintk("%s - calling %s add fn with t3cdev %s\n", + __FUNCTION__, client->name, tdev->name); + client->add(tdev); + } + } + mutex_unlock(&t3cdev_db_lock); +} +EXPORT_SYMBOL(t3c_register_client); + +void t3c_unregister_client(struct t3c_client *client) +{ + struct t3cdev *tdev; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mutex_lock(&t3cdev_db_lock); + list_del(&client->client_list); + list_for_each_entry(tdev, &t3cdev_list, t3c_list) { + if (client->remove) { + dprintk("%s - calling %s remove fn with t3cdev %s\n", + __FUNCTION__, client->name, tdev->name); + client->remove(tdev); + } + } + mutex_unlock(&t3cdev_db_lock); +} +EXPORT_SYMBOL(t3c_unregister_client); + +/* + * Called by t3's pci add function. + */ +static void add_t3cdev(struct t3cdev *tdev) +{ + struct t3c_client *client; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + register_t3cdev(tdev, "cxgb3c%d"); + attach_t3cdev(tdev); + list_for_each_entry(client, &client_list, client_list) { + if (client->add) { + dprintk("%s - calling %s add fn with t3cdev %s\n", + __FUNCTION__, client->name, tdev->name); + client->add(tdev); + } + } +} + +/* + * Called by t3's pci remove function. 
+ */ +static void remove_t3cdev(struct t3cdev *tdev) +{ + struct t3c_client *client; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + list_for_each_entry(client, &client_list, client_list) { + if (client->remove) { + dprintk("%s - calling %s remove fn with t3cdev %s\n", + __FUNCTION__, client->name, tdev->name); + client->remove(tdev); + } + } + detach_t3cdev(tdev); + unregister_t3cdev(tdev); +} + +/* + * Free an active-open TID. + */ +void t3c_free_atid(struct t3cdev *tdev, int atid) +{ + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + union active_open_entry *p = atid2entry(t, atid); + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock_bh(&t->atid_lock); + p->next = t->afree; + t->afree = p; + t->atids_in_use--; + spin_unlock_bh(&t->atid_lock); +} +EXPORT_SYMBOL(t3c_free_atid); + +/* + * Free a server TID and return it to the free pool. + */ +void t3c_free_stid(struct t3cdev *tdev, int stid) +{ + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + union listen_entry *p = stid2entry(t, stid); + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock_bh(&t->stid_lock); + p->next = t->sfree; + t->sfree = p; + t->stids_in_use--; + spin_unlock_bh(&t->stid_lock); +} +EXPORT_SYMBOL(t3c_free_stid); + +void t3c_insert_tid(struct t3cdev *tdev, struct t3c_client *client, + void *ctx, unsigned int tid) +{ + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t->tid_tab[tid].client = client; + t->tid_tab[tid].ctx = ctx; + atomic_inc(&t->tids_in_use); +} +EXPORT_SYMBOL(t3c_insert_tid); + +/* + * Remove a TID from the TID table. A client may defer processing its last + * CPL message if it is locked at the time it arrives, and while the message + * sits in the client's backlog the TID may be reused for another connection.
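The atid/stid scheme used by `t3c_free_atid()`/`t3c_free_stid()` above and the matching alloc paths is a classic array-backed free list: each table slot doubles as a free-list link while unallocated, so alloc pops the head and free pushes the slot back in O(1). A hedged user-space sketch, with hypothetical names (`pool_init`, `alloc_id`, `free_id`) and the spinlocks and per-device state stripped out:

```c
#include <stddef.h>

#define NIDS    8
#define ID_BASE 0x100		/* IDs live in [ID_BASE, ID_BASE + NIDS) */

union entry {
	union entry *next;	/* valid while the entry is on the free list */
	void *ctx;		/* valid while the entry is allocated */
};

static union entry tab[NIDS];
static union entry *free_head;

/* Thread every slot onto the free list, as init_tid_tabs() does. */
static void pool_init(void)
{
	int i;

	for (i = 0; i < NIDS - 1; i++)
		tab[i].next = &tab[i + 1];
	tab[NIDS - 1].next = NULL;
	free_head = tab;
}

/* Pop the head of the free list; the ID is the slot index plus a base,
 * as in t3c_alloc_atid(). Returns -1 when the pool is exhausted. */
static int alloc_id(void *ctx)
{
	union entry *p = free_head;

	if (!p)
		return -1;
	free_head = p->next;
	p->ctx = ctx;
	return (int)(p - tab) + ID_BASE;
}

/* Push the slot back onto the free list, as in t3c_free_atid(). */
static void free_id(int id)
{
	union entry *p = &tab[id - ID_BASE];

	p->next = free_head;
	free_head = p;
}
```

Because a freed slot goes back at the head, the most recently freed ID is the next one handed out, which is exactly the reuse hazard the comment above (and `t3c_remove_tid()`'s `cmpxchg`) guards against.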
+ * To handle this we atomically switch the TID association if it still points + * to the original client context. + */ +void t3c_remove_tid(struct t3cdev *tdev, void *ctx, unsigned int tid) +{ + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + cmpxchg(&t->tid_tab[tid].ctx, ctx, NULL); + atomic_dec(&t->tids_in_use); +} +EXPORT_SYMBOL(t3c_remove_tid); + +int t3c_alloc_atid(struct t3cdev *tdev, struct t3c_client *client, void *ctx) +{ + int atid = -1; + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock_bh(&t->atid_lock); + if (t->afree) { + union active_open_entry *p = t->afree; + + atid = (p - t->atid_tab) + t->atid_base; + t->afree = p->next; + p->t3c_tid.ctx = ctx; + p->t3c_tid.client = client; + t->atids_in_use++; + } + spin_unlock_bh(&t->atid_lock); + return atid; +} +EXPORT_SYMBOL(t3c_alloc_atid); + +int t3c_alloc_stid(struct t3cdev *tdev, struct t3c_client *client, void *ctx) +{ + int stid = -1; + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock_bh(&t->stid_lock); + if (t->sfree) { + union listen_entry *p = t->sfree; + + stid = (p - t->stid_tab) + t->stid_base; + t->sfree = p->next; + p->t3c_tid.ctx = ctx; + p->t3c_tid.client = client; + t->stids_in_use++; + } + spin_unlock_bh(&t->stid_lock); + return stid; +} +EXPORT_SYMBOL(t3c_alloc_stid); + +static int do_smt_write_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_smt_write_rpl *rpl = cplhdr(skb); + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + if (rpl->status != CPL_ERR_NONE) + printk(KERN_ERR + "Unexpected SMT_WRITE_RPL status %u for entry %u\n", + rpl->status, GET_TID(rpl)); + + return CPL_RET_BUF_DONE; +} + +static int do_l2t_write_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_l2t_write_rpl *rpl = 
cplhdr(skb); + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + if (rpl->status != CPL_ERR_NONE) + printk(KERN_ERR + "Unexpected L2T_WRITE_RPL status %u for entry %u\n", + rpl->status, GET_TID(rpl)); + + return CPL_RET_BUF_DONE; +} + +static int do_act_open_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_act_open_rpl *rpl = cplhdr(skb); + unsigned int atid = G_TID(ntohl(rpl->atid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3c_tid = lookup_atid(&(T3C_DATA(dev))->tid_maps, atid); + if (t3c_tid->ctx && t3c_tid->client && t3c_tid->client->handlers && + t3c_tid->client->handlers[CPL_ACT_OPEN_RPL]) { + return t3c_tid->client->handlers[CPL_ACT_OPEN_RPL] (dev, skb, + t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, CPL_ACT_OPEN_RPL); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_stid_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + union opcode_tid *p = cplhdr(skb); + unsigned int stid = G_TID(ntohl(p->opcode_tid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3c_tid = lookup_stid(&(T3C_DATA(dev))->tid_maps, stid); + if (t3c_tid->ctx && t3c_tid->client->handlers && + t3c_tid->client->handlers[p->opcode]) { + return t3c_tid->client->handlers[p->opcode] (dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, p->opcode); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_hwtid_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + union opcode_tid *p = cplhdr(skb); + unsigned int hwtid = G_TID(ntohl(p->opcode_tid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u) opcode 0x%x tid %d\n", + __FUNCTION__, __FILE__, __LINE__, p->opcode, hwtid); + t3c_tid = lookup_tid(&(T3C_DATA(dev))->tid_maps, hwtid); + if (t3c_tid->ctx && 
t3c_tid->client->handlers && + t3c_tid->client->handlers[p->opcode]) { + return t3c_tid->client->handlers[p->opcode] + (dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, p->opcode); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_cr(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_pass_accept_req *req = cplhdr(skb); + unsigned int stid = G_PASS_OPEN_TID(ntohl(req->tos_tid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3c_tid = lookup_stid(&(T3C_DATA(dev))->tid_maps, stid); + if (t3c_tid->ctx && t3c_tid->client->handlers && + t3c_tid->client->handlers[CPL_PASS_ACCEPT_REQ]) { + return t3c_tid->client->handlers[CPL_PASS_ACCEPT_REQ] + (dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, CPL_PASS_ACCEPT_REQ); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_act_establish(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_act_establish *req = cplhdr(skb); + unsigned int atid = G_PASS_OPEN_TID(ntohl(req->tos_tid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3c_tid = lookup_atid(&(T3C_DATA(dev))->tid_maps, atid); + if (t3c_tid->ctx && t3c_tid->client->handlers && + t3c_tid->client->handlers[CPL_ACT_ESTABLISH]) { + return t3c_tid->client->handlers[CPL_ACT_ESTABLISH] + (dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, CPL_ACT_ESTABLISH); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_set_tcb_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_set_tcb_rpl *rpl = cplhdr(skb); + + if (rpl->status != CPL_ERR_NONE) + printk(KERN_ERR + "Unexpected SET_TCB_RPL status %u for tid %u\n", + rpl->status, GET_TID(rpl)); + return CPL_RET_BUF_DONE; +} + +static int do_trace(struct t3cdev
*dev, struct sk_buff *skb) +{ + struct cpl_trace_pkt *p = cplhdr(skb); + + skb->protocol = htons(0xffff); + skb->dev = dev->lldev; + skb_pull(skb, sizeof(*p)); + skb->mac.raw = skb->data; + netif_receive_skb(skb); + return 0; +} + +static int do_term(struct t3cdev *dev, struct sk_buff *skb) +{ + unsigned int hwtid = ntohl(skb->priority) >> 8 & 0xfffff; + unsigned int opcode = G_OPCODE(ntohl(skb->csum)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u) opcode 0x%x tid %d\n", + __FUNCTION__, __FILE__, __LINE__, opcode, hwtid); + + t3c_tid = lookup_tid(&(T3C_DATA(dev))->tid_maps, hwtid); + if (t3c_tid->ctx && t3c_tid->client->handlers && + t3c_tid->client->handlers[opcode]) { + return t3c_tid->client->handlers[opcode](dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, opcode); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int nb_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + switch (event) { + case (NETEVENT_NEIGH_UPDATE): { + t3c_neigh_update((struct neighbour *)ctx, 0); + break; + } + case (NETEVENT_ROUTE_UPDATE): + dprintk("%s ROUTE_UPDATE\n", __FUNCTION__); + break; + case (NETEVENT_PMTU_UPDATE): + dprintk("%s PMTU_UPDATE\n", __FUNCTION__); + break; + case (NETEVENT_REDIRECT): { + struct netevent_redirect *nr = ctx; + dprintk("%s REDIRECT old dst %p new dst %p " + "old neigh %p new neigh %p old neigh key %x " + "new neigh key %x\n", __FUNCTION__, + nr->old, nr->new, + nr->old ? nr->old->neighbour : NULL, + nr->new ? nr->new->neighbour : NULL, + nr->old->neighbour ? + *(u32*)nr->old->neighbour->primary_key + : 0, + nr->new->neighbour ?
+ *(u32*)nr->new->neighbour->primary_key + : 0); + t3c_redirect(nr->old, nr->new); + t3c_neigh_update(nr->new->neighbour, 0); + break; + } + default: + printk(KERN_ERR "unknown net event notifier type %lu\n", + event); + break; + } + return 0; +} + +static struct notifier_block nb = { + .notifier_call = nb_callback +}; + + +/* + * upcall struct for the t3 module. + */ +static struct t3_core core = { + .add = add_t3cdev, + .remove = remove_t3cdev, +}; + +int __init t3c_init(void) +{ + t3cdev_init(); + register_netevent_notifier(&nb); + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3_register_cpl_handler(CPL_SMT_WRITE_RPL, do_smt_write_rpl); + t3_register_cpl_handler(CPL_L2T_WRITE_RPL, do_l2t_write_rpl); + t3_register_cpl_handler(CPL_PASS_OPEN_RPL, do_stid_rpl); + t3_register_cpl_handler(CPL_CLOSE_LISTSRV_RPL, do_stid_rpl); + t3_register_cpl_handler(CPL_PASS_ACCEPT_REQ, do_cr); + t3_register_cpl_handler(CPL_PASS_ESTABLISH, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ABORT_RPL_RSS, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ABORT_RPL, do_hwtid_rpl); + t3_register_cpl_handler(CPL_RX_DATA, do_hwtid_rpl); + t3_register_cpl_handler(CPL_TX_DATA_ACK, do_hwtid_rpl); + t3_register_cpl_handler(CPL_TX_DMA_ACK, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ACT_OPEN_RPL, do_act_open_rpl); + t3_register_cpl_handler(CPL_PEER_CLOSE, do_hwtid_rpl); + t3_register_cpl_handler(CPL_CLOSE_CON_RPL, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ABORT_REQ_RSS, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ACT_ESTABLISH, do_act_establish); + t3_register_cpl_handler(CPL_SET_TCB_RPL, do_set_tcb_rpl); + t3_register_cpl_handler(CPL_RDMA_TERMINATE, do_term); + t3_register_cpl_handler(CPL_TRACE_PKT, do_trace); + t3_register_core(&core); + + return 0; +} + +static void __exit t3c_exit(void) +{ + dprintk("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3_unregister_core(&core); + t3cdev_exit(); + unregister_netevent_notifier(&nb); + return; +} 
+module_init(t3c_init); +module_exit(t3c_exit); diff --git a/drivers/infiniband/hw/cxgb3/t3c/t3c.h b/drivers/infiniband/hw/cxgb3/t3c/t3c.h new file mode 100644 index 0000000..fdc51a8 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/t3c.h @@ -0,0 +1,188 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _CHELSIO_T3C_H +#define _CHELSIO_T3C_H + +#include +#include + +#include +#include + +#include +#include + +/* + * Client registration. Users of the T3 Core driver must register themselves. 
+ * The T3 Core driver will call the add function of every client for each T3 + * PCI device probed, passing up the t3cdev ptr. Each client fills out an + * array of callback functions to process CPL messages. + */ +typedef int (*t3c_cpl_handler_func)(struct t3cdev *dev, + struct sk_buff *skb, void *ctx); + +struct t3c_client { + char *name; + void (*add) (struct t3cdev *); + void (*remove) (struct t3cdev *); + t3c_cpl_handler_func *handlers; + int (*redirect)(void *ctx, struct dst_entry *old, + struct dst_entry *new, + struct l2t_entry *l2t); + struct list_head client_list; +}; + +void t3c_register_client(struct t3c_client *); +void t3c_unregister_client(struct t3c_client *); + +/* + * TID allocation services. + */ +int t3c_alloc_atid(struct t3cdev *dev, struct t3c_client *client, void *ctx); +int t3c_alloc_stid(struct t3cdev *dev, struct t3c_client *client, void *ctx); +void t3c_free_atid(struct t3cdev *dev, int atid); +void t3c_free_stid(struct t3cdev *dev, int stid); +void t3c_insert_tid(struct t3cdev *dev, struct t3c_client *client, void *ctx, + unsigned int tid); +void t3c_remove_tid(struct t3cdev *dev, void *ctx, unsigned int tid); + +struct t3c_tid_entry { + struct t3c_client *client; + void *ctx; +}; + +/* CPL message priority levels */ +enum { + CPL_PRIORITY_DATA = 0, /* data messages */ + CPL_PRIORITY_SETUP = 1, /* connection setup messages */ + CPL_PRIORITY_TEARDOWN = 0, /* connection teardown messages */ + CPL_PRIORITY_LISTEN = 1, /* listen start/stop messages */ + CPL_PRIORITY_ACK = 1, /* RX ACK messages */ + CPL_PRIORITY_CONTROL = 1 /* TOE control messages */ +}; + +/* Flags for return value of CPL message handlers */ +enum { + CPL_RET_BUF_DONE = 1, // buffer processing done, buffer may be freed + CPL_RET_BAD_MSG = 2, // bad CPL message (e.g., unknown opcode) + CPL_RET_UNKNOWN_TID = 4 // unexpected unknown TID +}; + +typedef int (*cpl_handler_func)(struct t3cdev *dev, struct sk_buff *skb); + +/* + * Returns a pointer to the first byte of the CPL 
header in an sk_buff that + * contains a CPL message. + */ +static inline void *cplhdr(struct sk_buff *skb) +{ + return skb->data; +} + +void t3_register_cpl_handler(unsigned int opcode, cpl_handler_func h); + +union listen_entry { + struct t3c_tid_entry t3c_tid; + union listen_entry *next; +}; + +union active_open_entry { + struct t3c_tid_entry t3c_tid; + union active_open_entry *next; +}; + +/* + * Holds the size, base address, free list start, etc. of the TID, server TID, + * and active-open TID tables for a TOE. The tables themselves are allocated + * dynamically. + */ +struct tid_info { + struct t3c_tid_entry *tid_tab; + unsigned int ntids; + atomic_t tids_in_use; + + union listen_entry *stid_tab; + unsigned int nstids; + unsigned int stid_base; + + union active_open_entry *atid_tab; + unsigned int natids; + unsigned int atid_base; + + /* + * The following members are accessed R/W so we put them in their own + * cache lines. + * + * XXX We could combine the atid fields above with the lock here since + * atids are used once (unlike other tids). OTOH the above fields are + * usually in cache due to tid_tab. + */ + spinlock_t atid_lock ____cacheline_aligned_in_smp; + union active_open_entry *afree; + unsigned int atids_in_use; + + spinlock_t stid_lock ____cacheline_aligned; + union listen_entry *sfree; + unsigned int stids_in_use; +}; + +struct t3c_data { + struct list_head list_node; + struct t3cdev *dev; + unsigned int tx_max_chunk; /* max payload for TX_DATA */ + unsigned int max_wrs; /* max in-flight WRs per connection */ + unsigned int ddp_llimit; /* DDP parameters */ + unsigned int ddp_ulimit; + unsigned int ddp_tagmask; + unsigned int nmtus; + const unsigned short *mtus; + struct tid_info tid_maps; +}; + +/* + * t3cdev -> t3c_data accessor + */ +#define T3C_DATA(dev) (*(struct t3c_data **)&(dev)->l4opt) + +/* XXX REMOVE THIS HACK WHEN 2.6.16 is published!
*/ +#include +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16) +#include +#else +#include +#endif /* XXX end of hack */ + +extern struct mutex t3cdev_db_lock; +extern struct list_head t3cdev_list; + +#endif From swise at opengridcomputing.com Fri Jun 23 07:30:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:26 -0500 Subject: [openib-general] [PATCH v2 12/14] CXGB3 Core L2 Management. In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143026.32410.24373.stgit@stevo-desktop> This code manages the hardware's neighbour table and thus hooks into the native L2/L3 stack. Currently we're using the Netevent Notifier Mechanism patch to detect neighbour changes. This needs more discussion on how RNICs should be made aware of next hop changes... ISSUE: The processing of notification of L2/L3 events should really be in the IWCM, and an interface defined between the IWCM and IW devices to pass pertinent events down to the device. Currently, this is all done in the cxgb3c module since there are no other open source iwarp drivers that need this. --- drivers/infiniband/hw/cxgb3/t3c/l2t.c | 616 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/t3c/l2t.h | 147 ++++++++ 2 files changed, 763 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/t3c/l2t.c b/drivers/infiniband/hw/cxgb3/t3c/l2t.c new file mode 100644 index 0000000..7912950 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/l2t.c @@ -0,0 +1,616 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include "t3cdev.h" +#include "defs.h" +#include "l2t.h" +#include "t3_cpl.h" +#include "firmware_exports.h" + +/* #define L2T_DEBUG */ + +#ifdef L2T_DEBUG +#define dprintk(fmt, args...) do {printk(KERN_INFO fmt, ##args);} while (0) +#else +#define dprintk(fmt, args...) do {} while (0) +#endif + + +#define VLAN_NONE 0xfff + +/* + * Module locking notes: There is a RW lock protecting the L2 table as a + * whole plus a spinlock per L2T entry. Entry lookups and allocations happen + * under the protection of the table lock, individual entry changes happen + * while holding that entry's spinlock. The table lock nests outside the + * entry locks. 
Allocations of new entries take the table lock as writers so + * no other lookups can happen while allocating new entries. Entry updates + * take the table lock as readers so multiple entries can be updated in + * parallel. An L2T entry can be dropped by decrementing its reference count + * and therefore can happen in parallel with entry allocation but no entry + * can change state or increment its ref count during allocation as both of + * these perform lookups. + * + * Note: We do not take references to net_devices in this module because both + * the TOE and the sockets already hold references to the interfaces and the + * lifetime of an L2T entry is fully contained in the lifetime of the TOE. + */ + +static inline unsigned int vlan_prio(const struct l2t_entry *e) +{ + return e->vlan >> 13; +} + +static inline unsigned int arp_hash(u32 key, int ifindex, + const struct l2t_data *d) +{ + return jhash_2words(key, ifindex, 0) & (d->nentries - 1); +} + +static inline void neigh_replace(struct l2t_entry *e, struct neighbour *n) +{ + dprintk("%s %d e %p e->neigh %p neigh %p\n", __FUNCTION__, __LINE__, + e, e->neigh, n); + neigh_hold(n); + if (e->neigh) + neigh_release(e->neigh); + e->neigh = n; +} + +/* + * Set up an L2T entry and send any packets waiting in the arp queue. The + * supplied skb is used for the CPL_L2T_WRITE_REQ. Must be called with the + * entry locked.
+ */ +static int setup_l2e_send_pending(struct t3cdev *dev, struct sk_buff *skb, + struct l2t_entry *e) +{ + struct cpl_l2t_write_req *req; + + if (!skb) { + skb = alloc_skb(sizeof(*req), GFP_ATOMIC); + if (!skb) + return -ENOMEM; + } + + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, + e->neigh); + req = (struct cpl_l2t_write_req *)__skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_L2T_WRITE_REQ, e->idx)); + req->params = htonl(V_L2T_W_IDX(e->idx) | V_L2T_W_IFF(e->smt_idx) | + V_L2T_W_VLAN(e->vlan & VLAN_VID_MASK) | + V_L2T_W_PRIO(vlan_prio(e))); + memcpy(e->dmac, e->neigh->ha, sizeof(e->dmac)); + memcpy(req->dst_mac, e->dmac, sizeof(req->dst_mac)); + dprintk("%s updating HW idx %d with %02x:%02x:%02x:%02x:%02x:%02x\n", + __FUNCTION__, e->idx, + req->dst_mac[0], + req->dst_mac[1], + req->dst_mac[2], + req->dst_mac[3], + req->dst_mac[4], + req->dst_mac[5]); + skb->priority = CPL_PRIORITY_CONTROL; + t3c_send(dev, skb); + while (e->arpq_head) { + skb = e->arpq_head; + e->arpq_head = skb->next; + skb->next = NULL; + t3c_send(dev, skb); + } + e->arpq_tail = NULL; + e->state = L2T_STATE_VALID; + + return 0; +} + +/* + * Add a packet to an L2T entry's queue of packets awaiting resolution. + * Must be called with the entry's lock held.
+ */ +static inline void arpq_enqueue(struct l2t_entry *e, struct sk_buff *skb) +{ + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, e->neigh); + skb->next = NULL; + if (e->arpq_head) + e->arpq_tail->next = skb; + else + e->arpq_head = skb; + e->arpq_tail = skb; +} + +int t3_l2t_send_slow(struct t3cdev *dev, struct sk_buff *skb, + struct l2t_entry *e) +{ +again: + switch (e->state) { + case L2T_STATE_STALE: /* entry is stale, kick off revalidation */ + dprintk("%s %d STALE - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + neigh_event_send(e->neigh, NULL); + spin_lock_bh(&e->lock); + if (e->state == L2T_STATE_STALE) + e->state = L2T_STATE_VALID; + spin_unlock_bh(&e->lock); + case L2T_STATE_VALID: /* fast-path, send the packet on */ + return t3c_send(dev, skb); + case L2T_STATE_RESOLVING: + spin_lock_bh(&e->lock); + if (e->state != L2T_STATE_RESOLVING) { // ARP already completed + spin_unlock_bh(&e->lock); + goto again; + } + dprintk("%s %d RESOLVING - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + arpq_enqueue(e, skb); + spin_unlock_bh(&e->lock); + + /* + * Only the first packet added to the arpq should kick off + * resolution. However, because the alloc_skb below can fail, + * we allow each packet added to the arpq to retry resolution + * as a way of recovering from transient memory exhaustion. + * A better way would be to use a work request to retry L2T + * entries when there's no memory. 
+ */ + if (!neigh_event_send(e->neigh, NULL)) { + skb = alloc_skb(sizeof(struct cpl_l2t_write_req), + GFP_ATOMIC); + if (!skb) + break; + + spin_lock_bh(&e->lock); + if (e->arpq_head) + setup_l2e_send_pending(dev, skb, e); + else /* we lost the race */ + __kfree_skb(skb); + spin_unlock_bh(&e->lock); + } + } + return 0; +} +EXPORT_SYMBOL(t3_l2t_send_slow); + +void t3_l2t_send_event(struct t3cdev *dev, struct l2t_entry *e) +{ + dprintk("%s l2t %p neigh %p nud_state %x\n", __FUNCTION__, e, + e->neigh, e->neigh->nud_state); + +again: + switch (e->state) { + case L2T_STATE_STALE: /* entry is stale, kick off revalidation */ + dprintk("%s %d STALE - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + neigh_event_send(e->neigh, NULL); + spin_lock_bh(&e->lock); + if (e->state == L2T_STATE_STALE) { + e->state = L2T_STATE_VALID; + dprintk("%s STALE->VALID!\n", __FUNCTION__); + } + spin_unlock_bh(&e->lock); + return; + case L2T_STATE_VALID: /* fast-path, send the packet on */ + dprintk("%s %d VALID - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + return; + case L2T_STATE_RESOLVING: + spin_lock_bh(&e->lock); + if (e->state != L2T_STATE_RESOLVING) { // ARP already completed + spin_unlock_bh(&e->lock); + goto again; + } + dprintk("%s %d RESOLVING - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + spin_unlock_bh(&e->lock); + + /* + * Only the first packet added to the arpq should kick off + * resolution. However, because the alloc_skb below can fail, + * we allow each packet added to the arpq to retry resolution + * as a way of recovering from transient memory exhaustion. + * A better way would be to use a work request to retry L2T + * entries when there's no memory. + */ + neigh_event_send(e->neigh, NULL); + } + return; +} +EXPORT_SYMBOL(t3_l2t_send_event); + +/* + * Allocate a free L2T entry. Must be called with l2t_data.lock held. 
+ */ +static struct l2t_entry *alloc_l2e(struct l2t_data *d) +{ + struct l2t_entry *end, *e, **p; + + if (!atomic_read(&d->nfree)) + return NULL; + + /* there's definitely a free entry */ + for (e = d->rover, end = &d->l2tab[d->nentries]; e != end; ++e) + if (atomic_read(&e->refcnt) == 0) + goto found; + + for (e = &d->l2tab[1]; atomic_read(&e->refcnt); ++e) ; +found: + d->rover = e + 1; + atomic_dec(&d->nfree); + + /* + * The entry we found may be an inactive entry that is + * presently in the hash table. We need to remove it. + */ + if (e->state != L2T_STATE_UNUSED) { + int hash = arp_hash(e->addr, e->ifindex, d); + + for (p = &d->l2tab[hash].first; *p; p = &(*p)->next) + if (*p == e) { + *p = e->next; + break; + } + e->state = L2T_STATE_UNUSED; + } + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, e->neigh); + return e; +} + +/* + * Called when an L2T entry has no more users. The entry is left in the hash + * table since it is likely to be reused but we also bump nfree to indicate + * that the entry can be reallocated for a different neighbor. We also drop + * the existing neighbor reference in case the neighbor is going away and is + * waiting on our reference. + * + * Because entries can be reallocated to other neighbors once their ref count + * drops to 0 we need to take the entry's lock to avoid races with a new + * incarnation. + */ +void t3_l2e_free(struct l2t_data *d, struct l2t_entry *e) +{ + spin_lock_bh(&e->lock); + if (atomic_read(&e->refcnt) == 0) { /* hasn't been recycled */ + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, + e, e->neigh); + if (e->neigh) { + neigh_release(e->neigh); + e->neigh = NULL; + } + /* + * Don't need to worry about the arpq, an L2T entry can't be + * released if any packets are waiting for resolution as we + * need to be able to communicate with the TOE to close a + * connection. 
+ */ + } + spin_unlock_bh(&e->lock); + atomic_inc(&d->nfree); +} +EXPORT_SYMBOL(t3_l2e_free); + +/* + * Update an L2T entry that was previously used for the same next hop as neigh. + * Must be called with softirqs disabled. + */ +static inline void reuse_entry(struct l2t_entry *e, struct neighbour *neigh) +{ + unsigned int nud_state; + + spin_lock(&e->lock); /* avoid race with t3_l2t_free */ + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, neigh); + + if (neigh != e->neigh) + neigh_replace(e, neigh); + nud_state = neigh->nud_state; + if (memcmp(e->dmac, neigh->ha, sizeof(e->dmac)) || + !(nud_state & NUD_VALID)) + e->state = L2T_STATE_RESOLVING; + else if (nud_state & NUD_CONNECTED) + e->state = L2T_STATE_VALID; + else + e->state = L2T_STATE_STALE; + spin_unlock(&e->lock); +} + +struct l2t_entry *t3_l2t_get(struct t3cdev *dev, struct neighbour *neigh, + unsigned int smt_idx) +{ + struct l2t_entry *e; + struct l2t_data *d = L2DATA(dev); + u32 addr = *(u32 *) neigh->primary_key; + int ifidx = neigh->dev->ifindex; + int hash = arp_hash(addr, ifidx, d); + + write_lock_bh(&d->lock); + for (e = d->l2tab[hash].first; e; e = e->next) + if (e->addr == addr && e->ifindex == ifidx && + e->smt_idx == smt_idx) { + l2t_hold(d, e); + if (atomic_read(&e->refcnt) == 1) + reuse_entry(e, neigh); + goto done; + } + + /* Need to allocate a new entry */ + e = alloc_l2e(d); + if (e) { + spin_lock(&e->lock); /* avoid race with t3_l2t_free */ + e->next = d->l2tab[hash].first; + d->l2tab[hash].first = e; + e->state = L2T_STATE_RESOLVING; + e->addr = addr; + e->ifindex = ifidx; + e->smt_idx = smt_idx; + atomic_set(&e->refcnt, 1); + neigh_replace(e, neigh); + if (neigh->dev->priv_flags & IFF_802_1Q_VLAN) + e->vlan = VLAN_DEV_INFO(neigh->dev)->vlan_id; + else + e->vlan = VLAN_NONE; + spin_unlock(&e->lock); + } +done: + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, neigh); + write_unlock_bh(&d->lock); + return e; +} +EXPORT_SYMBOL(t3_l2t_get); + +/* + * Called when 
address resolution fails for an L2T entry to handle packets + * on the arpq head. If a packet specifies a failure handler it is invoked, + * otherwise the packet is sent to the TOE. + * + * XXX: maybe we should abandon the latter behavior and just require a failure + * handler. + */ +static void handle_failed_resolution(struct t3cdev *dev, struct sk_buff *arpq) +{ + while (arpq) { + struct sk_buff *skb = arpq; + struct l2t_skb_cb *cb = L2T_SKB_CB(skb); + + arpq = skb->next; + skb->next = NULL; + if (cb->arp_failure_handler) + cb->arp_failure_handler(dev, skb); + else + t3c_send(dev, skb); + } +} + +/* + * Called when the host's ARP layer makes a change to some entry that is + * loaded into the HW L2 table. + */ +void t3_l2t_update(struct t3cdev *dev, struct neighbour *neigh, int flags, struct net_device *lldev) +{ + struct l2t_entry *e; + struct sk_buff *arpq = NULL; + struct l2t_data *d = L2DATA(dev); + u32 addr = *(u32 *) neigh->primary_key; + int ifidx = neigh->dev->ifindex; + int hash = arp_hash(addr, ifidx, d); + + read_lock_bh(&d->lock); + for (e = d->l2tab[hash].first; e; e = e->next) + if (e->addr == addr && e->ifindex == ifidx) { + spin_lock(&e->lock); + goto found; + } + read_unlock_bh(&d->lock); + return; + +found: + dprintk("%s l2t %p neigh %p nud_state %x\n", __FUNCTION__, e, + e->neigh, e->neigh ? e->neigh->nud_state : -1); + read_unlock(&d->lock); + if (atomic_read(&e->refcnt)) { + if (neigh != e->neigh) + neigh_replace(e, neigh); + + if (e->state == L2T_STATE_RESOLVING) { + dprintk("%s %d RESOLVING - e %p neigh %p\n", + __FUNCTION__, __LINE__, e, e->neigh); + if (neigh->nud_state & NUD_FAILED) { + arpq = e->arpq_head; + e->arpq_head = e->arpq_tail = NULL; + } else if (neigh_is_connected(neigh)) + setup_l2e_send_pending(dev, NULL, e); + } else { + e->state = neigh_is_connected(neigh) ?
+ L2T_STATE_VALID : L2T_STATE_STALE; + dprintk("%s %d state -> %d - e %p neigh %p\n", + __FUNCTION__, __LINE__, e->state, + e, e->neigh); + if (memcmp(e->dmac, neigh->ha, 6)) + setup_l2e_send_pending(dev, NULL, e); + } + } + spin_unlock_bh(&e->lock); + + if (arpq) + handle_failed_resolution(dev, arpq); +} + +struct l2t_data *t3_init_l2t(unsigned int l2t_capacity) +{ + struct l2t_data *d; + int i, size = sizeof(*d) + l2t_capacity * sizeof(struct l2t_entry); + + d = t3_alloc_mem(size); + if (!d) + return NULL; + + d->nentries = l2t_capacity; + d->rover = &d->l2tab[1]; /* entry 0 is not used */ + atomic_set(&d->nfree, l2t_capacity - 1); + rwlock_init(&d->lock); + + for (i = 0; i < l2t_capacity; ++i) { + d->l2tab[i].idx = i; + d->l2tab[i].state = L2T_STATE_UNUSED; + spin_lock_init(&d->l2tab[i].lock); + atomic_set(&d->l2tab[i].refcnt, 0); + } + return d; +} + +void t3_free_l2t(struct l2t_data *d) +{ + t3_free_mem(d); +} + +#ifdef CONFIG_PROC_FS +#include +#include +#include + +static inline void *l2t_get_idx(struct seq_file *seq, loff_t pos) +{ + struct l2t_data *d = seq->private; + + return pos >= d->nentries ? NULL : &d->l2tab[pos]; +} + +static void *l2t_seq_start(struct seq_file *seq, loff_t *pos) +{ + return *pos ? l2t_get_idx(seq, *pos) : SEQ_START_TOKEN; +} + +static void *l2t_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + v = l2t_get_idx(seq, *pos + 1); + if (v) + ++*pos; + return v; +} + +static void l2t_seq_stop(struct seq_file *seq, void *v) +{ +} + +static char l2e_state(const struct l2t_entry *e) +{ + switch (e->state) { + case L2T_STATE_VALID: return 'V'; /* valid, fast-path entry */ + case L2T_STATE_STALE: return 'S'; /* needs revalidation, but usable */ + case L2T_STATE_RESOLVING: + return e->arpq_head ? 
+			'A' : 'R';
+	default:
+		return 'U';
+	}
+}
+
+static int l2t_seq_show(struct seq_file *seq, void *v)
+{
+	if (v == SEQ_START_TOKEN)
+		seq_puts(seq, "Index IP address Ethernet address VLAN "
+			 "Prio State Users SMTIDX Port\n");
+	else {
+		char ip[20];
+		struct l2t_entry *e = v;
+
+		spin_lock_bh(&e->lock);
+		sprintf(ip, "%u.%u.%u.%u", NIPQUAD(e->addr));
+		seq_printf(seq, "%-5u %-15s %02x:%02x:%02x:%02x:%02x:%02x %4d"
+			   " %3u %c %7u %4u %s\n",
+			   e->idx, ip, e->dmac[0], e->dmac[1], e->dmac[2],
+			   e->dmac[3], e->dmac[4], e->dmac[5],
+			   e->vlan & VLAN_VID_MASK, vlan_prio(e),
+			   l2e_state(e), atomic_read(&e->refcnt), e->smt_idx,
+			   e->neigh ? e->neigh->dev->name : "");
+		spin_unlock_bh(&e->lock);
+	}
+	return 0;
+}
+
+static struct seq_operations l2t_seq_ops = {
+	.start = l2t_seq_start,
+	.next = l2t_seq_next,
+	.stop = l2t_seq_stop,
+	.show = l2t_seq_show
+};
+
+static int l2t_seq_open(struct inode *inode, struct file *file)
+{
+	int rc = seq_open(file, &l2t_seq_ops);
+
+	if (!rc) {
+		struct proc_dir_entry *dp = PDE(inode);
+		struct seq_file *seq = file->private_data;
+
+		seq->private = dp->data;
+	}
+	return rc;
+}
+
+static struct file_operations l2t_seq_fops = {
+	.owner = THIS_MODULE,
+	.open = l2t_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+/*
+ * Create the proc entries for the L2 table under dir.
+ */
+int t3_l2t_proc_setup(struct proc_dir_entry *dir, struct l2t_data *d)
+{
+	struct proc_dir_entry *p;
+
+	if (!dir)
+		return -EINVAL;
+
+	p = create_proc_entry("l2t", S_IRUGO, dir);
+	if (!p)
+		return -ENOMEM;
+
+	p->proc_fops = &l2t_seq_fops;
+	p->data = d;
+	return 0;
+}
+
+void t3_l2t_proc_free(struct proc_dir_entry *dir)
+{
+	if (dir)
+		remove_proc_entry("l2t", dir);
+}
+#endif
diff --git a/drivers/infiniband/hw/cxgb3/t3c/l2t.h b/drivers/infiniband/hw/cxgb3/t3c/l2t.h
new file mode 100644
index 0000000..48a247f
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/t3c/l2t.h
@@ -0,0 +1,147 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc.
All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef _CHELSIO_L2T_H
+#define _CHELSIO_L2T_H
+
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#include "t3cdev.h"
+#include <asm/atomic.h>
+
+enum {
+	L2T_STATE_VALID,	/* entry is up to date */
+	L2T_STATE_STALE,	/* entry may be used but needs revalidation */
+	L2T_STATE_RESOLVING,	/* entry needs address resolution */
+	L2T_STATE_UNUSED	/* entry not in use */
+};
+
+struct neighbour;
+struct sk_buff;
+
+/*
+ * Each L2T entry plays multiple roles.  First of all, it keeps state for the
+ * corresponding entry of the HW L2 table and maintains a queue of offload
+ * packets awaiting address resolution.
Second, it is a node of a hash table
+ * chain, where the nodes of the chain are linked together through their next
+ * pointer.  Finally, each node is a bucket of a hash table, pointing to the
+ * first element in its chain through its first pointer.
+ */
+struct l2t_entry {
+	u16 state;			/* entry state */
+	u16 idx;			/* entry index */
+	u32 addr;			/* dest IP address */
+	int ifindex;			/* neighbor's net_device's ifindex */
+	u16 smt_idx;			/* SMT index */
+	u16 vlan;			/* VLAN TCI (id: bits 0-11, prio: 13-15) */
+	struct neighbour *neigh;	/* associated neighbour */
+	struct l2t_entry *first;	/* start of hash chain */
+	struct l2t_entry *next;		/* next l2t_entry on chain */
+	struct sk_buff *arpq_head;	/* queue of packets awaiting resolution */
+	struct sk_buff *arpq_tail;
+	spinlock_t lock;
+	atomic_t refcnt;		/* entry reference count */
+	u8 dmac[6];			/* neighbour's MAC address */
+};
+
+struct l2t_data {
+	unsigned int nentries;		/* number of entries */
+	struct l2t_entry *rover;	/* starting point for next allocation */
+	atomic_t nfree;			/* number of free entries */
+	rwlock_t lock;
+	struct l2t_entry l2tab[0];
+};
+
+typedef void (*arp_failure_handler_func)(struct t3cdev *dev,
+					 struct sk_buff *skb);
+
+/*
+ * Callback stored in an skb to handle address resolution failure.
+ */
+struct l2t_skb_cb {
+	arp_failure_handler_func arp_failure_handler;
+};
+
+#define L2T_SKB_CB(skb) ((struct l2t_skb_cb *)(skb)->cb)
+
+static inline void set_arp_failure_handler(struct sk_buff *skb,
+					   arp_failure_handler_func hnd)
+{
+	L2T_SKB_CB(skb)->arp_failure_handler = hnd;
+}
+
+/*
+ * Getting to the L2 data from a toe device.
+ */
+#define L2DATA(dev) ((dev)->l2opt)
+
+void t3_l2e_free(struct l2t_data *d, struct l2t_entry *e);
+void t3_l2t_update(struct t3cdev *dev, struct neighbour *neigh, int flags, struct net_device *lldev);
+struct l2t_entry *t3_l2t_get(struct t3cdev *dev, struct neighbour *neigh,
+			     unsigned int smt_idx);
+int t3_l2t_send_slow(struct t3cdev *dev, struct sk_buff *skb,
+		     struct l2t_entry *e);
+void t3_l2t_send_event(struct t3cdev *dev, struct l2t_entry *e);
+struct l2t_data *t3_init_l2t(unsigned int l2t_capacity);
+void t3_free_l2t(struct l2t_data *d);
+
+#ifdef CONFIG_PROC_FS
+int t3_l2t_proc_setup(struct proc_dir_entry *dir, struct l2t_data *d);
+void t3_l2t_proc_free(struct proc_dir_entry *dir);
+#else
+#define l2t_proc_setup(dir, d) 0
+#define l2t_proc_free(dir)
+#endif
+
+int t3c_send(struct t3cdev *dev, struct sk_buff *skb);
+
+static inline int l2t_send(struct t3cdev *dev, struct sk_buff *skb,
+			   struct l2t_entry *e)
+{
+	if (likely(e->state == L2T_STATE_VALID))
+		return t3c_send(dev, skb);
+	return t3_l2t_send_slow(dev, skb, e);
+}
+
+static inline void l2t_release(struct l2t_data *d, struct l2t_entry *e)
+{
+	if (atomic_dec_and_test(&e->refcnt))
+		t3_l2e_free(d, e);
+}
+
+static inline void l2t_hold(struct l2t_data *d, struct l2t_entry *e)
+{
+	if (atomic_add_return(1, &e->refcnt) == 1)	/* 0 -> 1 transition */
+		atomic_dec(&d->nfree);
+}
+
+#endif

From mamidala at cse.ohio-state.edu  Fri Jun 23 07:28:15 2006
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Fri, 23 Jun 2006 10:28:15 -0400 (EDT)
Subject: [openib-general] mckey program
Message-ID: 

Hi,

I was checking the mckey.c program for IB. I did a quick check and found
that the rdma_resolve_addr function is invoking the cma_handler with an
erroneous event:

mckey: event: 1, error: -19

Is there any easy way to check what might be happening?
Thanks,
Amith

---------- Forwarded message ----------
Date: Thu, 22 Jun 2006 09:45:26 -0400 (EDT)
From: amith rajith mamidala
To: Hal Rosenstock
Cc: Sean Hefty
Subject: Re: Multicast Addresses

Hi Hal,

The IPoIB interface is started; I can ping other nodes using it. I have
also tried 224.0.0.1 but the error is still the same. Is there any other
set of programs that can verify the desired set-up is in place?

Thanks,
Amith

On 21 Jun 2006, Hal Rosenstock wrote:

> Hi again Amith,
>
> On Wed, 2006-06-21 at 17:40, amith rajith mamidala wrote:
> > Hi Sean,
> >
> > I did a quick test of the mckey program with the following inputs for the
> > receiver:
> > --> mckey recv 224.0.0.0
> >
> > I got the following error, with the receiver returning even though the
> > sender was not called. I wanted to double-check with you that I am giving
> > the correct options.
> >
> > cmatose: starting client
> > cmatose: joining
> > cmatose: event: 1, error: -19
> > test complete
> > return status -19
>
> -19 is -ENODEV. Do you have an IPoIB interface started? I see lots of
> other reasons in the library this might be returned too.
>
> 224.0.0.0 is a base reserved address.
>
> Can you try 224.0.0.1 (all systems on subnet)?
>
> -- Hal
>
> >
> > Thanks,
> > Amith
> >
> >
> > On 21 Jun 2006, Hal Rosenstock wrote:
> >
> > > Hi Amith,
> > >
> > > On Wed, 2006-06-21 at 16:41, amith rajith mamidala wrote:
> > > > Hi,
> > > >
> > > > I had a basic question. How do we specify the multicast addresses while
> > > > using rdma_cm? I am looking at the mckey program.
> > >
> > > The syntax is:
> > >
> > > mckey {s[end] | r[ecv]} mcast_addr [bind_addr]
> > >
> > > I think that mcast_addr is an IP address, as is the bind_addr. bind_addr
> > > is a unicast IP address whereas mcast_addr is a multicast one.
> > >
> > > It's Sean's test program so I added him to this.
> > >
> > > -- Hal
> > >
> > > >
> > > > Thanks,
> > > > Amith
> > > >
> > >

From paul.lundin at gmail.com  Fri Jun 23 07:44:46 2006
From: paul.lundin at gmail.com (Paul)
Date: Fri, 23 Jun 2006 10:44:46 -0400
Subject: [openib-general] OFED-1.0 fails install on AMD64
In-Reply-To: <449BAACE.6000609@mellanox.co.il>
References: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com>
	<449BAACE.6000609@mellanox.co.il>
Message-ID: 

Eitan,

Anything using version 4 of gcc should (could?) have the same problem.
If you google the "relocation R_X86_64_32 against" section of the error
you will see a good many people with the same or similar issues (not on
OFED, but on many other things). I do not believe the issue lies with
OFED in this instance, though I could be wrong.

Regards.

On 6/23/06, Eitan Zahavi wrote:
>
> Hi Don,
>
> Sorry for my late response. ibutils compilation (of libibdmcom) is
> breaking with the error message:
>
> > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be
> > used when making a shared object; recompile with -fPIC
>
> For the command:
> > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2
> > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe
> > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o
> > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1"
> > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo
> > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo
> > g++ -shared -nostdlib
>
> So obviously one has to figure out why -shared did not cause -fPIC.
> It is also not clear why this does not break on other machines. Anyway,
> reproducing the problem is my first target.
>
> One obvious thing to try is to set CFLAGS=-fPIC
>
> As I do not have access to the exact type of your machine (FSM Labs v
> 2.2.3 with the 2.6.16 kernel), and as the weekend has started over here,
> I guess I will be able to reproduce it only Sun/Mon.
> > Eitan > > Don Snedigar wrote: > > I just downloaded the OFED-1.0 and the install was going fine until > > ibutils. At that point, the install fails with : > > > > Open MPI RPM will be created during the installation process > > > > > > Building ibutils RPM. Please wait... > > > > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define > > 'configure_options --prefix=/usr/local/ofed > > --mandir=/usr/local/ofed/share/man > > --cache-file=/var/tmp/OFED/ibutils.cache > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > > - > > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > > --mandir=/usr/local/ofed/share/man > > --cache-file=/var/tmp/OFED/ibutils.cache > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > > > See log file: /tmp/OFED.28656.log > > > > > > I dug down into the log file it indicates and found : > > > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > > - -o .libs/ibnl_scanner.o > > ibnl_scanner.ll: In function 'int ibnl_lex()': > > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > > warn_unused_result > > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o > > ibnl_scanner.o >/dev/null 2>&1 > > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > > g++ -shared -nostdlib > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o > > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 > > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o > > .libs/libibdmcom.so.1.1.1 > > /usr/bin/ld: > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > > used when making a shared object; recompile with -fPIC > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > > symbols: Bad value > > collect2: ld returned 1 exit status > > make[3]: *** [libibdmcom.la] Error 1 > > make[3]: Leaving directory > > `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm/datamodel' > > make[2]: *** [all-recursive] Error 1 > > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > > make[1]: *** [all] Error 2 > > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm' > > make: *** [all-recursive] Error 1 > > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > > > > RPM build errors: > > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > > --mandir=/usr/local/ofed/share/man > > --cache-file=/var/tmp/OFED/ibutils.cache > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > > > Can anyone shed any light on this ? > > > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > > > Don Snedigar > > Calpont Corp. 
> > 214-618-9516
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dsnedigar at calpont.com  Fri Jun 23 07:51:05 2006
From: dsnedigar at calpont.com (Don Snedigar)
Date: Fri, 23 Jun 2006 09:51:05 -0500
Subject: [openib-general] OFED-1.0 fails install on AMD64
Message-ID: <8953B8331AA98041B0C11DBC678AFC0816AB2D@srvemail1.calpont.com>

Agreed, Paul. Google turns up hundreds, if not thousands, of hits about
this. It's not an OFED problem...

I was able to resolve the problem late last night by upgrading the
compiler to gcc-4.0.2.

Thanks for all the help though!

Don

________________________________

From: Paul [mailto:paul.lundin at gmail.com]
Sent: Friday, June 23, 2006 9:45 AM
To: Eitan Zahavi
Cc: Don Snedigar; openib-general at openib.org
Subject: Re: [openib-general] OFED-1.0 fails install on AMD64

Eitan,

Anything using version 4 of gcc should (could?) have the same problem.
If you google the "relocation R_X86_64_32 against" section of the error
you will see a good many people with the same or similar issues (not on
OFED, but on many other things). I do not believe the issue lies with
OFED in this instance, though I could be wrong.

Regards.

On 6/23/06, Eitan Zahavi <eitan at mellanox.co.il> wrote:

Hi Don,

Sorry for my late response.
ibutils compilation (of libibdmcom) is breaking with the error message: > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC For the command: > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib So obviously one has to figure out why -shared did not cause -fPIC ? Also not clear why this does not break on other machines. Anyways, reproducing the problem is my first target. One obvious thing to try is to set CFLAGS=-fPIC As I do not have access to the exact type of your machine : FSM Labs v 2.2.3 with the 2.6.16 kernel (as the weekend started over hear) I guess I will be able to reproduce only Sun/Mon. Eitan Don Snedigar wrote: > I just downloaded the OFED-1.0 and the install was going fine until > ibutils. At that point, the install fails with : > > Open MPI RPM will be created during the installation process > > > Building ibutils RPM. Please wait... 
> > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define > 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > - > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > See log file: /tmp/OFED.28656.log > > > I dug down into the log file it indicates and found : > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > - -o .libs/ibnl_scanner.o > ibnl_scanner.ll: In function 'int ibnl_lex()': > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > warn_unused_result > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o > ibnl_scanner.o >/dev/null 2>&1 > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o > .libs/libibdmcom.so.1.1.1 > /usr/bin/ld: > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > symbols: Bad value > collect2: ld returned 1 exit status > make[3]: *** [libibdmcom.la] Error 1 > make[3]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm/datamodel' > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm' > make: *** [all-recursive] Error 1 > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > RPM build errors: > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > Can anyone shed any light on this ? > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > Don Snedigar > Calpont Corp. 
> 214-618-9516 > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.lundin at gmail.com Fri Jun 23 07:54:48 2006 From: paul.lundin at gmail.com (Paul) Date: Fri, 23 Jun 2006 10:54:48 -0400 Subject: [openib-general] OFED-1.0 fails install on AMD64 In-Reply-To: <8953B8331AA98041B0C11DBC678AFC0816AB2D@srvemail1.calpont.com> References: <8953B8331AA98041B0C11DBC678AFC0816AB2D@srvemail1.calpont.com> Message-ID: Your welcome. Good to hear that you got it working. On 6/23/06, Don Snedigar wrote: > > Agreed Paul. Google turns up hundreds, if not thousands, of hits about > this. Its not an OFED problem... > > I was able to resolve the problem late last night by upgrading the > compiler to gcc-4.0.2. > > Thanks for all the help though! > > Don > > ------------------------------ > *From:* Paul [mailto:paul.lundin at gmail.com] > *Sent:* Friday, June 23, 2006 9:45 AM > *To:* Eitan Zahavi > *Cc:* Don Snedigar; openib-general at openib.org > > *Subject:* Re: [openib-general] OFED-1.0 fails install on AMD64 > > Eitan, > Anything using version 4 of gcc should (could ?) have the same problem. > If you google the "relocation R_X86_64_32 against" section of the error > you will see a good deal of people with the same/similar issues (not on > OFED, but on many other things). I do not belive the issue lies with OFED in > this instance. Though I could be wrong. > > Regards. 
> > On 6/23/06, Eitan Zahavi < eitan at mellanox.co.il> wrote: > > > > Hi Don, > > > > Sorry for my late response. ibutils compilation (of libibdmcom) is > > breaking with the > > error message: > > > > > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not > > be > > > used when making a shared object; recompile with -fPIC > > > > For the command: > > > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > > > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > > > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > > > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > > > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > > > g++ -shared -nostdlib > > > > So obviously one has to figure out why -shared did not cause -fPIC ? > > Also not clear why this does not break on other machines. Anyways, > > reproducing the problem is my first target. > > > > One obvious thing to try is to set CFLAGS=-fPIC > > > > As I do not have access to the exact type of your machine : FSM Labs v > > 2.2.3 with the 2.6.16 kernel > > (as the weekend started over hear) I guess I will be able to reproduce > > only Sun/Mon. > > > > Eitan > > > > Don Snedigar wrote: > > > I just downloaded the OFED-1.0 and the install was going fine until > > > ibutils. At that point, the install fails with : > > > > > > Open MPI RPM will be created during the installation process > > > > > > > > > Building ibutils RPM. Please wait... 
> > > > > > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' > > --define > > > 'configure_options --prefix=/usr/local/ofed > > > --mandir=/usr/local/ofed/share/man > > > --cache-file=/var/tmp/OFED/ibutils.cache > > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > > > - > > > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > > > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > > > > > --mandir=/usr/local/ofed/share/man > > > --cache-file=/var/tmp/OFED/ibutils.cache > > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > > > > > See log file: /tmp/OFED.28656.log > > > > > > > > > I dug down into the log file it indicates and found : > > > > > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > > > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > > > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > > > - -o .libs/ibnl_scanner.o > > > ibnl_scanner.ll: In function 'int ibnl_lex()': > > > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > > > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > > > warn_unused_result > > > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > > > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > > > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > > -o > > > ibnl_scanner.o >/dev/null 2>&1 > > > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > > > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > > > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > > > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > > > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > > > g++ -shared -nostdlib > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > > > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > > > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o > > .libs/ibnl_parser.o > > > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > > > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > > > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 > > > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > > > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o > > > .libs/libibdmcom.so.1.1.1 > > > /usr/bin/ld: > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > > > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not > > be > > > used when making a shared object; recompile with -fPIC > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > > > symbols: Bad value > > > collect2: ld returned 1 exit status > > > make[3]: *** [libibdmcom.la] Error 1 > > > make[3]: Leaving directory > > > `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm/datamodel' > > > make[2]: *** [all-recursive] Error 1 > > > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > > > make[1]: *** [all] Error 2 > > > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm' > > > make: *** [all-recursive] Error 1 > > > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > > > > > > > RPM build errors: > > > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > > > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > > > --mandir=/usr/local/ofed/share/man > > > --cache-file=/var/tmp/OFED/ibutils.cache > > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > > > > > Can anyone shed any light on this ? > > > > > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > > > > > Don Snedigar > > > Calpont Corp. 
> > > 214-618-9516 > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bpradip at in.ibm.com Fri Jun 23 09:09:50 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 21:39:50 +0530 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early In-Reply-To: <1151071344.7808.42.camel@stevo-desktop> References: <20060622192259.GA24588@harry-potter.ibm.com> <1151007847.3040.51.camel@stevo-desktop> <449BE393.3020308@in.ibm.com> <1151071344.7808.42.camel@stevo-desktop> Message-ID: <449C124E.4050308@in.ibm.com> Steve Wise wrote: > On Fri, 2006-06-23 at 18:20 +0530, Pradipta Kumar Banerjee wrote: >> Steve Wise wrote: >>> The goal of adding the return codes was so that the rping program could >>> exit with a status indicating success or failure. Every rping run >>> results in a DISCONNECT event, so I don't think we want to treat that >>> case as an error. >> DISCONNECT event will be generated when the connection is closed or in case of >> some error (like CCAE_LLP_CONNECTION_LOST, CCAE_BAD_CLOSE in case of Ammasso >> driver etc). > > You'll also get the DISCONNECT event when one side finished the rping > loops and does rdma_disconnect(). So receiving that event isn't > necessarily an error... 
Yes definitely, but this event can _also_ be received due to errors!! > > >>> Also, can you explain why this fixes Amith's problem, which sounded like >>> a process was hanging? >>> >> On debugging I found that the main thread was blocked in ibv_destroy_cq(), >> cm_thread was blocked in rdma_get_cm_event->write() and cq_thread was blocked in >> ibv_get_cq_event->read >> Taking the return value of the DISCONNECT event into consideration forcefully >> killed the process. >> On delving deeper into this problem, I think that there is more to this rping >> hang. Let me work on this further. >> > > I think rping needs some coordination on these threads and when they > should be killed. > Right.. Thanks, Pradipta >> On a related note - I noticed another rping hang in the following case >> - Start the rping as a client without first starting an rping server >> - If you are lucky the first run itself will result in the 'lt-rping' process in >> 'D' state. If not repeating the procedure will result in the hang. >> >> This is the o/p. >> >> cq completion failed status 5 >> wait for CONNECTED state 10 >> connect error -1 >> >> Thanks, >> Pradipta. >> >>> Thanks, >>> >>> Steve. >>> >>> >>> >>> On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: From iod00d at hp.com Fri Jun 23 10:14:57 2006 From: iod00d at hp.com (Grant Grundler) Date: Fri, 23 Jun 2006 10:14:57 -0700 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. 
In-Reply-To: <1151071471.3204.12.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> <1151071005.7808.39.camel@stevo-desktop> <1151071471.3204.12.camel@laptopd505.fenrus.org> Message-ID: <20060623171457.GA3610@esmail.cup.hp.com> On Fri, Jun 23, 2006 at 04:04:31PM +0200, Arjan van de Ven wrote: > > I thought the posted write WILL eventually get to adapter memory. Not > > stall forever cached in a bridge. I'm wrong? > > I'm not sure there is a theoretical upper bound.... I'm not aware of one either since MMIO writes can travel across many other chips that are not constrained by PCI ordering rules (I'm thinking of SGI Altix...) > (and if it's several msec per bridge, then you have a lot of latency > anyway) That's what my original concern was when I saw you point this out. But MMIO reads here would be expensive and many drivers tolerate this latency in exchange for avoiding the MMIO read in the performance path. grant From Don.Albert at Bull.com Fri Jun 23 10:14:31 2006 From: Don.Albert at Bull.com (Don.Albert at Bull.com) Date: Fri, 23 Jun 2006 10:14:31 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot Message-ID: Short of uninstalling the OFED-1.0 release, how can I stop the Infiniband related kernel modules from loading at system boot? I am trying to debug a problem with programs hanging in the kernel, so I thought that I would try manually loading the modules one at a time to see if I could isolate the problem. This is on a RHEL4 U3 system with the 2.6.16 kernel and the OFED-1.0 release installed. I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" services in the /etc/rc.d/ runlevel files. I even removed "ifcfg-ib0" and "ifcfg-ib1" from the /etc/sysconfig/networking-scripts directory. 
I don't see any other scripts that would cause these modules to be loaded. But every time I reboot, I get the following modules loaded, according to /sbin/lsmod: ib_mthca 117424 0 ib_mad 35896 1 ib_mthca ib_core 45952 2 ib_mthca,ib_mad What have I missed? -Don Albert- -------------- next part -------------- An HTML attachment was scrubbed... URL: From boris at mellanox.com Fri Jun 23 10:57:38 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Fri, 23 Jun 2006 10:57:38 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot Message-ID: <1E3DCD1C63492545881FACB6063A57C13242D1@mtiexch01.mti.com> Hi Don, I believe you need to disable the "hotplug" loading of the Infiniband drivers by putting the modules you have listed into /etc/hotplug/blacklist file. Please, let me know if this helped. Regards, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Don.Albert at Bull.com Sent: Friday, June 23, 2006 10:15 AM To: openfabrics-ewg at openib.org; openib-general at openib.org Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot Short of uninstalling the OFED-1.0 release, how can I stop the Infiniband related kernel modules from loading at system boot? I am trying to debug a problem with programs hanging in the kernel, so I thought that I would try manually loading the modules one at a time to see if I could isolate the problem. This is on a RHEL4 U3 system with the 2.6.16 kernel and the OFED-1.0 release installed. I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" services in the /etc/rc.d/ runlevel files. I even removed "ifcfg-ib0" and "ifcfg-ib1" from the /etc/sysconfig/networking-scripts directory. 
I don't see any other scripts that would cause these modules to be loaded. But every time I reboot, I get the following modules loaded, according to /sbin/lsmod: ib_mthca 117424 0 ib_mad 35896 1 ib_mthca ib_core 45952 2 ib_mthca,ib_mad What have I missed? -Don Albert- -------------- next part -------------- An HTML attachment was scrubbed... URL: From pw at osc.edu Fri Jun 23 11:19:59 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 23 Jun 2006 14:19:59 -0400 Subject: [openib-general] amso userspace Message-ID: <20060623181959.GA21488@osc.edu> Having seen your four patchsets recently, I thought I'd give amso in openib another shot, r8187 is what I'm looking at now. Here's a few questions for you. (I did ccflash2 to the fw in ogc kit 20060308 already, and use the boot_image in there too.) Should I expect to be able to use the kernel directories in branches/iwarp directly with linux-2.6.17.1? It looks like your branch may be out of date with respect to trunk for a few files. I used it anyway and it does seem to build and run. In the userspace source, amso_create_qp limits max_send_sge and max_recv_sge to 4. Is this really the hardware limit? It seems quite low. Should I expect the examples in branches/iwarp/src/userspace/libibverbs/examples to work? I was hoping to use rc_pingpong.c as a way to understand what was going wrong with my code, but it exits when it finds that its local lid is zero (line 578). 
One spot in my code where I'm trying to understand why libamso errors is this transition to RTS (not using rdmacm, just bringing it up by hand): /* transition qp to ready-to-send */ mask = IBV_QP_STATE | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY; memset(&attr, 0, sizeof(attr)); attr.qp_state = IBV_QPS_RTS; attr.sq_psn = 0; attr.max_rd_atomic = 1; attr.timeout = 26; /* 4.096us * 2^26 = 5 min */ attr.retry_cnt = 20; attr.rnr_retry = 20; ret = ibv_modify_qp(qp, &attr, mask); if (ret) error_xerrno(ret, "%s: ibv_modify_qp RTR -> RTS", __func__); The return value is 11, EAGAIN. With C2_DEBUG on, the kernel says to the console: c2: c2_qp_modify:145 qp=ffff81003e71f180, IB_QPS_RTR --> IB_QPS_RTS c2: c2_qp_modify: c2_errno=-11 c2: c2_qp_modify:243 qp=ffff81003e71f180, cur_state=IB_QPS_RTR I'm guessing one of those values must be off, but can't see where anything is enforced in the lib or kernel driver. Some of these fields don't make sense for non-IB fabrics. Just using a mask of IBV_QP_STATE caused the same return value. Can you see the problem right off? (This code does work fine on mthca.) Thanks, -- Pete From swise at opengridcomputing.com Fri Jun 23 11:26:24 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 13:26:24 -0500 Subject: [openib-general] amso userspace In-Reply-To: <20060623181959.GA21488@osc.edu> References: <20060623181959.GA21488@osc.edu> Message-ID: <1151087184.7808.67.camel@stevo-desktop> > Should I expect to be able to use the kernel directories in branches/iwarp > directly with linux-2.6.17.1? It looks like your branch may be out > of date with respect to trunk for a few files. I used it anyway and > it does seem to build and run. I haven't tried branches/iwarp with a 2.6.17 kernel. It works fine with 2.6.16 though, and I expect it to work fine in 2.6.17. The branch is a snapshot of the main trunk and we only update it occasionally. 
> > In the userspace source, amso_create_qp limits max_send_sge and > max_recv_sge to 4. Is this really the hardware limit? It seems > quite low. > Yep. That's a HW limit. > Should I expect the examples in > branches/iwarp/src/userspace/libibverbs/examples to work? I was > hoping to use rc_pingpong.c as a way to understand what was going > wrong with my code, but it exits when it finds that its local lid is > zero (line 578). Those examples only work for IB transports. The examples in librdma/examples will run over iwarp because they utilize the RDMA CMA. > > One spot in my code where I'm trying to understand why libamso > errors is this transition to RTS (not using rdmacm, just bringing > it up by hand): > > /* transition qp to ready-to-send */ > mask = > IBV_QP_STATE > | IBV_QP_SQ_PSN > | IBV_QP_MAX_QP_RD_ATOMIC > | IBV_QP_TIMEOUT > | IBV_QP_RETRY_CNT > | IBV_QP_RNR_RETRY; > memset(&attr, 0, sizeof(attr)); > attr.qp_state = IBV_QPS_RTS; > attr.sq_psn = 0; > attr.max_rd_atomic = 1; > attr.timeout = 26; /* 4.096us * 2^26 = 5 min */ > attr.retry_cnt = 20; > attr.rnr_retry = 20; > ret = ibv_modify_qp(qp, &attr, mask); > if (ret) > error_xerrno(ret, "%s: ibv_modify_qp RTR -> RTS", __func__); > > The return value is 11, EAGAIN. > > With C2_DEBUG on, the kernel says to the console: > > c2: c2_qp_modify:145 qp=ffff81003e71f180, IB_QPS_RTR --> IB_QPS_RTS > c2: c2_qp_modify: c2_errno=-11 > c2: c2_qp_modify:243 qp=ffff81003e71f180, cur_state=IB_QPS_RTR > > I'm guessing one of those values must be off, but can't see where > anything is enforced in the lib or kernel driver. Some of these > fields don't make sense for non-IB fabrics. Just using a mask of > IBV_QP_STATE caused the same return value. Can you see the problem > right off? (This code does work fine on mthca.) > You need to use librdmacm to setup iwarp connections. That's the only way it will work for the amso device. See librdma/examples. 
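To make the distinction concrete for readers following the thread: with the RDMA CMA, the application never issues the RTR/RTS transitions itself the way the hand-rolled ibv_modify_qp() code above does — rdma_connect()/rdma_accept() drive the QP state machine from CM events. A rough pseudocode sketch of the client-side sequence, modeled loosely on the cmatose and rping examples (error handling, attribute setup, and event-channel plumbing omitted; treat the exact signatures as approximate rather than an API reference):

```
/* pseudocode -- see librdma/examples for the real, complete versions */
channel = rdma_create_event_channel();
rdma_create_id(channel, &cm_id, app_context);
rdma_resolve_addr(cm_id, NULL, server_sockaddr, timeout_ms);
wait_for_event(channel, RDMA_CM_EVENT_ADDR_RESOLVED);
rdma_resolve_route(cm_id, timeout_ms);
wait_for_event(channel, RDMA_CM_EVENT_ROUTE_RESOLVED);
/* allocate PD and CQ, then let the CMA own the QP transitions */
rdma_create_qp(cm_id, pd, &qp_init_attr);
rdma_connect(cm_id, &conn_param);
wait_for_event(channel, RDMA_CM_EVENT_ESTABLISHED);
/* ... post sends/recvs, poll the CQ ... */
rdma_disconnect(cm_id);
```

Because the transport-specific state transitions happen inside the library, the same application code can run unchanged over both mthca and the amso device.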
I also posted a patch to perftest/rdma_lat.c and rdma_bw.c that added a -c option to utilize the RDMA CMA. The patch didn't get pulled in, however... Steve. From krause at cup.hp.com Fri Jun 23 11:02:31 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 23 Jun 2006 11:02:31 -0700 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <20060623171457.GA3610@esmail.cup.hp.com> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> <1151071005.7808.39.camel@stevo-desktop> <1151071471.3204.12.camel@laptopd505.fenrus.org> <20060623171457.GA3610@esmail.cup.hp.com> Message-ID: <6.2.0.14.2.20060623105755.0201e650@esmail.cup.hp.com> At 10:14 AM 6/23/2006, Grant Grundler wrote: >On Fri, Jun 23, 2006 at 04:04:31PM +0200, Arjan van de Ven wrote: > > > I thought the posted write WILL eventually get to adapter memory. Not > > > stall forever cached in a bridge. I'm wrong? > > > > I'm not sure there is a theoretical upper bound.... > >I'm not aware of one either since MMIO writes can travel >across many other chips that are not constrained by >PCI ordering rules (I'm thinking of SGI Altix...) It is processor / coherency backplane technology specific as to the number of outstanding writes. There is also no guarantee that such writes will hit the top of the PCI hierarchy in the order they were posted in a multi-core / processor system. Hence, it is up to software to guarantee that ordering is preserved and to not assume anything about ordering from a hardware perspective. Once a transaction hits the PCI hierarchy, then the PCI ordering rules apply and depending upon the transaction type and other rules, what is guaranteed is deterministic in nature. 
> > (and if it's several msec per bridge, then you have a lot of latency > > anyway) > >That's what my original concern was when I saw you point this out. >But MMIO reads here would be expensive and many drivers tolerate >this latency in exchange for avoiding the MMIO read in the >performance path. As the saying goes, MMIO Reads are "pure evil" and should be avoided at all costs if performance is the goal. Even in a relatively flat I/O hierarchy, the additional latency is non-trivial and can lead to a significant loss in performance for the system. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From dledford at redhat.com Fri Jun 23 11:44:46 2006 From: dledford at redhat.com (Doug Ledford) Date: Fri, 23 Jun 2006 14:44:46 -0400 Subject: [openib-general] [openfabrics-ewg] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: References: Message-ID: <1151088287.22762.14.camel@fc5.xsintricity.com> On Fri, 2006-06-23 at 10:14 -0700, Don.Albert at Bull.com wrote: > > Short of uninstalling the OFED-1.0 release, how can I stop the > Infiniband related kernel modules from loading at system boot? > > I am trying to debug a problem with programs hanging in the kernel, > so I thought that I would try manually loading the modules one at a > time to see if I could isolate the problem. This is on a RHEL4 U3 > system with the 2.6.16 kernel and the OFED-1.0 release installed. > > I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" > services in the /etc/rc.d/runlevel files. I even removed "ifcfg-ib0" > and "ifcfg-ib1" from the /etc/sysconfig/networking-scriptsdirectory. > I don't see any other scripts that would cause these modules to be > loaded. But every time I reboot, I get the following modules loaded, > according to /sbin/lsmod: > > ib_mthca 117424 0 > ib_mad 35896 1 ib_mthca > ib_core 45952 2 ib_mthca,ib_mad > > What have I missed? 
/etc/rc.d/rc.sysinit In the sysinit we load all the modules required to support the hardware in the system (that's when it prints the Initializing hardware: storage network sound other [OK] message). In order to stop that you have to move the modules out of the way. But, I'm a bit surprised that ib_mad is loaded as that doesn't seem a hard dependency for ib_mthca (or more appropriately, I'm surprised to see ib_mad and not a bunch of other ib modules as well, check the /etc/modprobe.conf and /etc/modprobe.conf.dist to see if there are rules to force lots of ib modules to be loaded any time ib_core is loaded). -- Doug Ledford http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From pw at osc.edu Fri Jun 23 11:56:25 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 23 Jun 2006 14:56:25 -0400 Subject: [openib-general] amso userspace In-Reply-To: <1151087184.7808.67.camel@stevo-desktop> References: <20060623181959.GA21488@osc.edu> <1151087184.7808.67.camel@stevo-desktop> Message-ID: <20060623185625.GB21488@osc.edu> swise at opengridcomputing.com wrote on Fri, 23 Jun 2006 13:26 -0500: > You need to use librdmacm to setup iwarp connections. That's the only > way it will work for the amso device. See librdma/examples. I also > posted a patch to perftest/rdma_lat.c and rdma_bw.c that added a -c > option to utilize the RDMA CMA. The patch didn't get pulled in, > however... Thanks for the clarification. rping and cmatose from the iwarp branch work fine. (The trunk versions are slightly different.) I'll have to think about whether I'm willing to switch over to rdmacm just yet. I was hoping to stick with my hand-rolled TCP-based connection setup, but understand why that is not possible if I want to support iwarp gear on the same code base. Having libamso and librdmacm show up in fedora-extras would definitely help us make the transition, if you want the nudging. :) Thanks for all the iwarp work. 
-- Pete From pradeep at us.ibm.com Fri Jun 23 12:26:56 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Fri, 23 Jun 2006 12:26:56 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: Message-ID: I was encountering some similar problems in the past and I put an entry into /etc/hotplug/blacklist like the following: # Mellanox InfiniBand ib_mthca This has worked for me. Pradeep pradeep at us.ibm.com Don.Albert at Bull.com Sent by: openib-general-bounces at openib.org 06/23/2006 10:14 AM To openfabrics-ewg at openib.org, openib-general at openib.org cc Subject [openib-general] Stopping Infiniband kernel modules from loading at system boot Short of uninstalling the OFED-1.0 release, how can I stop the Infiniband related kernel modules from loading at system boot? I am trying to debug a problem with programs hanging in the kernel, so I thought that I would try manually loading the modules one at a time to see if I could isolate the problem. This is on a RHEL4 U3 system with the 2.6.16 kernel and the OFED-1.0 release installed. I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" services in the /etc/rc.d/ runlevel files. I even removed "ifcfg-ib0" and "ifcfg-ib1" from the /etc/sysconfig/networking-scripts directory. I don't see any other scripts that would cause these modules to be loaded. But every time I reboot, I get the following modules loaded, according to /sbin/lsmod: ib_mthca 117424 0 ib_mad 35896 1 ib_mthca ib_core 45952 2 ib_mthca,ib_mad What have I missed? -Don Albert- _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From swise at opengridcomputing.com Fri Jun 23 13:31:37 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 15:31:37 -0500 Subject: [openib-general] [PATCH v2 2/2] iWARP changes to librdmacm. In-Reply-To: <20060620200312.20092.87834.stgit@stevo-desktop> References: <20060620200304.20092.44110.stgit@stevo-desktop> <20060620200312.20092.87834.stgit@stevo-desktop> Message-ID: <1151094697.7808.82.camel@stevo-desktop> Sean, Are these changes acceptable? Steve. On Tue, 2006-06-20 at 15:03 -0500, Steve Wise wrote: > For iWARP, rdma_disconnect() moves the QP to SQD instead of ERR. The > iWARP providers map SQD to the RDMAC verbs CLOSING state. > --- > > librdmacm/src/cma.c | 22 +++++++++++++++++++++- > 1 files changed, 21 insertions(+), 1 deletions(-) > > diff --git a/librdmacm/src/cma.c b/librdmacm/src/cma.c > index e99d15c..a250f69 100644 > --- a/librdmacm/src/cma.c > +++ b/librdmacm/src/cma.c > @@ -633,6 +633,17 @@ static int ucma_modify_qp_rts(struct rdm > return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); > } > > +static int ucma_modify_qp_sqd(struct rdma_cm_id *id) > +{ > + struct ibv_qp_attr qp_attr; > + > + if (!id->qp) > + return 0; > + > + qp_attr.qp_state = IBV_QPS_SQD; > + return ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE); > +} > + > static int ucma_modify_qp_err(struct rdma_cm_id *id) > { > struct ibv_qp_attr qp_attr; > @@ -881,7 +892,16 @@ int rdma_disconnect(struct rdma_cm_id *i > void *msg; > int ret, size; > > - ret = ucma_modify_qp_err(id); > + switch (ibv_get_transport_type(id->verbs)) { > + case IBV_TRANSPORT_IB: > + ret = ucma_modify_qp_err(id); > + break; > + case IBV_TRANSPORT_IWARP: > + ret = ucma_modify_qp_sqd(id); > + break; > + default: > + ret = -EINVAL; > + } > if (ret) > return ret; > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit 
http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Fri Jun 23 13:32:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 15:32:33 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP changes to libibverbs. In-Reply-To: References: <20060620200304.20092.44110.stgit@stevo-desktop> <20060620200308.20092.76324.stgit@stevo-desktop> Message-ID: <1151094753.7808.84.camel@stevo-desktop> On Tue, 2006-06-20 at 15:27 -0700, Roland Dreier wrote: > Looks pretty good. I'll get this into the libibverbs development tree > soon (I'm working on the MADV_DONTFORK stuff right now). > > - R. Sounds good. Once you commit the libibverbs changes, we can commit the librdma changes that depend on them (assume everyone agrees to the changes). Stevo. From Don.Albert at Bull.com Fri Jun 23 13:35:53 2006 From: Don.Albert at Bull.com (Don.Albert at Bull.com) Date: Fri, 23 Jun 2006 13:35:53 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: Message-ID: Thanks to Pradeep Satyanarayana and Boris Shpolyansky for suggesting that I add an entry to /etc/hotplug/blacklist, but I thought that the "/etc/hotplug" stuff was replaced in the latest kernels with "/etc/udev" functionality. Is this not true? -Don Albert- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Fri Jun 23 14:22:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jun 2006 17:22:14 -0400 Subject: [openib-general] [openfabrics-ewg] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: <1151088287.22762.14.camel@fc5.xsintricity.com> References: <1151088287.22762.14.camel@fc5.xsintricity.com> Message-ID: <1151097726.4391.300562.camel@hal.voltaire.com> Hi Doug, On Fri, 2006-06-23 at 14:44, Doug Ledford wrote: > On Fri, 2006-06-23 at 10:14 -0700, Don.Albert at Bull.com wrote: > > > > Short of uninstalling the OFED-1.0 release, how can I stop the > > Infiniband related kernel modules from loading at system boot? > > > > I am trying to debug a problem with programs hanging in the kernel, > > so I thought that I would try manually loading the modules one at a > > time to see if I could isolate the problem. This is on a RHEL4 U3 > > system with the 2.6.16 kernel and the OFED-1.0 release installed. > > > > I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" > > services in the /etc/rc.d/runlevel files. I even removed "ifcfg-ib0" > > and "ifcfg-ib1" from the /etc/sysconfig/networking-scriptsdirectory. > > I don't see any other scripts that would cause these modules to be > > loaded. But every time I reboot, I get the following modules loaded, > > according to /sbin/lsmod: > > > > ib_mthca 117424 0 > > ib_mad 35896 1 ib_mthca > > ib_core 45952 2 ib_mthca,ib_mad > > > > What have I missed? > > /etc/rc.d/rc.sysinit > > In the sysinit we load all the modules required to support the hardware > in the system (that's when it prints the Initializing hardware: storage > network sound other [OK] message). In order to stop that you have to > move the modules out of the way. 
But, I'm a bit surprised that ib_mad > is loaded as that doesn't seem a hard dependancy for ib_mthca (or more > appropriately, I'm surprised to see ib_mad ib_mthca.ko has a number of unresolved symbols which are in the MAD module (ib_mad.ko) as well as IB core (ib_core.ko) so those are loaded when mthca is. If you look in /lib/modules/2.6.n/modules.dep, you should see these dependencies for ib_mthca. -- Hal > and not a bunch of other ib > modules as well, check the /etc/modprobe.conf > and /etc/modprobe.conf.dist to see if there are rules to force lots of > ib modules to be loaded any time ib_core is loaded). From Don.Albert at Bull.com Fri Jun 23 14:36:13 2006 From: Don.Albert at Bull.com (Don.Albert at Bull.com) Date: Fri, 23 Jun 2006 14:36:13 -0700 Subject: [openib-general] Link Initialization problem and hangs in MTHCA on OFED-1.0 Message-ID: I was corresponding with Hal Rosenstock about this problem, but he suggested that I resubmit to a wider audience. The previous messages are under the subject of "How do I use "madeye" to diagnose a problem?". I was trying to use "madeye" to find out if any MAD packets were being received by a node in which the link fails to initialize. I have a small two-node testbed system which consists of two EM64T machines ("koa" and "jatoba") cabled back-to-back with two Mellanox MT25204 (4x DDR) HCAs. This configuration worked with a backported 2.6.11-34 kernel and revision 6500 from the OpenIB svn trunk. I was able to run basic tests and several sets of MPI benchmarks. Since moving to a "2.6.16" kernel and the OFED-1.0 release, we cannot get the link on the "jatoba" machine to come up. The "madeye" module seems to show that no MAD packets are being received when the Subnet Manager is run on the other machine. When I try to run SM on "jatoba", or try to run any other program that uses MAD, I get process hangs. 
Here is a portion of the stack traces for one of the hung processes, obtained by doing "echo t > /proc/sysrq-trigger" and looking at the dmesg output. ibis D 0000000000000003 0 5489 5097 5522 (NOTLB) ffff8100788c7d28 ffff810037cb9030 ffff8100788c7c78 ffff81007c606640 ffffffff803c1b65 0000000000000001 ffffffff801350ce ffff810003392418 ffff8100788c6000 ffff8100788c7cb8 Call Trace: {_spin_lock_irqsave+14} {lock_timer_base+27} {:ib_mthca:mthca_table_put+65} {_spin_unlock_irq+9} {wait_for_completion+179} {default_wake_function+0} {default_wake_function+0} {:ib_mad:ib_cancel_rmpp_recvs+144} {:ib_mad:ib_unregister_mad_agent+1019} {:ib_umad:ib_umad_ioctl+564} {autoremove_wake_function+0} {do_ioctl+45} {vfs_ioctl+658} {mntput_no_expire+28} {sys_ioctl+60} {system_call+126} It seems to be a lock or mutex problem, but I don't know how to proceed from here. Some things I have tried are: Connecting the two machines to a switch instead of back-to-back, to use the SM in the switch. The link to "koa" comes up, but the link to "jatoba" does not. Physically swapping the two HCAs between the two machines: the problem stays on the "jatoba" side. Turning on "debug_level" traces with "modprobe ib_mthca debug_level=1" on both machines. The traces seem to be identical on both, except for the actual PCI bus location and the memory addresses being mapped. No additional traces are generated when the hangs occur. The machines are both EM64T but are not identical. The "koa" side has the HCA on PCI "06:00.0", and the "jatoba" side has the HCA on "03:00.0". The two machines are: koa (the working one) is an Intel SE7520BD2 motherboard (7520 chip set). jatoba (the bad one) is an Intel SE7525GP2 motherboard (7525 chip set). Can anyone suggest what to try or look at next? -Don Albert- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dledford at redhat.com Fri Jun 23 14:52:46 2006 From: dledford at redhat.com (Doug Ledford) Date: Fri, 23 Jun 2006 17:52:46 -0400 Subject: [openib-general] [openfabrics-ewg] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: <1151097726.4391.300562.camel@hal.voltaire.com> References: <1151088287.22762.14.camel@fc5.xsintricity.com> <1151097726.4391.300562.camel@hal.voltaire.com> Message-ID: <1151099566.22762.21.camel@fc5.xsintricity.com> On Fri, 2006-06-23 at 17:22 -0400, Hal Rosenstock wrote: > > > ib_mthca 117424 0 > > > ib_mad 35896 1 ib_mthca ^^^^^^^^ > > > ib_core 45952 2 ib_mthca,ib_mad > > In the sysinit we load all the modules required to support the hardware > > in the system (that's when it prints the Initializing hardware: storage > > network sound other [OK] message). In order to stop that you have to > > move the modules out of the way. But, I'm a bit surprised that ib_mad > > is loaded as that doesn't seem a hard dependency for ib_mthca (or more > > appropriately, I'm surprised to see ib_mad > > ib_mthca.ko has a number of unresolved symbols which are in the MAD > module (ib_mad.ko) as well as IB core (ib_core.ko) so those are loaded > when mthca is. If you look in /lib/modules/2.6.n/modules.dep, you should > see these dependencies for ib_mthca. Yeah, if I hadn't been spacing during my reply I would have noticed what I highlighted above.... 
-- Doug Ledford http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From sean.hefty at intel.com Fri Jun 23 16:33:19 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 16:33:19 -0700 Subject: [openib-general] mckey program In-Reply-To: Message-ID: <000001c6971d$69cdab50$1c781cac@amr.corp.intel.com> >I was checking the mckey.c program for IB. >I did some quick check and found that the rdma_resolve_addr function >is invoking the cma_handler with erroneous event. > >mckey: event: 1, error: -19 > >Is there any easy way to check what might be happening? Try adding a route for 224.0.0.1 to the ipoib dev. - Sean From sean.hefty at intel.com Fri Jun 23 16:40:11 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 16:40:11 -0700 Subject: [openib-general] [PATCH v2 2/2] iWARP changes to librdmacm. In-Reply-To: <1151094697.7808.82.camel@stevo-desktop> Message-ID: <000601c6971e$5fbc0b10$1c781cac@amr.corp.intel.com> >Are these changes acceptable? These look fine to commit by me. - Sean From pradeep at us.ibm.com Fri Jun 23 16:45:36 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Fri, 23 Jun 2006 16:45:36 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: Message-ID: I am using a slightly older kernel -2.6.16-rc2 and it works for me. Pradeep pradeep at us.ibm.com Don.Albert at Bull.com Sent by: openib-general-bounces at openib.org 06/23/2006 01:35 PM To openfabrics-ewg at openib.org, openib-general at openib.org cc Subject Re: [openib-general] Stopping Infiniband kernel modules from loading at system boot Thanks to Pradeep Satyanarayana and Boris Shpolyansky for suggesting that I add an entry to /etc/hotplug/blacklist, but I thought that the "/etc/hotplug" stuff was replaced in the latest kernels with "/etc/udev" functionality. Is this not true? 
-Don Albert- _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From iod00d at hp.com Fri Jun 23 10:14:57 2006 From: iod00d at hp.com (Grant Grundler) Date: Fri, 23 Jun 2006 10:14:57 -0700 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1151071471.3204.12.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> <1151071005.7808.39.camel@stevo-desktop> <1151071471.3204.12.camel@laptopd505.fenrus.org> Message-ID: <20060623171457.GA3610@esmail.cup.hp.com> On Fri, Jun 23, 2006 at 04:04:31PM +0200, Arjan van de Ven wrote: > > I thought the posted write WILL eventually get to adapter memory. Not > > stall forever cached in a bridge. I'm wrong? > > I'm not sure there is a theoretical upper bound.... I'm not aware of one either since MMIO writes can travel across many other chips that are not constrained by PCI ordering rules (I'm thinking of SGI Altix...) > (and if it's several msec per bridge, then you have a lot of latency > anyway) That's what my original concern was when I saw you point this out. But MMIO reads here would be expensive and many drivers tolerate this latency in exchange for avoiding the MMIO read in the performance path. 
grant - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo at vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ From halr at voltaire.com Sat Jun 24 05:39:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Jun 2006 08:39:15 -0400 Subject: [openib-general] mckey program In-Reply-To: <000001c6971d$69cdab50$1c781cac@amr.corp.intel.com> References: <000001c6971d$69cdab50$1c781cac@amr.corp.intel.com> Message-ID: <1151152754.4482.7620.camel@hal.voltaire.com> On Fri, 2006-06-23 at 19:33, Sean Hefty wrote: > >I was checking the mckey.c program for IB. > >I did some quick check and found that the rdma_resolve_addr function > >is invoking the cma_handler with erroneous event. > > > >mckey: event: 1, error: -19 > > > >Is there any easy way to check what might be happening? > > Try adding a route for 224.0.0.1 to the ipoib dev. Could it also be done without adding the route but using the bind_addr option on mckey? mckey {s[end] | r[ecv]} mcast_addr [bind_addr] -- Hal > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Sun Jun 25 12:29:34 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 25 Jun 2006 22:29:34 +0300 Subject: [openib-general] [PATCH] osm: pkey manager fail to clear unused block Message-ID: <86hd296nkx.fsf@mtl066.yok.mtl.com> Hi Hal Did some basic testing with the simulator. It seems that the manager misses all the blocks that need to be cleared (as they used to have some pkeys that are not part of the policy). The following trivial patch fixes the issue. The bug sneaked in through the last few patches.
The fix avoids skipping cases where the new pkey block is null and for that sake requires an empty block to be used in the lower level function. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8189) +++ opensm/osm_pkey_mgr.c (working copy) @@ -194,10 +194,12 @@ pkey_mgr_update_pkey_entry( IN const ib_pkey_table_t *block, IN const uint16_t block_index ) { + ib_pkey_table_t empty_block = {0, }; osm_madw_context_t context; osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); uint32_t attr_mod; + if (!block) block = &empty_block; context.pkey_context.node_guid = osm_node_get_node_guid( p_node ); context.pkey_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pkey_context.set_method = TRUE; @@ -360,7 +362,7 @@ static boolean_t pkey_mgr_update_port( block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - if (block && (!new_block || !memcmp( new_block, block, sizeof( *block ) ))) + if (block && new_block && !memcmp( new_block, block, sizeof( *block ) )) continue; status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); From eitan at mellanox.co.il Mon Jun 26 00:00:48 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 26 Jun 2006 10:00:48 +0300 Subject: [openib-general] [PATCHv2] osm: pkey manager fail to clear unused block Message-ID: <86fyhs765b.fsf@mtl066.yok.mtl.com> Hi Hal Had some second thought (slept on it) about this patch. It has a problem as it would continuously set a block if it is all empty (since the new block does not exist) The new patch fixes it by catching the case of null new_block and still comparing to the old block. I also hope I did better job on the indentation (at least I used untabify). > Done some basic test with the simulator. 
Seems that the manager miss all > blocks needed to be cleared (as they used to have some pkeys that are not part of > the policy). The following trivial patch fixes the issue. The bug sneaked in > through the last few patched. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8189) +++ opensm/osm_pkey_mgr.c (working copy) @@ -276,6 +276,7 @@ static boolean_t pkey_mgr_update_port( boolean_t ret_val = FALSE; osm_pending_pkey_t *p_pending; boolean_t found; + ib_pkey_table_t empty_block = {.pkey_entry = {0}, }; p_physp = osm_port_get_default_phys_ptr( p_port ); if ( !osm_physp_is_valid( p_physp ) ) @@ -360,7 +361,8 @@ static boolean_t pkey_mgr_update_port( block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - if (block && (!new_block || !memcmp( new_block, block, sizeof( *block ) ))) + if (!new_block) new_block = &empty_block; + if (block && !memcmp( new_block, block, sizeof( *block ) )) continue; status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); From jackm at mellanox.co.il Mon Jun 26 00:51:12 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 26 Jun 2006 10:51:12 +0300 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) Message-ID: <200606261051.12515.jackm@mellanox.co.il> Problem in main trunk (SVN 8189): The following Oops occurred upon unloading the openib driver. I unloaded the driver immediately following a reboot (the driver had been loaded during the boot sequence). I did NOT run opensm before unloading the driver. Evidently, ipoib was still attempting to connect with an SA, when the ipoib module was unloaded (modprobe -r). 
After the ipoib module was unloaded (or at least rendered inaccessible), the ib_sa module attempted to invoke "ib_sa_mcmember_rec_callback" (for a callback address that was part of the unloaded ipoib module). Hence, the Oops below. The "modprobe" process in the trace below is "modprobe -r ib_sa" (After unloading ib_ipoib, we attempt to unload ib_sa). Following the Oops, I've included info on the running environment. Jack =============================================== Jun 26 10:19:56 sw134 ifdown: ib0 device: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Jun 26 10:19:58 sw134 kernel: Unable to handle kernel paging request at ffffffff883219dd RIP: Jun 26 10:19:58 sw134 kernel: [] Jun 26 10:19:58 sw134 kernel: PGD 103027 PUD 105027 PMD 7bd53067 PTE 0 Jun 26 10:19:58 sw134 kernel: Oops: 0010 [1] SMP Jun 26 10:19:58 sw134 kernel: last sysfs file: /devices/pci0000:00/0000:00:00.0/irq Jun 26 10:19:58 sw134 kernel: CPU 2 Jun 26 10:19:58 sw134 kernel: Modules linked in: autofs4 ipv6 ib_sa ib_uverbs ib_umad nfs lockd nfs_acl sunrpc ib_mthca ib_mad ib_core af_ packet button battery ac apparmor aamatch_pcre loop dm_mod hw_random shpchp ehci_hcd uhci_hcd i8xx_tco usbcore pci_hotplug e1000 i2c_i801 i2c_core ide_cd cdrom floppy ext3 jbd sg edd fan thermal processor ata_piix libata piix sd_mod scsi_mod ide_disk ide_core Jun 26 10:19:58 sw134 kernel: Pid: 4457, comm: modprobe Tainted: G U 2.6.16.16-1.6-smp #1 Jun 26 10:19:58 sw134 kernel: RIP: 0010:[] [] Jun 26 10:19:58 sw134 kernel: RSP: 0018:ffff81007163dd90 EFLAGS: 00010246 Jun 26 10:19:58 sw134 kernel: RAX: 0000000000000005 RBX: ffff81007d78be00 RCX: ffffffff8831747f Jun 26 10:19:58 sw134 kernel: RDX: ffff81007dec3000 RSI: 0000000000000000 RDI: 00000000fffffffc Jun 26 10:19:58 sw134 kernel: RBP: ffff810079960fd0 R08: 0000000000000206 R09: 0000000000000002 Jun 26 10:19:58 sw134 kernel: R10: ffff810001029400 R11: 0000000000000000 R12: 00000000fffffffc Jun 26 10:19:58 sw134 kernel: R13: 0000000000000000 R14: 
00000000005182a8 R15: 0000000000000000 Jun 26 10:19:58 sw134 kernel: FS: 00002ba7037ef6d0(0000) GS:ffff81007e3ab340(0000) knlGS:0000000000000000 Jun 26 10:19:58 sw134 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 26 10:19:58 sw134 kernel: CR2: ffffffff883219dd CR3: 0000000072da0000 CR4: 00000000000006e0 Jun 26 10:19:58 sw134 ifdown: ib0 Jun 26 10:19:58 sw134 kernel: Process modprobe (pid: 4457, threadinfo ffff81007163c000, task ffff81006fcb7040) Jun 26 10:19:58 sw134 ifdown: Interface not available and no configuration found. Jun 26 10:19:58 sw134 kernel: Stack: ffffffff883174bf 0000000000000bd4 000000027163de78 ffff81007163de80 Jun 26 10:19:58 sw134 kernel: ffff81007163de78 ffff81007d810790 ffff81007163de68 0000000000000001 Jun 26 10:19:59 sw134 kernel: 0000000000000000 ffff81007d78be00 Jun 26 10:19:59 sw134 kernel: Call Trace: {:ib_sa:ib_sa_mcmember_rec_callback+64} Jun 26 10:19:59 sw134 kernel: {:ib_sa:send_handler+72} {:ib_mad:ib_unregister_mad_agent+345} Jun 26 10:19:59 sw134 kernel: {wait_for_completion+155} {find_next_bit+85} Jun 26 10:19:59 sw134 kernel: {:ib_sa:ib_sa_remove_one+58} {:ib_core:ib_unregister_client+47} Jun 26 10:19:59 sw134 kernel: {:ib_sa:ib_sa_cleanup+16} {sys_delete_module+540} Jun 26 10:19:59 sw134 kernel: {do_munmap+619} {__up_write+33} Jun 26 10:19:59 sw134 kernel: {system_call+126} Jun 26 10:19:59 sw134 kernel: Jun 26 10:19:59 sw134 kernel: Code: Bad RIP value. 
Jun 26 10:19:59 sw134 kernel: RIP [] RSP Jun 26 10:19:59 sw134 kernel: CR2: ffffffff883219dd Jun 26 10:20:01 sw134 /usr/sbin/cron[4615]: (root) CMD (/mswg/projects/test_suite2/etc/check_daemon.csh >/dev/null) =================================== Host information given below: ************************************************************* Host Architecture : x86_64 Linux Distribution: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10 Kernel Version : 2.6.16.16-1.6-smp Memory size : 2060956 kB Driver Version : openib_gen2-20060625-1800 (REV=8189) HCA ID(s) : mthca0 HCA model(s) : 25204 FW version(s) : 1.0.800 Board(s) : MT_0230000001 ************************************************************* From bugzilla-daemon at openib.org Mon Jun 26 01:15:05 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 26 Jun 2006 01:15:05 -0700 (PDT) Subject: [openib-general] [Bug 148] New: WSD: When connecting to a remote host, with no socket listening, time out is returned Message-ID: <20060626081505.AA5A2228735@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=148 Summary: WSD: When connecting to a remote host, with no socket listening, time out is returned Product: OpenFabrics Windows Version: unspecified Platform: Other OS/Version: Other Status: NEW Severity: major Priority: P2 Component: WSD AssignedTo: bugzilla at openib.org ReportedBy: tzachid at mellanox.co.il As a result, it takes about 20 seconds for the connection to fall back to IPOIB. On TCP, the remote side will send a reset, and the connection will end in about a second. One more consequence of this problem is that when there is a fallback to IPOIB, it takes ~20 seconds to realize that no one is listening there instead of less than 1 ms. This also causes one of the WHQL tests to fail (waiting for the timeout takes too long). An investigation by Yossi showed that this issue is related to the CM. It seems that Invalid_sid was returned from the remote side and was ignored.
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Mon Jun 26 01:16:25 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 26 Jun 2006 01:16:25 -0700 (PDT) Subject: [openib-general] [Bug 148] WSD: When connecting to a remote host, with no socket listening, time out is returned Message-ID: <20060626081625.A5C87228735@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=148 ------- Comment #1 from tzachid at mellanox.co.il 2006-06-26 01:16 ------- Created an attachment (id=27) --> (http://openib.org/bugzilla/attachment.cgi?id=27&action=view) Suggested fix The following patch by Yossi solves this problem. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Mon Jun 26 01:19:04 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 26 Jun 2006 01:19:04 -0700 (PDT) Subject: [openib-general] [Bug 148] WSD: When connecting to a remote host, with no socket listening, time out is returned Message-ID: <20060626081904.04507228738@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=148 tzachid at mellanox.co.il changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|bugzilla at openib.org |ftillier at silverstorm.com ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From ogerlitz at voltaire.com Mon Jun 26 03:01:18 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 26 Jun 2006 13:01:18 +0300 Subject: [openib-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: Message-ID: <449FB06E.3020709@voltaire.com> Roland Dreier wrote: > Linus, please pull from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This is mostly merging the new iSER (iSCSI over RDMA transport) initiator: Hi Roland, Following the merge done by Linus yesterday, I have cloned, built, installed and booted the linux-2.6 tree and am now running iSER over it! Thanks a lot for all your help and guidance (&& compile error findings...) through the upstream push cycle. I'd like to thank Mike Christie for his cooperation (and patience) in integrating iSER within the open-iscsi framework and much help along the push cycle, especially for working on the (now upstream) libiscsi. iSER is the first consumer of the (now upstream) RDMA CM; I'd like to thank Sean Hefty for his cooperation in the CMA design cycle and the very fast and robust coding while implementing it. Or. From bpradip at in.ibm.com Mon Jun 26 03:24:19 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Mon, 26 Jun 2006 15:54:19 +0530 Subject: [openib-general] [PATCH 0/2] perftest: Modified perftest utils to work with new stack and libraries Message-ID: <20060626102410.GA17835@harry-potter.ibm.com> Modified the perftest utilities to work with the latest stack and libraries. This patchset consists of changes for rdma_lat and rdma_bw only.
1 - rdma_lat.c changes 2 - rdma_bw.c changes -- Thanks, Pradipta Kumar From bpradip at in.ibm.com Mon Jun 26 03:27:16 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Mon, 26 Jun 2006 15:57:16 +0530 Subject: [openib-general] [PATCH 1/2] perftest: Modified perftest utils to work with new stack and libraries Message-ID: <20060626102715.GB17835@harry-potter.ibm.com> This is the patch for rdma_lat.c Signed-off-by: Pradipta Kumar Banerjee --- Index: rdma_lat.c ============================================================================= --- ../perftest-org/rdma_lat.c 2006-06-22 18:28:13.000000000 +0530 +++ rdma_lat.c 2006-06-22 18:36:12.000000000 +0530 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -83,6 +84,7 @@ struct pingpong_context { struct ibv_sge list; struct ibv_send_wr wr; struct rdma_cm_id *cm_id; + struct rdma_event_channel *cm_channel; }; struct pingpong_dest { @@ -612,11 +614,12 @@ static void pp_close_cma(struct pingpong } } - rdma_get_cm_event(&event); + rdma_get_cm_event(ctx->cm_channel, &event); if (event->event != RDMA_CM_EVENT_DISCONNECTED) printf("unexpected event during disconnect %d\n", event->event); rdma_ack_cm_event(event); rdma_destroy_id(ctx->cm_id); + rdma_destroy_event_channel(ctx->cm_channel); } static struct pingpong_context *pp_server_connect_cma(unsigned short port, int size, int tx_depth, @@ -629,17 +632,26 @@ static struct pingpong_context *pp_serve int ret; struct sockaddr_in sin; struct rdma_cm_id *child_cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; - + printf("%s starting server\n", __FUNCTION__); - ret = rdma_create_id(&listen_id, NULL); - if (ret) { - fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); return NULL; } + ret = rdma_create_id(channel, &listen_id, NULL); + if 
(ret) { + fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); + goto err3; + } + memset(&sin, 0, sizeof(sin)); sin.sin_addr.s_addr = 0; - sin.sin_family = PF_INET; + sin.sin_family = AF_INET; sin.sin_port = htons(port); ret = rdma_bind_addr(listen_id, (struct sockaddr *)&sin); if (ret) { @@ -653,7 +665,7 @@ static struct pingpong_context *pp_serve goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -678,7 +690,8 @@ static struct pingpong_context *pp_serve fprintf(stderr,"%s pp_init_cma_ctx failed\n", __FUNCTION__); goto err0; } - + + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xbb; my_dest->rkey = ctx->mr->rkey; @@ -694,7 +707,7 @@ static struct pingpong_context *pp_serve goto err0; } rdma_ack_cm_event(event); - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) { fprintf(stderr,"rdma_get_cm_event error %d\n", ret); rdma_destroy_id(child_cm_id); @@ -713,8 +726,10 @@ err0: err1: rdma_ack_cm_event(event); err2: - rdma_destroy_id(listen_id); fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); + rdma_destroy_id(listen_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } @@ -750,6 +765,7 @@ static struct pingpong_context *pp_clien int ret; struct sockaddr_in sin; struct rdma_cm_id *cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; fprintf(stderr,"%s starting client\n", __FUNCTION__); @@ -758,10 +774,18 @@ static struct pingpong_context *pp_clien return NULL; } - ret = rdma_create_id(&cm_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &cm_id, NULL); if (ret) { fprintf(stderr,"%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_family = PF_INET; @@ -772,7 +796,7 @@ static struct 
pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -789,7 +813,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -806,7 +830,8 @@ static struct pingpong_context *pp_clien fprintf(stderr,"%s pp_init_cma_ctx failed\n", __FUNCTION__); goto err2; } - + + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xaa; my_dest->rkey = ctx->mr->rkey; @@ -823,7 +848,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -845,8 +870,10 @@ static struct pingpong_context *pp_clien err1: rdma_ack_cm_event(event); err2: - fprintf(stderr,"NOT connected!\n"); + fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); rdma_destroy_id(cm_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } From bpradip at in.ibm.com Mon Jun 26 03:29:28 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Mon, 26 Jun 2006 15:59:28 +0530 Subject: [openib-general] [PATCH 2/2] perftest: Modified perftest utils to work with new stack and libraries Message-ID: <20060626102926.GC17835@harry-potter.ibm.com> This is the patch for rdma_bw.c Signed-off-by: Pradipta Kumar Banerjee --- Index: rdma_bw.c ============================================================================= --- ../perftest-org/rdma_bw.c 2006-06-22 18:28:13.000000000 +0530 +++ rdma_bw.c 2006-06-22 18:40:01.000000000 +0530 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -75,6 +76,7 @@ struct pingpong_context { struct ibv_sge list; struct ibv_send_wr wr; struct rdma_cm_id *cm_id; + struct rdma_event_channel *cm_channel; }; struct pingpong_dest { @@ -545,11 +547,12 @@ static void pp_close_cma(struct pingpong } } - rdma_get_cm_event(&event); + rdma_get_cm_event(ctx->cm_channel, &event); if 
(event->event != RDMA_CM_EVENT_DISCONNECTED) printf("unexpected event during disconnect %d\n", event->event); rdma_ack_cm_event(event); rdma_destroy_id(ctx->cm_id); + rdma_destroy_event_channel(ctx->cm_channel); } static struct pingpong_context *pp_server_connect_cma(unsigned short port, int size, int tx_depth, @@ -562,13 +565,22 @@ static struct pingpong_context *pp_serve int ret; struct sockaddr_in sin; struct rdma_cm_id *child_cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; printf("%s starting server\n", __FUNCTION__); - ret = rdma_create_id(&listen_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &listen_id, NULL); if (ret) { fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_addr.s_addr = 0; @@ -586,7 +598,7 @@ static struct pingpong_context *pp_serve goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -612,6 +624,7 @@ static struct pingpong_context *pp_serve goto err0; } + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xbb; my_dest->rkey = ctx->mr->rkey; @@ -627,7 +640,7 @@ static struct pingpong_context *pp_serve goto err0; } rdma_ack_cm_event(event); - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) { fprintf(stderr,"rdma_get_cm_event error %d\n", ret); rdma_destroy_id(child_cm_id); @@ -646,8 +659,10 @@ err0: err1: rdma_ack_cm_event(event); err2: - rdma_destroy_id(listen_id); fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); + rdma_destroy_id(listen_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } @@ -683,6 +698,7 @@ static struct pingpong_context *pp_clien int ret; struct sockaddr_in sin; struct rdma_cm_id *cm_id; + struct rdma_event_channel *channel; struct 
pingpong_context *ctx; fprintf(stderr,"%s starting client\n", __FUNCTION__); @@ -691,10 +707,18 @@ static struct pingpong_context *pp_clien return NULL; } - ret = rdma_create_id(&cm_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &cm_id, NULL); if (ret) { fprintf(stderr,"%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_family = PF_INET; @@ -705,7 +729,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -722,7 +746,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -740,6 +764,7 @@ static struct pingpong_context *pp_clien goto err2; } + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xaa; my_dest->rkey = ctx->mr->rkey; @@ -756,7 +781,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -779,6 +804,8 @@ err1: err2: fprintf(stderr,"NOT connected!\n"); rdma_destroy_id(cm_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } From halr at voltaire.com Mon Jun 26 03:43:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Jun 2006 06:43:48 -0400 Subject: [openib-general] [PATCHv2] osm: pkey manager fail to clear unused block In-Reply-To: <86fyhs765b.fsf@mtl066.yok.mtl.com> References: <86fyhs765b.fsf@mtl066.yok.mtl.com> Message-ID: <1151318627.4482.119971.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-06-26 at 03:00, Eitan Zahavi wrote: > Hi Hal > > Had some second thought (slept on it) about this patch. 
> It has a problem as it would continuously set a block if it is all empty (since the new > block does not exist) > > The new patch fixes it by catching the case of null new_block and > still comparing to the old block. > > I also hope I did better job on the indentation (at least I used untabify). > > > Done some basic test with the simulator. Seems that the manager miss all > > blocks needed to be cleared (as they used to have some pkeys that are not part of > > the policy). The following trivial patch fixes the issue. The bug sneaked in > > through the last few patched. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied. -- Hal From sashak at voltaire.com Mon Jun 26 07:42:43 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 26 Jun 2006 17:42:43 +0300 Subject: [openib-general] [PATCHv2] osm: pkey manager fail to clear unused block In-Reply-To: <86fyhs765b.fsf@mtl066.yok.mtl.com> References: <86fyhs765b.fsf@mtl066.yok.mtl.com> Message-ID: <20060626144243.GF16738@sashak.voltaire.com> Hi Eitan, On 10:00 Mon 26 Jun 2006, Eitan Zahavi wrote: > Hi Hal > > Had some second thought (slept on it) about this patch. > It has a problem as it would continuously set a block if it is all empty (since the new > block does not exist) > > The new patch fixes it by catching the case of null new_block and > still comparing to the old block. > > I also hope I did better job on the indentation (at least I used untabify). > > > Done some basic test with the simulator. Seems that the manager miss all > > blocks needed to be cleared (as they used to have some pkeys that are not part of > > the policy). The following trivial patch fixes the issue. The bug sneaked in > > through the last few patched. What about the peer port's pkey table update? Is there the same problem?
Sasha > > Eitan > > Signed-off-by: Eitan Zahavi > Index: opensm/osm_pkey_mgr.c > =================================================================== > --- opensm/osm_pkey_mgr.c (revision 8189) > +++ opensm/osm_pkey_mgr.c (working copy) > @@ -276,6 +276,7 @@ static boolean_t pkey_mgr_update_port( > boolean_t ret_val = FALSE; > osm_pending_pkey_t *p_pending; > boolean_t found; > + ib_pkey_table_t empty_block = {.pkey_entry = {0}, }; > > p_physp = osm_port_get_default_phys_ptr( p_port ); > if ( !osm_physp_is_valid( p_physp ) ) > @@ -360,7 +361,8 @@ static boolean_t pkey_mgr_update_port( > block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > > - if (block && (!new_block || !memcmp( new_block, block, sizeof( *block ) ))) > + if (!new_block) new_block = &empty_block; > + if (block && !memcmp( new_block, block, sizeof( *block ) )) > continue; > > status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From iod00d at hp.com Mon Jun 26 08:15:06 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 26 Jun 2006 08:15:06 -0700 Subject: [openib-general] [PATCH 0/2] perftest: Modified perftest utils to work with new stack and libraries In-Reply-To: <20060626102410.GA17835@harry-potter.ibm.com> References: <20060626102410.GA17835@harry-potter.ibm.com> Message-ID: <20060626151506.GA14684@esmail.cup.hp.com> On Mon, Jun 26, 2006 at 03:54:19PM +0530, Pradipta Kumar Banerjee wrote: > modified perftest utilities to work with the latest stack and libraries. > This patchset consists changes for rdma_lat and rdma_bw only. 
> > 1 - rdma_lat.c changes > 2 - rdma_bw.c changes Pradipta, thanks for posting the patches...but could you do us a favor and provide a useful changelog entry? We can see it's a patch and which files the patch modifies. The changelog should summarize "what problem does this patch fix?". thanks again, grant From halr at voltaire.com Mon Jun 26 08:15:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Jun 2006 11:15:53 -0400 Subject: [openib-general] [PATCHv2] osm: pkey manager fail to clear unused block In-Reply-To: <20060626144243.GF16738@sashak.voltaire.com> References: <86fyhs765b.fsf@mtl066.yok.mtl.com> <20060626144243.GF16738@sashak.voltaire.com> Message-ID: <1151334761.4482.130837.camel@hal.voltaire.com> On Mon, 2006-06-26 at 10:42, Sasha Khapyorsky wrote: > Hi Eitan, > > On 10:00 Mon 26 Jun , Eitan Zahavi wrote: > > Hi Hal > > > > Had some second thought (slept on it) about this patch. > > It has a problem as it would continuously set a block if it is all empty (since the new > > block does not exist) > > > > The new patch fixes it by catching the case of null new_block and > > still comparing to the old block. > > > > I also hope I did better job on the indentation (at least I used untabify). > > > > > Done some basic test with the simulator. Seems that the manager miss all > > > blocks needed to be cleared (as they used to have some pkeys that are not part of > > > the policy). The following trivial patch fixes the issue. The bug sneaked in > > > through the last few patched. > > And what with peer port's pkey table update. Is there the same problem? Looks to me like the same logic is there. 
-- Hal > > Sasha > > > > > Eitan > > > > Signed-off-by: Eitan Zahavi > > Index: opensm/osm_pkey_mgr.c > > =================================================================== > > --- opensm/osm_pkey_mgr.c (revision 8189) > > +++ opensm/osm_pkey_mgr.c (working copy) > > @@ -276,6 +276,7 @@ static boolean_t pkey_mgr_update_port( > > boolean_t ret_val = FALSE; > > osm_pending_pkey_t *p_pending; > > boolean_t found; > > + ib_pkey_table_t empty_block = {.pkey_entry = {0}, }; > > > > p_physp = osm_port_get_default_phys_ptr( p_port ); > > if ( !osm_physp_is_valid( p_physp ) ) > > @@ -360,7 +361,8 @@ static boolean_t pkey_mgr_update_port( > > block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > > new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > > > > - if (block && (!new_block || !memcmp( new_block, block, sizeof( *block ) ))) > > + if (!new_block) new_block = &empty_block; > > + if (block && !memcmp( new_block, block, sizeof( *block ) )) > > continue; > > > > status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From sashak at voltaire.com Mon Jun 26 08:35:00 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 26 Jun 2006 18:35:00 +0300 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID Message-ID: <20060626153500.18078.85785.stgit@sashak.voltaire.com> Match MAD TransactionID on receiving. This prevents request/response MADs mixing - reproducible when poll() (in libibumad) returns timeout. 
Signed-off-by: Sasha Khapyorsky --- libibmad/src/rpc.c | 66 ++++++++++++++++++++++++---------------------------- 1 files changed, 31 insertions(+), 35 deletions(-) diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index e929ba4..d9dc407 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -105,57 +105,54 @@ madrpc_portid(void) } static int -_do_madrpc(void *umad, int agentid, int len, int timeout) +_do_madrpc(void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) { + uint32_t trid; /* only low 32 bits */ int retries; int length, status; - ib_user_mad_t *mad; - ib_mad_addr_t addr; if (!timeout) timeout = def_madrpc_timeout; if (ibdebug > 1) { IBWARN(">>> sending: len %d pktsz %d", len, umad_size() + len); - xdump(stderr, "send buf\n", umad, umad_size() + len); + xdump(stderr, "send buf\n", sndbuf, umad_size() + len); } - /* Save user MAD header in case of retry */ - mad = umad; - memcpy(&addr, &mad->addr, sizeof addr); - if (save_mad) { - memcpy(save_mad, umad_get_mad(umad), + memcpy(save_mad, umad_get_mad(sndbuf), save_mad_len < len ? save_mad_len : len); save_mad = 0; } + trid = mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); + for (retries = 0; retries < madrpc_retries; retries++) { if (retries) { ERRS("retry %d (timeout %d ms)", retries, timeout); - /* Restore user MAD header */ - memcpy(&mad->addr, &addr, sizeof addr); } length = len; - if (umad_send(mad_portid, agentid, umad, length, timeout, 0) < 0) { + if (umad_send(mad_portid, agentid, sndbuf, length, timeout, 0) < 0) { IBWARN("send failed; %m"); return -1; } /* Use same timeout on receive side just in case */ /* send packet is lost somewhere. 
*/ - if (umad_recv(mad_portid, umad, &length, timeout) < 0) { - IBWARN("recv failed: %m"); - return -1; - } - - if (ibdebug > 1) { - IBWARN("rcv buf:"); - xdump(stderr, "rcv buf\n", umad_get_mad(umad), IB_MAD_SIZE); - } - - status = umad_status(umad); + do { + if (umad_recv(mad_portid, rcvbuf, &length, timeout) < 0) { + IBWARN("recv failed: %m"); + return -1; + } + + if (ibdebug > 1) { + IBWARN("rcv buf:"); + xdump(stderr, "rcv buf\n", umad_get_mad(rcvbuf), IB_MAD_SIZE); + } + } while ((uint32_t)mad_get_field64(umad_get_mad(rcvbuf), 0, IB_MAD_TRID_F) != trid); + + status = umad_status(rcvbuf); if (!status) return length; /* done */ if (status == ENOMEM) @@ -170,19 +167,19 @@ void * madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) { int status, len; - uint8_t pktbuf[1024], *mad; - void *umad = pktbuf; + uint8_t sndbuf[1024], rcvbuf[1024], *mad; - memset(pktbuf, 0, umad_size() + IB_MAD_SIZE); + len = 0; + memset(sndbuf, 0, umad_size() + IB_MAD_SIZE); - if ((len = mad_build_pkt(umad, rpc, dport, 0, payload)) < 0) + if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) return 0; - if ((len = _do_madrpc(umad, mad_class_agent(rpc->mgtclass), + if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), len, rpc->timeout)) < 0) return 0; - mad = umad_get_mad(umad); + mad = umad_get_mad(rcvbuf); if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F)) != 0) { ERRS("MAD completed with error status 0x%x", status); @@ -204,21 +201,20 @@ void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) { int status, len; - uint8_t pktbuf[1024], *mad; - void *umad = pktbuf; + uint8_t sndbuf[1024], rcvbuf[1024], *mad; - memset(pktbuf, 0, umad_size() + IB_MAD_SIZE); + memset(sndbuf, 0, umad_size() + IB_MAD_SIZE); DEBUG("rmpp %p data %p", rmpp, data); - if ((len = mad_build_pkt(umad, rpc, dport, rmpp, data)) < 0) + if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) return 0; - if ((len = _do_madrpc(umad, 
mad_class_agent(rpc->mgtclass), + if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), len, rpc->timeout)) < 0) return 0; - mad = umad_get_mad(umad); + mad = umad_get_mad(rcvbuf); if ((status = mad_get_field(mad, 0, IB_MAD_STATUS_F)) != 0) { ERRS("MAD completed with error status 0x%x", status); From rdreier at cisco.com Mon Jun 26 09:27:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 Jun 2006 09:27:34 -0700 Subject: [openib-general] it's a girl... Message-ID: Hi, just quick note to let everyone know that my daughter was born last week. So please don't expect me to do anything, read anything, think about anything, or accomplish anything at all for a while... - Roland From mshefty at ichips.intel.com Mon Jun 26 09:44:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 09:44:31 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <001e01c69300$b9020c00$020010ac@haggard> References: <1150465355.29508.4.camel@stevo-desktop> <4492D706.4060106@ichips.intel.com> <15ddcffd0606180435g366a6effs4d4826c8b3fbbd4f@mail.gmail.com> <001e01c69300$b9020c00$020010ac@haggard> Message-ID: <44A00EEF.702@ichips.intel.com> Steve Wise wrote: > I agree that it would be nice to get this into 2.6.18. It seems stable > enough IMO. It's not a stability issue. We wanted to make sure that the user to kernel interface was correct before pushing anything upstream. At the time the decision was made (a couple of months ago), this made sense, and the ABI has changed since that time. It would be nice to know that there are at least a couple of applications using the userspace library before trying to push anything upstream. I know that DAPL is using it. Are there any others? - Sean From mst at mellanox.co.il Mon Jun 26 10:46:11 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 26 Jun 2006 20:46:11 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: References: <20060530183454.GH10234@mellanox.co.il> Message-ID: <20060626174611.GB19929@mellanox.co.il> Quoting r. Sean Hefty : > >Yes, that was my thinking. To avoid touching all users, maybe the simplest way > >is to make ib_cm discard the new cm_id without reject if the client callback > >returned -ENOMEM? > > > >If you consider that in out of memory situation sending reject will also likely > >fail, this might be a good idea, regardless. > > > >Sounds good? > > I'd like to get some other feedback, but this approach sounds reasonable. Here's an untested patch that does this. Comments? Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/core/cma.c =================================================================== --- src.orig/drivers/infiniband/core/cma.c 2006-06-07 11:33:04.359936000 +0300 +++ src/drivers/infiniband/core/cma.c 2006-06-15 13:44:07.030643000 +0300 @@ -118,7 +118,8 @@ struct rdma_id_private { wait_queue_head_t wait_remove; atomic_t dev_remove; int backlog; + atomic_t curr_backlog; int timeout_ms; struct ib_sa_query *query; int query_id; @@ -328,6 +329,7 @@ struct rdma_cm_id* rdma_create_id(rdma_c atomic_set(&id_priv->dev_remove, 0); INIT_LIST_HEAD(&id_priv->listen_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); + atomic_set(&id_priv->curr_backlog, 0); return &id_priv->id; } @@ -1022,6 +1024,9 @@ static int cma_listen_handler(struct rdm { struct rdma_id_private *id_priv = id->context; + if (atomic_read(&id_priv->curr_backlog) > id_priv->backlog) + return -ENOMEM; + id->context = id_priv->id.context; id->event_handler = id_priv->id.event_handler; return id_priv->id.event_handler(id, event); @@ -1870,6 +1875,25 @@ out: } EXPORT_SYMBOL(rdma_disconnect); + +void rdma_backlog_added_one(struct rdma_cm_id *id) +{ + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, 
id); + atomic_inc(&id_priv->curr_backlog); +} +EXPORT_SYMBOL(rdma_backlog_added_one); + +void rdma_backlog_removed_one(struct rdma_cm_id *id) +{ + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, id); + atomic_dec(&id_priv->curr_backlog); +} +EXPORT_SYMBOL(rdma_backlog_removed_one); + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; Index: src/drivers/infiniband/include/rdma/rdma_cm.h =================================================================== --- src.orig/drivers/infiniband/include/rdma/rdma_cm.h 2006-05-10 11:18:37.538572000 +0300 +++ src/drivers/infiniband/include/rdma/rdma_cm.h 2006-06-15 15:49:37.708725000 +0300 @@ -252,5 +252,21 @@ int rdma_reject(struct rdma_cm_id *id, c */ int rdma_disconnect(struct rdma_cm_id *id); +/** + * rdma_backlog_added_one - This function is called by the passive side to + * notify cma that one connection request has been added to backlog queue. + * + * No error checking is done here (e.g., if backlog is already at max, etc) + */ +void rdma_backlog_added_one(struct rdma_cm_id *id); + +/** + * rdma_backlog_added_one - This function is called by the passive side to + * notify cma that one connection request has been added to backlog queue. 
+ * + * No error checking is done here (e.g., if queue was already empty) + */ +void rdma_backlog_removed_one(struct rdma_cm_id *id); + #endif /* RDMA_CM_H */ Index: src/drivers/infiniband/core/cm.c =================================================================== --- src.orig/drivers/infiniband/core/cm.c 2006-06-07 11:33:04.109937000 +0300 +++ src/drivers/infiniband/core/cm.c 2006-06-15 15:50:34.222140000 +0300 @@ -701,6 +701,19 @@ static void cm_reset_to_idle(struct cm_i } } +static void ib_destroy_cm_id_no_resp(struct ib_cm_id *cm_id) +{ + struct cm_id_private *cm_id_priv; + + cm_id_priv = container_of(cm_id, struct cm_id_private, id); + + cm_free_id(cm_id->local_id); + cm_deref_id(cm_id_priv); + kfree(cm_id_priv->compare_data); + kfree(cm_id_priv->private_data); + kfree(cm_id_priv); +} + void ib_destroy_cm_id(struct ib_cm_id *cm_id) { struct cm_id_private *cm_id_priv; @@ -1162,7 +1177,10 @@ static void cm_process_work(struct cm_id cm_free_work(work); } cm_deref_id(cm_id_priv); - if (ret) + + if (ret == -ENOMEM) + ib_destroy_cm_id_no_resp(&cm_id_priv->id); + else if (ret) ib_destroy_cm_id(&cm_id_priv->id); } -- MST From mst at mellanox.co.il Mon Jun 26 10:41:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 20:41:17 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A00EEF.702@ichips.intel.com> References: <44A00EEF.702@ichips.intel.com> Message-ID: <20060626174117.GA19929@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: ucma into kernel.org > > Steve Wise wrote: > > I agree that it would be nice to get this into 2.6.18. It seems stable > > enough IMO. > > It's not a stability issue. We wanted to make sure that the user to kernel > interface was correct before pushing anything upstream. How about the cma changes required by ucma to get/set options? I think they are not upstream yet. Could these go upstream, to make building ucma out-of-kernel possible, without kernel patches? 
-- MST From halr at voltaire.com Mon Jun 26 11:04:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Jun 2006 14:04:17 -0400 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID In-Reply-To: <20060626153500.18078.85785.stgit@sashak.voltaire.com> References: <20060626153500.18078.85785.stgit@sashak.voltaire.com> Message-ID: <1151345056.4482.137712.camel@hal.voltaire.com> On Mon, 2006-06-26 at 11:35, Sasha Khapyorsky wrote: > Match MAD TransactionID on receiving. This prevents request/response MADs > mixing - reproducible when poll() (in libibumad) returns timeout. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From pradeep at us.ibm.com Mon Jun 26 11:19:55 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 26 Jun 2006 11:19:55 -0700 Subject: [openib-general] bug #33 Message-ID: I am curious - was the root cause of bug #33 determined? Which of the fixes between OFED RC4 and RC5 closed this bug? Pradeep pradeep at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Jun 26 11:21:54 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 11:21:54 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060626174611.GB19929@mellanox.co.il> References: <20060530183454.GH10234@mellanox.co.il> <20060626174611.GB19929@mellanox.co.il> Message-ID: <44A025C2.2070204@ichips.intel.com> Michael S. Tsirkin wrote: > Here's an untested patch that does this. Comments? Rather than exporting wrapper functions around atomic inc/dec, I would rather the user just maintain the current backlog themselves, with the patch limited to the cm.c file only.
> Index: src/drivers/infiniband/core/cm.c > =================================================================== > --- src.orig/drivers/infiniband/core/cm.c 2006-06-07 11:33:04.109937000 +0300 > +++ src/drivers/infiniband/core/cm.c 2006-06-15 15:50:34.222140000 +0300 > @@ -701,6 +701,19 @@ static void cm_reset_to_idle(struct cm_i > } > } > > +static void ib_destroy_cm_id_no_resp(struct ib_cm_id *cm_id) > +{ > + struct cm_id_private *cm_id_priv; > + > + cm_id_priv = container_of(cm_id, struct cm_id_private, id); > + > + cm_free_id(cm_id->local_id); > + cm_deref_id(cm_id_priv); > + kfree(cm_id_priv->compare_data); > + kfree(cm_id_priv->private_data); > + kfree(cm_id_priv); > +} I think that we need to dequeue and free any additional work items as well here. See the bottom of ib_destroy_cm_id(). (It may make sense for ib_destroy_cm_id() to call the new routine, but I'm not sure about that yet.) We will also need to wait for all references on the cm_id to go to 0. (Incoming MADs could be accessing the cm_id, such as receiving a REJ while we're processing a REQ.) There are likely some additional race conditions / cleanup not handled here as well. We may still need to perform some state checking to ensure that the cm_id is not in any lists / trees, and that there are no outstanding MADs associated with the id. (A user could have sent an MRA or other CM MAD from their callback, before returning an error.) - Sean From mshefty at ichips.intel.com Mon Jun 26 11:25:47 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 11:25:47 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060626174117.GA19929@mellanox.co.il> References: <44A00EEF.702@ichips.intel.com> <20060626174117.GA19929@mellanox.co.il> Message-ID: <44A026AB.8090607@ichips.intel.com> Michael S. Tsirkin wrote: > How about the cma changes required by ucma to get/set options? I think they are > not upstream yet.
Could these go upstream, to make building ucma out-of-kernel > possible, without kernel patches? Wouldn't you have to patch the kernel to include the kernel ucma anyway? - Sean From narravul at cse.ohio-state.edu Mon Jun 26 11:22:03 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 26 Jun 2006 14:22:03 -0400 (EDT) Subject: [openib-general] Interface for getting RNIC's IP address Message-ID: Is there any s/w interface to obtain the local RNIC's IP address? The current rdma cm examples, rping and cmatose, require the user to enter the ip address as a command line parameter. I am currently looking for a way to get this programmatically. Thanks, --Sundeep. From mst at mellanox.co.il Mon Jun 26 11:37:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 21:37:11 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <44A025C2.2070204@ichips.intel.com> References: <44A025C2.2070204@ichips.intel.com> Message-ID: <20060626183711.GA20281@mellanox.co.il> Sean, thanks for comments. Quoting r. Sean Hefty : > It may makes sense for > ib_destroy_cm_id() to call the new routine, but I'm not sure about that yet. Maybe add a new routine getting a response flag, and use that from ib_destroy_cm_id? -- MST From mst at mellanox.co.il Mon Jun 26 11:28:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 21:28:30 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A026AB.8090607@ichips.intel.com> References: <44A026AB.8090607@ichips.intel.com> Message-ID: <20060626182830.GD19929@mellanox.co.il> Quoting r. Sean Hefty : > > How about the cma changes required by ucma to get/set options? I think they > > are not upstream yet. Could these go upstream, to make building ucma > > out-of-kernel possible, without kernel patches? > > Wouldn't you have to patch the kernel to include the kernel ucma anyway? I would? Why can't it be compiled as an out of kernel module?
-- MST From mshefty at ichips.intel.com Mon Jun 26 12:04:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 12:04:45 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060626182830.GD19929@mellanox.co.il> References: <44A026AB.8090607@ichips.intel.com> <20060626182830.GD19929@mellanox.co.il> Message-ID: <44A02FCD.8030404@ichips.intel.com> Michael S. Tsirkin wrote: >>Wouldn't you have to patch the kernel to include the kernel ucma anyway? > > > I would? Why can't it be compiled as an out of kernel module? I understand you now. UD QP and multicast support were also recently added. I don't think that we want to risk pushing them upstream for 2.6.18 as well, since it requires adding the ib_multicast module. - Sean From mshefty at ichips.intel.com Mon Jun 26 12:13:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 12:13:34 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060626183711.GA20281@mellanox.co.il> References: <44A025C2.2070204@ichips.intel.com> <20060626183711.GA20281@mellanox.co.il> Message-ID: <44A031DE.6030905@ichips.intel.com> Michael S. Tsirkin wrote: >>It may makes sense for >>ib_destroy_cm_id() to call the new routine, but I'm not sure about that yet. > > > Maybe add a new routine getting a response flag, and use that from > ib_destroy_cm_id? I'm not following what you mean here. Originally, I was suggesting taking the bottom portion of ib_destroy_cm_id() and making it the "destroy no response" call. But after thinking about it more, I don't believe that the cleanup is that easy. We still need to check and modify the cm_id state to ensure that newly received MADs are handled correctly, plus remove the cm_id from any trees used to track the connection.
- Sean From swise at opengridcomputing.com Mon Jun 26 12:19:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 Jun 2006 14:19:26 -0500 Subject: [openib-general] Interface for getting RNIC's IP address In-Reply-To: References: Message-ID: <1151349566.2398.59.camel@stevo-desktop> On Mon, 2006-06-26 at 14:22 -0400, Sundeep Narravula wrote: > Is there any s/w interface to obtain the local RNIC's IP address? > > The current rdma cm examples, rping and cmatose, require the user to enter > the ip address as a command line parameter. I am currently looking for a > way to get this programatically. > You can use 0.0.0.0 which will allow you to listen across all rdma devices. Otherwise, you have to "know" which ip address is bound to the device you wish to listen on. > Thanks, > --Sundeep. > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Mon Jun 26 12:24:08 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 22:24:08 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A02FCD.8030404@ichips.intel.com> References: <44A02FCD.8030404@ichips.intel.com> Message-ID: <20060626192408.GA20568@mellanox.co.il> Quoting r. Sean Hefty : > UD QP and multicast support were also recently added. These options are slightly different however - kernel ULPs I think will also want to set the number of retries/timeout (SDP needs it). So you can look it as a kind of fix, not a new feature. And, the change is I think smaller. No? > I don't think that we want to risk pushing them upstream for 2.6.18 as well, > since it requires adding the ib_multicast module. Yes, we still see crashes with the new ib_multicast. 
-- MST From caitlinb at broadcom.com Mon Jun 26 12:39:30 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 26 Jun 2006 12:39:30 -0700 Subject: [openib-general] Interface for getting RNIC's IP address Message-ID: <54AD0F12E08D1541B826BE97C98F99F15F58B7@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Mon, 2006-06-26 at 14:22 -0400, Sundeep Narravula wrote: >> Is there any s/w interface to obtain the local RNIC's IP address? >> >> The current rdma cm examples, rping and cmatose, require the user to >> enter the ip address as a command line parameter. I am currently >> looking for a way to get this programatically. >> > > You can use 0.0.0.0 which will allow you to listen across all > rdma devices. Otherwise, you have to "know" which ip address > is bound to the device you wish to listen on. > > > True, but if you know what netdevice the rdma device you want to listen on is associated with then you can work from the IP Address to the netdevice. From mst at mellanox.co.il Mon Jun 26 12:40:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 22:40:10 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <44A031DE.6030905@ichips.intel.com> References: <44A031DE.6030905@ichips.intel.com> Message-ID: <20060626194010.GB20568@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] RFC: CMA backlog (was Re: CMA backlog) > > Michael S. Tsirkin wrote: > >>It may makes sense for > >>ib_destroy_cm_id() to call the new routine, but I'm not sure about that yet. > > > > > > Maybe add a new routine getting a response flag, and use that from > > ib_destroy_cm_id? > > I'm not following what you mean here. 
I'm just saying that we can use exactly the code in ib_destroy_cm_id, but avoid calling ib_send_cm_rej in this one case: case IB_CM_REQ_RCVD: case IB_CM_MRA_REQ_SENT: case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: + if (noresponse) + cm_reset_to_idle(cm_id_priv); spin_unlock_irqrestore(&cm_id_priv->lock, flags); + if (noresponse) ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); So we get all the handling for free, just avoid sending out the MAD. -- MST From mshefty at ichips.intel.com Mon Jun 26 13:17:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 13:17:07 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060626192408.GA20568@mellanox.co.il> References: <44A02FCD.8030404@ichips.intel.com> <20060626192408.GA20568@mellanox.co.il> Message-ID: <44A040C3.1060600@ichips.intel.com> Michael S. Tsirkin wrote: >>UD QP and multicast support were also recently added. > > These options are slightly different however - kernel ULPs I think will also > want to set the number of retries/timeout (SDP needs it). So you can look it as > a kind of fix, not a new feature. And, the change is I think smaller. > > No? I agree that they're different. I was merely pointing out that the ucma has those changes too. You would need a special out of kernel version of the ucma and compatible librdmacm to talk with whatever kernel cma is upstream. - Sean From mshefty at ichips.intel.com Mon Jun 26 13:19:38 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 13:19:38 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060626194010.GB20568@mellanox.co.il> References: <44A031DE.6030905@ichips.intel.com> <20060626194010.GB20568@mellanox.co.il> Message-ID: <44A0415A.9000105@ichips.intel.com> Michael S. Tsirkin wrote: > I'm just saying that we can use exactly the code in ib_destroy_cm_id, but > avoid calling ib_send_cm_rej in this one case: Ah... 
yes, something like that should work. - Sean From pw at osc.edu Mon Jun 26 14:53:19 2006 From: pw at osc.edu (Pete Wyckoff) Date: Mon, 26 Jun 2006 17:53:19 -0400 Subject: [openib-general] max_send_sge < max_sge Message-ID: <20060626215319.GA9291@osc.edu> Using stock 2.6.17.1, with verbs 1.0.3-1.fc4 and mthca 1.0.2-1.fc4 with MT25204, this line: ret = ibv_query_device(ctx, &hca_cap); tells me that hca_cap.max_sge = 30. However, this code fails, with the last kernel write returning EINVAL: memset(&att, 0, sizeof(att)); att.send_cq = 1024; att.recv_cq = 1024; att.cap.max_recv_wr = 512; att.cap.max_send_wr = 512; att.cap.max_recv_sge = 30; att.cap.max_send_sge = 30; att.qp_type = IBV_QPT_RC; qp = ibv_create_qp(pd, &att); But if I set: att.cap.max_recv_sge = 30; att.cap.max_send_sge = 29; /* hca_cap.max_sge - 1 */ the QP create succeeds. Is this a known issue? Should I always subtract 1 from the reported max on the send side? Just for this hardware? -- Pete From rjwalsh at pathscale.com Mon Jun 26 15:53:07 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 26 Jun 2006 15:53:07 -0700 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060626215319.GA9291@osc.edu> References: <20060626215319.GA9291@osc.edu> Message-ID: <1151362387.20061.0.camel@hematite.internal.keyresearch.com> On Mon, 2006-06-26 at 17:53 -0400, Pete Wyckoff wrote: > Using stock 2.6.17.1, with verbs 1.0.3-1.fc4 and mthca 1.0.2-1.fc4 > with MT25204, this line: > > ret = ibv_query_device(ctx, &hca_cap); > > tells me that hca_cap.max_sge = 30. 
> > However, this code fails, with the last kernel write returning EINVAL: > > memset(&att, 0, sizeof(att)); > att.send_cq = 1024; > att.recv_cq = 1024; > att.cap.max_recv_wr = 512; > att.cap.max_send_wr = 512; > att.cap.max_recv_sge = 30; > att.cap.max_send_sge = 30; > att.qp_type = IBV_QPT_RC; > qp = ibv_create_qp(pd, &att); > > But if I set: > > att.cap.max_recv_sge = 30; > att.cap.max_send_sge = 29; /* hca_cap.max_sge - 1 */ > > the QP create succeeds. > > Is this a known issue? Should I always subtract 1 from the reported > max on the send side? Just for this hardware? Probably something else has a QP allocated? Like the SMA, maybe? -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rjwalsh at pathscale.com Mon Jun 26 15:53:48 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 26 Jun 2006 15:53:48 -0700 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <1151362387.20061.0.camel@hematite.internal.keyresearch.com> References: <20060626215319.GA9291@osc.edu> <1151362387.20061.0.camel@hematite.internal.keyresearch.com> Message-ID: <1151362428.20061.2.camel@hematite.internal.keyresearch.com> On Mon, 2006-06-26 at 15:53 -0700, Robert Walsh wrote: > On Mon, 2006-06-26 at 17:53 -0400, Pete Wyckoff wrote: > > Using stock 2.6.17.1, with verbs 1.0.3-1.fc4 and mthca 1.0.2-1.fc4 > > with MT25204, this line: > > > > ret = ibv_query_device(ctx, &hca_cap); > > > > tells me that hca_cap.max_sge = 30. 
> > > > However, this code fails, with the last kernel write returning EINVAL: > > > > memset(&att, 0, sizeof(att)); > > att.send_cq = 1024; > > att.recv_cq = 1024; > > att.cap.max_recv_wr = 512; > > att.cap.max_send_wr = 512; > > att.cap.max_recv_sge = 30; > > att.cap.max_send_sge = 30; > > att.qp_type = IBV_QPT_RC; > > qp = ibv_create_qp(pd, &att); > > > > But if I set: > > > > att.cap.max_recv_sge = 30; > > att.cap.max_send_sge = 29; /* hca_cap.max_sge - 1 */ > > > > the QP create succeeds. > > > > Is this a known issue? Should I always subtract 1 from the reported > > max on the send side? Just for this hardware? > > Probably something else has a QP allocated? Like the SMA, maybe? Doh - never mind. SGE's, not QPs. Wasn't paying attention :-) -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Mon Jun 26 17:03:14 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 17:03:14 -0700 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) In-Reply-To: <200606261051.12515.jackm@mellanox.co.il> References: <200606261051.12515.jackm@mellanox.co.il> Message-ID: <44A075C2.6060409@ichips.intel.com> Jack Morgenstein wrote: > The following Oops occurred upon unloading the openib driver. I unloaded the
I unloaded the > driver immediately following a reboot (the driver had been loaded during the > boot sequence). I did NOT run opensm before unloading the driver. > > Evidently, ipoib was still attempting to connect with an SA, when the ipoib > module was unloaded (modprobe -r). After the ipoib module was unloaded (or at > least rendered inaccessible), the ib_sa module attempted to invoke > "ib_sa_mcmember_rec_callback" (for a callback address that was part of the > unloaded ipoib module). Hence, the Oops below. > > The "modprobe" process in the trace below is "modprobe -r ib_sa" (After > unloading ib_ipoib, we attempt to unload ib_sa). Following the Oops, I've > included info on the running environment. Thanks for the additional information. I've been trying to reproduce this, but haven't been able to yet. I did notice that there's a several second delay when calling modprobe -r ip_iboib, but only if I've tried to configure ib0 first. (No SM was running.) I am confused on one area. After executing modprobe -r ib_ipoib, what kept ib_sa loaded? (Why was modprobe -r ib_sa necessary?) I would have expected it to be unloaded at the same time. - Sean From mst at mellanox.co.il Mon Jun 26 23:42:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 09:42:34 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060626215319.GA9291@osc.edu> References: <20060626215319.GA9291@osc.edu> Message-ID: <20060627064234.GG19300@mellanox.co.il> Quoting r. Pete Wyckoff : > Subject: max_send_sge < max_sge > > Using stock 2.6.17.1, with verbs 1.0.3-1.fc4 and mthca 1.0.2-1.fc4 > with MT25204, this line: > > ret = ibv_query_device(ctx, &hca_cap); > > tells me that hca_cap.max_sge = 30. 
> > However, this code fails, with the last kernel write returning EINVAL: > > memset(&att, 0, sizeof(att)); > att.send_cq = 1024; > att.recv_cq = 1024; > att.cap.max_recv_wr = 512; > att.cap.max_send_wr = 512; > att.cap.max_recv_sge = 30; > att.cap.max_send_sge = 30; > att.qp_type = IBV_QPT_RC; > qp = ibv_create_qp(pd, &att); Some Mellanox HCAs support different max sge values for send queue versus receive queue, or for different QP types. ibv_query_device returns the maximum value hardware can support. > Is this a known issue? Yes. The fact that ibv_query_device returns some value in hca_cap can not guarantee that ibv_create_qp with these parameters will succeed. For example, system administrator might have imposed a limit on the amount of memory you can pin down, and you will get ENOMEM. > Should I always subtract 1 from the reported max on the send side? Just for > this hardware? Unless you use it, passing the absolute maximum value supported by hardware does not seem, to me, to make sense - it will just slow you down, and waste resources. Is there a protocol out there that actually has a use for 30 sge? In my opinion, for the application to be robust it has to either use small values that empirically work on most systems, or be able to scale down to require less resources if an allocation fails. -- MST From mst at mellanox.co.il Mon Jun 26 23:49:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 09:49:01 +0300 Subject: [openib-general] [git pull] please pull infiniband.git In-Reply-To: References: Message-ID: <20060627064901.GH19300@mellanox.co.il> Quoting r. Roland Dreier : > Subject: [git pull] please pull infiniband.git > > Linus, please pull from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This has a couple of mthca driver bug fixes: > > Michael S. 
Tsirkin: > IB/mthca: restore missing PCI registers after reset > IB/mthca: memfree completion with error FW bug workaround > > drivers/infiniband/hw/mthca/mthca_cq.c | 11 +++++ > drivers/infiniband/hw/mthca/mthca_reset.c | 59 +++++++++++++++++++++++++++++ > 2 files changed, 69 insertions(+), 1 deletions(-) These two patches didn't seem to make it to 2.6.17, did they? Is there support for their inclusion in -stable? These both actually fix stability issues for our customers. -- MST From eitan at mellanox.co.il Mon Jun 26 23:48:37 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 27 Jun 2006 09:48:37 +0300 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID In-Reply-To: <20060626153500.18078.85785.stgit@sashak.voltaire.com> References: <20060626153500.18078.85785.stgit@sashak.voltaire.com> Message-ID: <44A0D4C5.4060705@mellanox.co.il> Hi Sasha Can you provide a little more info on the cause and impact of the issue you are solving with this patch? How is it related to work on the thread: "mad: add GID/class checking for matching received to sent MADs"? Thanks Sasha Khapyorsky wrote: > Match MAD TransactionID on receiving. This prevents request/response MADs > mixing - reproducible when poll() (in libibumad) returns timeout. > From bpradip at in.ibm.com Mon Jun 26 23:56:25 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 12:26:25 +0530 Subject: [openib-general] [PATCH 0/2] perftest: Modified perftest utils to work with new stack and libraries In-Reply-To: <20060626151506.GA14684@esmail.cup.hp.com> References: <20060626102410.GA17835@harry-potter.ibm.com> <20060626151506.GA14684@esmail.cup.hp.com> Message-ID: <44A0D699.1030808@in.ibm.com> Grant Grundler wrote: > On Mon, Jun 26, 2006 at 03:54:19PM +0530, Pradipta Kumar Banerjee wrote: >> modified perftest utilities to work with the latest stack and libraries. >> This patchset consists changes for rdma_lat and rdma_bw only. 
>> >> 1 - rdma_lat.c changes >> 2 - rdma_bw.c changes > > Pradipta, > thanks for posting the patches...but could you do us a favor > and provide a useful changelog entry? > > We can see it's a patch and which files the patch modifies. > The changelog should summarize "what problem does this patch fix?". > Grant, Will repost the patches again with the changelog. Thanks, Pradipta > thanks again, > grant > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Tue Jun 27 04:15:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 07:15:05 -0400 Subject: [openib-general] [PATCH] OpenSM/SA: Eliminate some no longer needed code Message-ID: <1151406904.4482.179805.camel@hal.voltaire.com> OpenSM/SA: Eliminate some no longer needed code No longer a need to check whether the LID is beyond the vector table size. In fact, this turns an edge case into an error (when LMC > 0 and a non base LID is requested which is above the last base LID but within that port's LID range). In any case, osm_get_port_by_base_lid uses cl_ptr_vector_get_at which does this check at the proper time. 
Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_pkey_record.c =================================================================== --- opensm/osm_sa_pkey_record.c (revision 8236) +++ opensm/osm_sa_pkey_record.c (working copy) @@ -419,25 +419,14 @@ osm_pkey_rec_rcv_process( CL_ASSERT( cl_ptr_vector_get_size(p_tbl) < 0x10000 ); - if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) { - status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); - if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) - { - status = IB_NOT_FOUND; - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: ERR 460B: " - "No port found with LID 0x%x\n", - cl_ntoh16(p_rcvd_rec->lid) ); - } - } - else - { /* LID out of range */ status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: ERR 4609: " - "Given LID (0x%X) is out of range:0x%X\n", - cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + "osm_pkey_rec_rcv_process: ERR 460B: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); } } Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 8236) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -677,25 +677,14 @@ osm_pir_rcv_process( */ if( comp_mask & IB_PIR_COMPMASK_LID ) { - if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) - { - status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); - if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) - { - status = IB_NOT_FOUND; - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: ERR 2109: " - "No port found with LID 0x%x\n", - cl_ntoh16(p_rcvd_rec->lid) ); - } - } - else + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); 
+ if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) { status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: ERR 2101: " - "Given LID (0x%X) is out of range:0x%X\n", - cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + "osm_pir_rcv_process: ERR 2109: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); } } else Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 8236) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -387,25 +387,14 @@ osm_slvl_rec_rcv_process( CL_ASSERT( cl_ptr_vector_get_size(p_tbl) < 0x10000 ); - if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) { - status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); - if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) - { - status = IB_NOT_FOUND; - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: ERR 2608: " - "No port found with LID 0x%x\n", - cl_ntoh16(p_rcvd_rec->lid) ); - } - } - else - { /* LID out of range */ status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: ERR 2601: " - "Given LID (0x%X) is out of range:0x%X\n", - cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); + "osm_slvl_rec_rcv_process: ERR 2608: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); } } Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 8236) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -407,25 +407,14 @@ osm_vlarb_rec_rcv_process( CL_ASSERT( cl_ptr_vector_get_size(p_tbl) < 0x10000 ); - if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) + status = osm_get_port_by_base_lid( p_rcv->p_subn, 
p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) { - status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); - if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) - { - status = IB_NOT_FOUND; - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: ERR 2A09: " - "No port found with LID 0x%x\n", - cl_ntoh16( p_rcvd_rec->lid ) ); - } - } - else - { /* LID out of range */ status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: ERR 2A01: " - "Given LID (0x%X) is out of range:0x%X\n", - cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + "osm_vlarb_rec_rcv_process: ERR 2A09: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); } } From rkuchimanchi at silverstorm.com Tue Jun 27 05:16:36 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Tue, 27 Jun 2006 17:46:36 +0530 Subject: [openib-general] Local QP operation error Message-ID: <44A121A4.8090509@silverstorm.com> In a kernel module, on polling the CQ, I am getting a local QP operation error (IB_WC_LOC_QP_OP_ERR). Work request posted was of type IB_WR_SEND and the QP was moved to IB_QPS_RTS state before posting the send work request. The IB specification says that this error indicates an internal QP consistency error. What are the possible reasons for this and is there any way I can pinpoint the inconsistency? I would appreciate any hints to resolve this error. Regards, Ram From tziporet at mellanox.co.il Tue Jun 27 05:39:41 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 27 Jun 2006 15:39:41 +0300 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) In-Reply-To: <44A075C2.6060409@ichips.intel.com> References: <200606261051.12515.jackm@mellanox.co.il> <44A075C2.6060409@ichips.intel.com> Message-ID: <44A1270D.2070109@mellanox.co.il> Sean Hefty wrote: > Thanks for the additional information.
I've been trying to reproduce this, but > haven't been able to yet. I did notice that there's a several second delay when > calling modprobe -r ib_ipoib, but only if I've tried to configure ib0 first. > (No SM was running.) > > I am confused on one area. After executing modprobe -r ib_ipoib, what kept > ib_sa loaded? (Why was modprobe -r ib_sa necessary?) I would have expected it > to be unloaded at the same time. > > - Sean > > Hi Sean, Resolving this issue is critical for us since it prevents us from any usage of the new multicast module. An easy way to reproduce it is to use the OFED "openibd" script. Just run "openibd start" and then "openibd stop" and you will see the problem. This script is available within the OFED release. Thanks, Tziporet From mst at mellanox.co.il Tue Jun 27 05:45:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 15:45:05 +0300 Subject: [openib-general] Local QP operation error In-Reply-To: <44A121A4.8090509@silverstorm.com> References: <44A121A4.8090509@silverstorm.com> Message-ID: <20060627124505.GL19300@mellanox.co.il> Quoting r. Ramachandra K : > Subject: Local QP operation error > > In a kernel module, on polling the CQ, I am getting a local QP > operation error (IB_WC_LOC_QP_OP_ERR). Work request > posted was of type IB_WR_SEND and the QP was moved to > IB_QPS_RTS state before posting the send work request. > > The IB specification says that this error indicates an internal QP consistency > error. What are the possible reasons for this and is there any way I can > pinpoint the inconsistency? This normally indicates some kind of driver bug, or memory corruption. What is the value of the vendor_err field?
-- MST From Thomas.Talpey at netapp.com Tue Jun 27 06:06:17 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 Jun 2006 09:06:17 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627064234.GG19300@mellanox.co.il> References: <20060626215319.GA9291@osc.edu> <20060627064234.GG19300@mellanox.co.il> Message-ID: <7.0.1.0.2.20060627090204.04471ba0@netapp.com> At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote: >Unless you use it, passing the absolute maximum value supported by >hardware does >not seem, to me, to make sense - it will just slow you down, and waste >resources. Is there a protocol out there that actually has a use for 30 sge? It's not a protocol thing, it's a memory registration thing. But I agree, that's a huge number of segments for send and receive. 2-4 is more typical. I'd be interested to know what wants 30 as well... Tom. From tziporet at mellanox.co.il Tue Jun 27 06:04:42 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 27 Jun 2006 16:04:42 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060626174117.GA19929@mellanox.co.il> References: <44A00EEF.702@ichips.intel.com> <20060626174117.GA19929@mellanox.co.il> Message-ID: <44A12CEA.6010204@mellanox.co.il> > > How about the cma changes required by ucma to get/set options? I think they are > not upstream yet. Could these go upstream, to make building ucma out-of-kernel > possible, without kernel patches? > > Hi Sean, These features are needed for uDAPL and were requested by Woody and Arlin for Intel MPI scalability. Since in OFED 1.1 we are going to take CMA from kernel 2.6.18, we need them upstream. Can you drive these enhancements only to 2.6.18?
Thanks, Tziporet From rkuchimanchi at silverstorm.com Tue Jun 27 06:21:19 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Tue, 27 Jun 2006 18:51:19 +0530 Subject: [openib-general] Local QP operation error In-Reply-To: <20060627124505.GL19300@mellanox.co.il> References: <44A121A4.8090509@silverstorm.com> <20060627124505.GL19300@mellanox.co.il> Message-ID: <44A130CF.2060908@silverstorm.com> Michael S. Tsirkin wrote: >>The IB specifcation says that this error indicates an internal QP consistency >>error. What are the possible reasons for this and is there any way I can pin >>point the inconsistency ? >> >> > >This normally indicates some kind of driver bug, or memory corruption. >What is the value of the vendor_err field? > > > The vendor_err field value is 115 (0x73). Just to clarify, I am writing the kernel module that is getting the local QP operation error. I guess I am missing something in my code that is causing the error. But I am unable to pinpoint the cause of the error. Does this error point to some issue with the DMA address specified in the work request SGE ? Regards, Ram -------------- next part -------------- An HTML attachment was scrubbed... URL: From Thomas.Talpey at netapp.com Tue Jun 27 06:29:53 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 Jun 2006 09:29:53 -0400 Subject: [openib-general] Local QP operation error In-Reply-To: <44A130CF.2060908@silverstorm.com> References: <44A121A4.8090509@silverstorm.com> <20060627124505.GL19300@mellanox.co.il> <44A130CF.2060908@silverstorm.com> Message-ID: <7.0.1.0.2.20060627092733.04471ce8@netapp.com> At 09:21 AM 6/27/2006, Ramachandra K wrote: >Does this error point to some issue with the DMA address specified >in the work request SGE ? Ding Ding Ding Ding! :-) We recently identified the exact issue in the NFS/RDMA server, which happened only when running on ia64. If you're not using the dma_map_* api, that's maybe something to look at. ;-) Tom. 
From mst at mellanox.co.il Tue Jun 27 06:31:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 16:31:19 +0300 Subject: [openib-general] Local QP operation error In-Reply-To: <44A130CF.2060908@silverstorm.com> References: <44A130CF.2060908@silverstorm.com> Message-ID: <20060627133119.GO19300@mellanox.co.il> Quoting r. Ramachandra K : > Just to clarify, I am writing the kernel module that is getting the local > QP operation error. I guess I am missing something in my code that > is causing the error. But I am unable to pinpoint the cause of the error. > > Does this error point to some issue with the DMA address specified > in the work request SGE ? Yes, it seems hardware could not read (gather) data when executing the work request SGE. -- MST From mshefty at ichips.intel.com Tue Jun 27 08:45:52 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 08:45:52 -0700 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) In-Reply-To: <44A1270D.2070109@mellanox.co.il> References: <200606261051.12515.jackm@mellanox.co.il> <44A075C2.6060409@ichips.intel.com> <44A1270D.2070109@mellanox.co.il> Message-ID: <44A152B0.3000007@ichips.intel.com> Tziporet Koren wrote: > Resolving this issue is critical for us since it prevent us from any > usage of the new multicsat module. > An easy way to reproduce it is to use the OFED "openibd" script. Just > run "openibd start" and than "openibd stop" and you will see the > problem. This script is available within OFED release. I am working on trying to resolve this as my top priority at the moment, but I have not been able to reproduce this on my systems. I want to understand why ib_sa was not unloaded as part of modprobe -r ib_ipoib, but why ib_multicast apparently was. I will examine the script that you mentioned, but I typically do not run the OFED release. 
- Sean From sashak at voltaire.com Tue Jun 27 10:07:20 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 Jun 2006 20:07:20 +0300 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID In-Reply-To: <44A0D4C5.4060705@mellanox.co.il> References: <20060626153500.18078.85785.stgit@sashak.voltaire.com> <44A0D4C5.4060705@mellanox.co.il> Message-ID: <20060627170720.GO16738@sashak.voltaire.com> Hi Eitan, On 09:48 Tue 27 Jun , Eitan Zahavi wrote: > Hi Sasha > > Can you provide a little more info on the cause and impact of the > issue you are solving with this patch? umad_recv() uses poll(); when it times out, umad_recv() returns an error and _do_madrpc() returns with an error too. The next _do_madrpc() session will get the previous response MAD. And so on. > > How is it related to work on the thread: > "mad: add GID/class checking for matching received to sent MADs"? It is not related. Sasha > > Thanks > > Sasha Khapyorsky wrote: > >Match MAD TransactionID on receiving. This prevents request/response MADs > >mixing - reproducible when poll() (in libibumad) returns timeout.
> > From halr at voltaire.com Tue Jun 27 10:25:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 13:25:31 -0400 Subject: [openib-general] [PATCH][TRIVIAL] OpenSM/osm_pkey_mgr.c: In pkey_mgr_get_physp_max_blocks, use routine rather than accessing structure member directly Message-ID: <1151429130.4482.194685.camel@hal.voltaire.com> OpenSM/osm_pkey_mgr.c: In pkey_mgr_get_physp_max_blocks, use routine rather than accessing structure member directly Signed-off-by: Hal Rosenstock Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8220) +++ opensm/osm_pkey_mgr.c (working copy) @@ -81,7 +81,7 @@ pkey_mgr_get_physp_max_blocks( num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); else { - p_sw = osm_get_switch_by_guid( p_subn, p_node->node_info.node_guid ); + p_sw = osm_get_switch_by_guid( p_subn, osm_node_get_node_guid( p_node ) ); if (p_sw) num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); } From halr at voltaire.com Tue Jun 27 10:29:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 13:29:10 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_switch_port, better BSP0 handling Message-ID: <1151429320.4482.194834.camel@hal.voltaire.com> OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_switch_port, better BSP0 handling In __osm_pi_rcv_process_switch_port, if base switch port 0, then copy the received PortInfo attribute into the physp structure regardless of the port state. On BSP0, the port state is not used so this protects against an SMA which set this to LINK_DOWN. This makes the code for BSP0 more similar to how it originally was at the cost of an extra copy of the PortInfo attribute. 
Signed-off-by: Hal Rosenstock Index: opensm/osm_port_info_rcv.c =================================================================== --- opensm/osm_port_info_rcv.c (revision 8252) +++ opensm/osm_port_info_rcv.c (working copy) @@ -239,6 +239,8 @@ __osm_pi_rcv_process_switch_port( uint8_t port_num; uint8_t remote_port_num; osm_dr_path_t path; + osm_switch_t *p_sw; + ib_switch_info_t *p_si; OSM_LOG_ENTER( p_rcv->p_log, __osm_pi_rcv_process_switch_port ); @@ -350,6 +352,15 @@ __osm_pi_rcv_process_switch_port( "__osm_pi_rcv_process_switch_port: ERR 0F04: " "Invalid base LID 0x%x corrected\n", cl_ntoh16( orig_lid ) ); + /* Determine if base switch port 0 */ + p_sw = osm_get_switch_by_guid(p_rcv->p_subn, + osm_node_get_node_guid( p_node )); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + !ib_switch_info_is_enhanced_port0(p_si)) + { + /* PortState is not used on BSP0 but just in case it is DOWN */ + p_physp->port_info = *p_pi; + } __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi); } From halr at voltaire.com Tue Jun 27 10:32:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 13:32:23 -0400 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID In-Reply-To: <20060627170720.GO16738@sashak.voltaire.com> References: <20060626153500.18078.85785.stgit@sashak.voltaire.com> <44A0D4C5.4060705@mellanox.co.il> <20060627170720.GO16738@sashak.voltaire.com> Message-ID: <1151429349.4482.194836.camel@hal.voltaire.com> On Tue, 2006-06-27 at 13:07, Sasha Khapyorsky wrote: > Hi Eitan, > > On 09:48 Tue 27 Jun , Eitan Behave wrote: > > Hi Sasha > > > > Can you provide a little more info on the cause and impact of the > > issue you are solving with this patch? > > umad_recv() uses poll(), when it is timeouted umad_recv() returns error > and _do_madrpc() returns with error too. The next _do_madrpc() session > will got the previous response MAD. And so on. 
One more note to add to this: This only affects the OpenIB diagnostics and not OpenSM, as the latter does not use this library; it uses umad directly, not via rpc. -- Hal > > How is it related to work on the thread: > > "mad: add GID/class checking for matching received to sent MADs"? > > It is not related. > > Sasha > > > > > > Thanks > > > > > > Sasha Khapyorsky wrote: > > >Match MAD TransactionID on receiving. This prevents request/response MADs > > >mixing - reproducible when poll() (in libibumad) returns timeout. > > > From bpradip at in.ibm.com Tue Jun 27 10:51:49 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 23:21:49 +0530 Subject: [openib-general] [IWARP BRANCH] [PATCH 0/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries Message-ID: <20060627175141.GA9249@harry-potter.ibm.com> The present rdma_lat and rdma_bw utilizing the RDMA CM are broken and don't work with the latest libraries. The present code breaks because of using the old signature for the function rdma_get_cm_event. old function signature - int rdma_get_cm_event(struct rdma_cm_event **event) new function signature - int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event) This patchset consists of changes for rdma_lat, rdma_bw and Makefile. 1 - rdma_lat.c changes 2 - rdma_bw.c changes 3 - Makefile changes Signed-off-by: Pradipta Kumar Banerjee --- Thanks, Pradipta Kumar. From bpradip at in.ibm.com Tue Jun 27 10:56:26 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 23:26:26 +0530 Subject: [openib-general] [IWARP BRANCH] [PATCH 1/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries Message-ID: <20060627175624.GB9249@harry-potter.ibm.com> This patch fixes the broken rdma_lat by using the correct function signature for rdma_get_cm_event.
old function signature - int rdma_get_cm_event(struct rdma_cm_event **event) new function signature - int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event) Signed-off-by: Pradipta Kumar Banerjee --- Index: rdma_lat.c ============================================================================= --- ../perftest-org/rdma_lat.c 2006-06-22 18:28:13.000000000 +0530 +++ rdma_lat.c 2006-06-22 18:36:12.000000000 +0530 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -83,6 +84,7 @@ struct pingpong_context { struct ibv_sge list; struct ibv_send_wr wr; struct rdma_cm_id *cm_id; + struct rdma_event_channel *cm_channel; }; struct pingpong_dest { @@ -612,11 +614,12 @@ static void pp_close_cma(struct pingpong } } - rdma_get_cm_event(&event); + rdma_get_cm_event(ctx->cm_channel, &event); if (event->event != RDMA_CM_EVENT_DISCONNECTED) printf("unexpected event during disconnect %d\n", event->event); rdma_ack_cm_event(event); rdma_destroy_id(ctx->cm_id); + rdma_destroy_event_channel(ctx->cm_channel); } static struct pingpong_context *pp_server_connect_cma(unsigned short port, int size, int tx_depth, @@ -629,17 +632,26 @@ static struct pingpong_context *pp_serve int ret; struct sockaddr_in sin; struct rdma_cm_id *child_cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; - + printf("%s starting server\n", __FUNCTION__); - ret = rdma_create_id(&listen_id, NULL); - if (ret) { - fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); return NULL; } + ret = rdma_create_id(channel, &listen_id, NULL); + if (ret) { + fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); + goto err3; + } + memset(&sin, 0, sizeof(sin)); sin.sin_addr.s_addr = 0; - sin.sin_family = PF_INET; + sin.sin_family = AF_INET; sin.sin_port = 
htons(port); ret = rdma_bind_addr(listen_id, (struct sockaddr *)&sin); if (ret) { @@ -653,7 +665,7 @@ static struct pingpong_context *pp_serve goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -678,7 +690,8 @@ static struct pingpong_context *pp_serve fprintf(stderr,"%s pp_init_cma_ctx failed\n", __FUNCTION__); goto err0; } - + + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xbb; my_dest->rkey = ctx->mr->rkey; @@ -694,7 +707,7 @@ static struct pingpong_context *pp_serve goto err0; } rdma_ack_cm_event(event); - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) { fprintf(stderr,"rdma_get_cm_event error %d\n", ret); rdma_destroy_id(child_cm_id); @@ -713,8 +726,10 @@ err0: err1: rdma_ack_cm_event(event); err2: - rdma_destroy_id(listen_id); fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); + rdma_destroy_id(listen_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } @@ -750,6 +765,7 @@ static struct pingpong_context *pp_clien int ret; struct sockaddr_in sin; struct rdma_cm_id *cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; fprintf(stderr,"%s starting client\n", __FUNCTION__); @@ -758,10 +774,18 @@ static struct pingpong_context *pp_clien return NULL; } - ret = rdma_create_id(&cm_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &cm_id, NULL); if (ret) { fprintf(stderr,"%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_family = PF_INET; @@ -772,7 +796,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -789,7 +813,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = 
rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -806,7 +830,8 @@ static struct pingpong_context *pp_clien fprintf(stderr,"%s pp_init_cma_ctx failed\n", __FUNCTION__); goto err2; } - + + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xaa; my_dest->rkey = ctx->mr->rkey; @@ -823,7 +848,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -845,8 +870,10 @@ static struct pingpong_context *pp_clien err1: rdma_ack_cm_event(event); err2: - fprintf(stderr,"NOT connected!\n"); + fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); rdma_destroy_id(cm_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } From bpradip at in.ibm.com Tue Jun 27 10:58:28 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 23:28:28 +0530 Subject: [openib-general] [IWARP BRANCH] [PATCH 2/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries Message-ID: <20060627175826.GC9249@harry-potter.ibm.com> This patch fixes the broken rdma_bw by using the correct function signature for rdma_get_cm_event. 
old function signature - int rdma_get_cm_event(struct rdma_cm_event **event) new function signature - int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event) Signed-off-by: Pradipta Kumar Banerjee --- Index: rdma_bw.c ============================================================================= --- ../perftest-org/rdma_bw.c 2006-06-22 18:28:13.000000000 +0530 +++ rdma_bw.c 2006-06-22 18:40:01.000000000 +0530 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -75,6 +76,7 @@ struct pingpong_context { struct ibv_sge list; struct ibv_send_wr wr; struct rdma_cm_id *cm_id; + struct rdma_event_channel *cm_channel; }; struct pingpong_dest { @@ -545,11 +547,12 @@ static void pp_close_cma(struct pingpong } } - rdma_get_cm_event(&event); + rdma_get_cm_event(ctx->cm_channel, &event); if (event->event != RDMA_CM_EVENT_DISCONNECTED) printf("unexpected event during disconnect %d\n", event->event); rdma_ack_cm_event(event); rdma_destroy_id(ctx->cm_id); + rdma_destroy_event_channel(ctx->cm_channel); } static struct pingpong_context *pp_server_connect_cma(unsigned short port, int size, int tx_depth, @@ -562,13 +565,22 @@ static struct pingpong_context *pp_serve int ret; struct sockaddr_in sin; struct rdma_cm_id *child_cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; printf("%s starting server\n", __FUNCTION__); - ret = rdma_create_id(&listen_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &listen_id, NULL); if (ret) { fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_addr.s_addr = 0; @@ -586,7 +598,7 @@ static struct pingpong_context *pp_serve goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -612,6 
+624,7 @@ static struct pingpong_context *pp_serve goto err0; } + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xbb; my_dest->rkey = ctx->mr->rkey; @@ -627,7 +640,7 @@ static struct pingpong_context *pp_serve goto err0; } rdma_ack_cm_event(event); - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) { fprintf(stderr,"rdma_get_cm_event error %d\n", ret); rdma_destroy_id(child_cm_id); @@ -646,8 +659,10 @@ err0: err1: rdma_ack_cm_event(event); err2: - rdma_destroy_id(listen_id); fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); + rdma_destroy_id(listen_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } @@ -683,6 +698,7 @@ static struct pingpong_context *pp_clien int ret; struct sockaddr_in sin; struct rdma_cm_id *cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; fprintf(stderr,"%s starting client\n", __FUNCTION__); @@ -691,10 +707,18 @@ static struct pingpong_context *pp_clien return NULL; } - ret = rdma_create_id(&cm_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &cm_id, NULL); if (ret) { fprintf(stderr,"%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_family = PF_INET; @@ -705,7 +729,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -722,7 +746,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -740,6 +764,7 @@ static struct pingpong_context *pp_clien goto err2; } + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xaa; my_dest->rkey = ctx->mr->rkey; @@ -756,7 +781,7 @@ static struct pingpong_context *pp_clien goto err2; } - 
ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -779,6 +804,8 @@ err1: err2: fprintf(stderr,"NOT connected!\n"); rdma_destroy_id(cm_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } From bpradip at in.ibm.com Tue Jun 27 11:01:22 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 23:31:22 +0530 Subject: [openib-general] [IWARP BRANCH] [PATCH 3/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries Message-ID: <20060627180120.GD9249@harry-potter.ibm.com> This fixes the Makefile to properly build rdma_lat and rdma_bw Includes the librdmacm library. Signed-off-by: Pradipta Kumar Banerjee --- Index: Makefile ============================================================================= --- bkp/Makefile 2006-06-22 10:18:58.000000000 +0530 +++ Makefile 2006-06-22 10:26:55.000000000 +0530 @@ -10,7 +10,7 @@ EXTRA_HEADERS = get_clock.h LOADLIBES += LDFLAGS += -${TESTS}: LOADLIBES += -libverbs +${TESTS}: LOADLIBES += -libverbs -lrdmacm ${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS} $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o $@ From swise at opengridcomputing.com Tue Jun 27 11:07:27 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 Jun 2006 13:07:27 -0500 Subject: [openib-general] [IWARP BRANCH] [PATCH 0/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries In-Reply-To: <20060627175141.GA9249@harry-potter.ibm.com> References: <20060627175141.GA9249@harry-potter.ibm.com> Message-ID: <1151431647.3207.47.camel@stevo-desktop> Committed in the iwarp branch. r8254. Thanks, Steve. On Tue, 2006-06-27 at 23:21 +0530, Pradipta Kumar Banerjee wrote: > The present rdma_lat and rdma_bw utilizing the RDMA CM is broken and doesn't > work with the latest libraries. The present code breaks because of using the old > signature for the function rdma_get_cm_event. 
> > old function signature - int rdma_get_cm_event(struct rdma_cm_event **event) > new function signature - int rdma_get_cm_event(struct rdma_event_channel *channel, > struct rdma_cm_event **event) > > This patchset consists of changes for rdma_lat, rdma_bw and Makefile. > > 1 - rdma_lat.c changes > 2 - rdma_bw.c changes > 3 - Makefile changes > > Signed-off-by: Pradipta Kumar Banerjee > > --- > > Thanks, > Pradipta Kumar. From mst at mellanox.co.il Tue Jun 27 11:18:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 21:18:27 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <44A0415A.9000105@ichips.intel.com> References: <44A0415A.9000105@ichips.intel.com> Message-ID: <20060627181827.GD4896@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: CMA backlog (was Re: CMA backlog) > > Michael S. Tsirkin wrote: > > I'm just saying that we can use exactly the code in ib_destroy_cm_id, but > > avoid calling ib_send_cm_rej in this one case: > > Ah... yes, something like that should work. Like this then (untested)? Signed-off-by: Michael S.
Tsirkin Index: linux-2.6.17-2.6.18/drivers/infiniband/core/cm.c =================================================================== --- linux-2.6.17-2.6.18.orig/drivers/infiniband/core/cm.c 2006-06-27 12:21:34.000000000 +0300 +++ linux-2.6.17-2.6.18/drivers/infiniband/core/cm.c 2006-06-27 21:16:49.000000000 +0300 @@ -701,7 +701,7 @@ static void cm_reset_to_idle(struct cm_i } } -void ib_destroy_cm_id(struct ib_cm_id *cm_id) +static void cm_destroy_id(struct ib_cm_id *cm_id, int reject) { struct cm_id_private *cm_id_priv; struct cm_work *work; @@ -731,9 +731,9 @@ retest: ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); spin_unlock_irqrestore(&cm_id_priv->lock, flags); ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT, - &cm_id_priv->av.port->cm_dev->ca_guid, - sizeof cm_id_priv->av.port->cm_dev->ca_guid, - NULL, 0); + &cm_id_priv->av.port->cm_dev->ca_guid, + sizeof cm_id_priv->av.port->cm_dev->ca_guid, + NULL, 0); break; case IB_CM_MRA_REQ_RCVD: case IB_CM_REP_SENT: @@ -744,9 +744,14 @@ retest: case IB_CM_MRA_REQ_SENT: case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, - NULL, 0, NULL, 0); + if (reject) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + } else { + cm_reset_to_idle(cm_id_priv); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + } break; case IB_CM_ESTABLISHED: spin_unlock_irqrestore(&cm_id_priv->lock, flags); @@ -775,6 +780,12 @@ retest: kfree(cm_id_priv->private_data); kfree(cm_id_priv); } + +void ib_destroy_cm_id(struct ib_cm_id *cm_id) +{ + cm_destroy_id(cm_id, 1); +} + EXPORT_SYMBOL(ib_destroy_cm_id); int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, @@ -1163,7 +1174,7 @@ static void cm_process_work(struct cm_id } cm_deref_id(cm_id_priv); if (ret) - ib_destroy_cm_id(&cm_id_priv->id); + cm_destroy_id(&cm_id_priv->id, ret != -ENOMEM); } 
static void cm_format_mra(struct cm_mra_msg *mra_msg, -- MST From halr at voltaire.com Tue Jun 27 12:22:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 15:22:54 -0400 Subject: [openib-general] [PATCH]OpenSM/osm_lid_mgr.c: In __osm_lid_mgr_init_sweep, support enhanced switch port 0 for LMC > 0 Message-ID: <1151436172.4482.199561.camel@hal.voltaire.com> OpenSM/osm_lid_mgr.c: In __osm_lid_mgr_init_sweep, support enhanced switch port 0 for LMC > 0 Base port 0 is constrained to have LMC of 0 whereas enhanced switch port 0 is not. Support enhanced switch port 0 is more like CA and router ports in terms of this. Signed-off-by: Hal Rosenstock Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 8239) +++ opensm/osm_lid_mgr.c (working copy) @@ -94,6 +94,7 @@ #include #include #include +#include #include #include #include @@ -351,6 +352,8 @@ __osm_lid_mgr_init_sweep( osm_lid_mgr_range_t *p_range = NULL; osm_port_t *p_port; cl_qmap_t *p_port_guid_tbl; + osm_switch_t *p_sw; + ib_switch_info_t *p_si; uint8_t lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc); uint16_t lmc_mask; uint16_t req_lid, num_lids; @@ -436,7 +439,20 @@ __osm_lid_mgr_init_sweep( IB_NODE_TYPE_SWITCH ) num_lids = lmc_num_lids; else - num_lids = 1; + { + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + ib_switch_info_is_enhanced_port0(p_si)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } + } if ((num_lids != 1) && (((db_min_lid & lmc_mask) != db_min_lid) || From mshefty at ichips.intel.com Tue Jun 27 12:36:30 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 12:36:30 -0700 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) 
In-Reply-To: <200606261051.12515.jackm@mellanox.co.il> References: <200606261051.12515.jackm@mellanox.co.il> Message-ID: <44A188BE.1080207@ichips.intel.com> Jack Morgenstein wrote: > Evidently, ipoib was still attempting to connect with an SA, when the ipoib > module was unloaded (modprobe -r). After the ipoib module was unloaded (or at > least rendered inaccessible), the ib_sa module attempted to invoke > "ib_sa_mcmember_rec_callback" (for a callback address that was part of the > unloaded ipoib module). Hence, the Oops below. I still haven't been able to reproduce this, but I _think_ I understand what's likely happening. The SA query interface always invokes a callback, regardless if a call succeeds. So if a call to ib_sa_mcmmember_rec_set() fails (which happens in this case because the SM is down), the user's callback is still invoked. The multicast module is coded assuming that an immediate failure does not result in a callback, so the callback is unexpected, which throws off the reference counting. I should have a patch for this shortly, but since I can't reproduce the problem, my testing of it will be limited. - Sean From jlentini at netapp.com Tue Jun 27 13:00:05 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 27 Jun 2006 16:00:05 -0400 (EDT) Subject: [openib-general] new uDAPL co-maintainer Message-ID: In recognition of his many contributions to the DAPL project, Arlin Davis is joining the project as an official co-maintainer. Arlin and I will collaborate on DAPL maintenance and development decisions. 
james -- James Lentini | Network Appliance | 781-768-5359 | jlentini at netapp.com From pw at osc.edu Tue Jun 27 13:21:03 2006 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 27 Jun 2006 16:21:03 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627064234.GG19300@mellanox.co.il> References: <20060626215319.GA9291@osc.edu> <20060627064234.GG19300@mellanox.co.il> Message-ID: <20060627202103.GA10737@osc.edu> mst at mellanox.co.il wrote on Tue, 27 Jun 2006 09:42 +0300: > Quoting r. Pete Wyckoff : > > Is this a known issue? > > Yes. The fact that ibv_query_device returns some value in hca_cap can not > guarantee that ibv_create_qp with these parameters will succeed. For example, > system administrator might have imposed a limit on the amount of memory you can > pin down, and you will get ENOMEM. I was hoping to get a guaranteed maximum number from ibv_query_device so that I would know that calls to ibv_create_qp would not fail due to my asking for too many CQ entries. My code has some idea of how many it wants (16), and compares that to the hca_cap values to settle for what it can get. I only happened to notice that 30 wouldn't work even though it was so claimed when debugging. > > Should I always subtract 1 from the reported max on the send side? Just for > > this hardware? > > Unless you use it, passing the absolute maximum value supported by hardware does > not seem, to me, to make sense - it will just slow you down, and waste > resources. Is there a protocol out there that actually has a use for 30 sge? Perhaps I don't understand what is more resource-costly about using 29 sge when they are supported by the hardware. I'm using them on the send side to avoid having to either: 1. memcpy 29 little buffers into one big buffer or 2. send 29 rdma writes instead of a single rdma write with 29 sges The buffer on the receiver is contiguous and big enough to hold everything. 
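Pete's approach (one RDMA write whose gather list covers all the scattered client buffers) comes down to filling an SGE array and summing the lengths so the contiguous receive side is known to be large enough. A minimal sketch of that bookkeeping follows; the struct is a stand-in for ibv_sge, and nothing below is taken from the pvfs2 code or from <infiniband/verbs.h>:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct ibv_sge; the real definition lives in
 * <infiniband/verbs.h>. Field names mirror it for readability only. */
struct sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Build one gather list covering n discontiguous source buffers so a
 * single RDMA write can land them in one contiguous remote region.
 * Returns the total byte count the remote buffer must hold. */
size_t build_gather_list(struct sge *sg, int n,
                         void *const bufs[], const size_t lens[],
                         uint32_t lkey)
{
    size_t total = 0;
    for (int i = 0; i < n; i++) {
        sg[i].addr   = (uint64_t)(uintptr_t)bufs[i];
        sg[i].length = (uint32_t)lens[i];
        sg[i].lkey   = lkey;    /* each buffer must already be registered */
        total += lens[i];
    }
    return total;
}
```

The resulting array would be attached to a single work request (sg_list/num_sge), which is exactly the alternative to either memcpy into one big buffer or posting one write per fragment.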
> In my opinion, for the application to be robust it has to either use small > values that empirically work on most systems, or be able to scale down to > require less resources if an allocation fails. Scale down? So if ibv_create_qp fails, you think I should look at the return value (which is NULL, not ENOMEM or EINVAL or anything informative), and then gradually reduce the values for max_recv_sge, max_send_sge, max_recv_wr, max_send_wr, max_inline_data below the reported HCA maximum until I find something that works? I'll subtract 1 from the hca_cap.max_sge for Mellanox hardware before doing the comparison against how many SGEs I'd like to get. Otherwise I can't see much alternative to trusting the hca_cap values that are returned. -- Pete From pw at osc.edu Tue Jun 27 13:34:33 2006 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 27 Jun 2006 16:34:33 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <7.0.1.0.2.20060627090204.04471ba0@netapp.com> References: <20060626215319.GA9291@osc.edu> <20060627064234.GG19300@mellanox.co.il> <7.0.1.0.2.20060627090204.04471ba0@netapp.com> Message-ID: <20060627203433.GB10737@osc.edu> Thomas.Talpey at netapp.com wrote on Tue, 27 Jun 2006 09:06 -0400: > At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote: > >Unless you use it, passing the absolute maximum value supported by > >hardware does > >not seem, to me, to make sense - it will just slow you down, and waste > >resources. Is there a protocol out there that actually has a use for 30 sge? > > It's not a protocol thing, it's a memory registration thing. But I agree, > that's a huge number of segments for send and receive. 2-4 is more > typical. I'd be interested to know what wants 30 as well... This is the OpenIB port of pvfs2: http://www.pvfs.org/pvfs2/download.html See pvfs2/src/io/bmi/bmi_ib/openib.c for the bottom of the transport stack. The max_sge-1 aspect I'm complaining about isn't checked in yet. It's a file system application. 
The MPI-IO interface provides datatypes and file views that let a client write complex subsets of the in-memory data to a file with a single call. One case that happens is contiguous-in-file but discontiguous-in-memory, where the file system client writes data from multiple addresses to a single region in a file. The application calls MPI_File_write or a variant, and this complex buffer description filters all the way down to the OpenIB transport, which then has to figure out how to get the data to the server. These separate data regions may have been allocated all at once using MPI_Alloc_mem (rarely), or may have been used previously for file system operations so are already pinned in the registration cache. Are you implying there is more memory registration work that has to happen beyond making sure each of the SGE buffers is pinned and has a valid lkey? It would not be a major problem to avoid using more than a couple of SGEs; however, I didn't see any reason to avoid them. Please let me know if you see a problem with this approach. -- Pete From ralphc at pathscale.com Tue Jun 27 14:06:02 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 27 Jun 2006 14:06:02 -0700 Subject: [openib-general] [PATCH] change libipathverbs to use the new initialization convention Message-ID: <1151442362.4572.67.camel@brick.pathscale.com> The libibverbs.so.2 has a different device plug-in module initialization convention from libibverbs.so.1. This patch updates the InfiniPath libipathverbs module to conform to the new convention.
Signed-off-by Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.map =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.map (revision 8255) +++ src/userspace/libipathverbs/src/ipathverbs.map (working copy) @@ -1,4 +1,4 @@ { - global: openib_driver_init; + global: ibv_driver_init; local: *; }; Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8255) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -145,30 +145,24 @@ .free_context = ipath_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { - struct sysfs_device *pcidev; - struct sysfs_attribute *attr; + char value[8]; struct ipath_device *dev; - unsigned vendor, device; - int i; + unsigned vendor, device; + int i; - pcidev = sysfs_get_classdev_device(sysdev); - if (!pcidev) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; + sscanf(value, "%i", &vendor); - attr = sysfs_get_device_attr(pcidev, "vendor"); - if (!attr) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) return NULL; - sscanf(attr->value, "%i", &vendor); - sysfs_close_attribute(attr); + sscanf(value, "%i", &device); - attr = sysfs_get_device_attr(pcidev, "device"); - if (!attr) - return NULL; - sscanf(attr->value, "%i", &device); - sysfs_close_attribute(attr); - for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) if (vendor == hca_table[i].vendor && device == hca_table[i].device) @@ -180,13 +174,12 @@ dev = malloc(sizeof *dev); if (!dev) { fprintf(stderr, PFX "Fatal: couldn't allocate device for %s\n", - sysdev->name); - abort(); + uverbs_sys_path); + return NULL; } dev->ibv_dev.ops = ipath_dev_ops; dev->hca_type = 
hca_table[i].type; - dev->page_size = sysconf(_SC_PAGESIZE); return &dev->ibv_dev; } Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (revision 8255) +++ src/userspace/libipathverbs/src/ipathverbs.h (working copy) @@ -57,7 +57,6 @@ struct ipath_device { struct ibv_device ibv_dev; enum ipath_hca_type hca_type; - int page_size; }; struct ipath_context { From mshefty at ichips.intel.com Tue Jun 27 14:13:17 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 14:13:17 -0700 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) In-Reply-To: <44A188BE.1080207@ichips.intel.com> References: <200606261051.12515.jackm@mellanox.co.il> <44A188BE.1080207@ichips.intel.com> Message-ID: <44A19F6D.7060008@ichips.intel.com> Sean Hefty wrote: > The SA query interface always invokes a callback, regardless if a call succeeds. > So if a call to ib_sa_mcmmember_rec_set() fails (which happens in this case > because the SM is down), the user's callback is still invoked. The multicast > module is coded assuming that an immediate failure does not result in a > callback, so the callback is unexpected, which throws off the reference counting. I've committed a patch that should hopefully fix this problem. The problem was that a return code of 0 from the SA query calls should have been treated as valid, rather than an error. - Sean From mst at mellanox.co.il Tue Jun 27 14:28:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 00:28:51 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627203433.GB10737@osc.edu> References: <20060627203433.GB10737@osc.edu> Message-ID: <20060627212851.GB5398@mellanox.co.il> Quoting r. Pete Wyckoff : > It would not be a major problem to avoid using more than a couple of > SGEs; however, I didn't see any reason to avoid them. 
Please let me > know if you see a problem with this approach. A QP with a large number of SGEs per WQE enabled uses up more resources and might also be slower if typical WR has a small number of SGEs. So you should anticipate the typical number of SGEs for best performance. As I mentioned previously, even if you do want a large number of SGEs, but you want your application to be robust and scalable, you should scale your parameters down if QP allocation fails since device query does not guarantee the allocation will always succeed. -- MST From ralphc at pathscale.com Tue Jun 27 15:02:18 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 27 Jun 2006 15:02:18 -0700 Subject: [openib-general] [PATCH] add support for ibv_query_qp(), ibv_query_srq() to libipathverbs Message-ID: <1151445738.4572.73.camel@brick.pathscale.com> This patch adds support for ibv_query_qp() and ibv_query_srq() to libipathverbs which are new in libibverbs.so.2. Note that it layers on top of my previous patch. 
Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (old) +++ src/userspace/libipathverbs/src/ipathverbs.h (new) @@ -96,6 +96,10 @@ struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr); + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); @@ -114,6 +118,8 @@ struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask); +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr); + int ipath_destroy_srq(struct ibv_srq *srq); Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- libipathverbs/src/verbs.c (old) +++ libipathverbs/src/verbs.c (new) @@ -40,7 +40,7 @@ #include #include -#include +#include #include #include @@ -193,6 +193,16 @@ return qp; } +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr) +{ + struct ibv_query_qp cmd; + + return ibv_cmd_query_qp(qp, attr, attr_mask, init_attr, + &cmd, sizeof cmd); +} + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask) { @@ -244,6 +254,13 @@ return ibv_cmd_modify_srq(srq, attr, attr_mask, &cmd, sizeof cmd); } +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr) +{ + struct ibv_query_srq cmd; + + return ibv_cmd_query_srq(srq, attr, &cmd, sizeof cmd); +} + int ipath_destroy_srq(struct ibv_srq *srq) { int ret; From ralphc at pathscale.com Tue Jun 27 15:17:43 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 27 Jun 2006 15:17:43 -0700 Subject: [openib-general] [PATCH] trivial white space clean up in libipathverbs Message-ID: 
<1151446663.4572.79.camel@brick.pathscale.com> This patch just corrects some white space code conventions. Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (old) +++ src/userspace/libipathverbs/src/ipathverbs.h (new) @@ -122,7 +122,6 @@ int ipath_destroy_srq(struct ibv_srq *srq); - struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); int ipath_destroy_ah(struct ibv_ah *ah); Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- src/userspace/libipathverbs/src/verbs.c (old) +++ src/userspace/libipathverbs/src/verbs.c (new) @@ -83,11 +83,11 @@ struct ibv_pd *pd; pd = malloc(sizeof *pd); - if(!pd) + if (!pd) return NULL; - if(ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, - &resp, sizeof resp)) { + if (ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, + &resp, sizeof resp)) { free(pd); return NULL; } @@ -232,7 +232,7 @@ int ret; srq = malloc(sizeof *srq); - if(srq == NULL) + if (srq == NULL) return NULL; ret = ibv_cmd_create_srq(pd, srq, attr, &cmd, sizeof cmd, @@ -278,10 +278,10 @@ struct ibv_ah *ah; ah = malloc(sizeof *ah); - if(ah == NULL) + if (ah == NULL) return NULL; - if(ibv_cmd_create_ah(pd, ah, attr)) { + if (ibv_cmd_create_ah(pd, ah, attr)) { free(ah); return NULL; } From sean.hefty at intel.com Tue Jun 27 15:21:05 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 15:21:05 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060627181827.GD4896@mellanox.co.il> Message-ID: <000101c69a37$fb0deb30$e598070a@amr.corp.intel.com> If a user of the IB CM returns -ENOMEM from their connection callback, simply drop the incoming REQ. Do not send a reject, which should allow the sender to retry the request. This is necessary for SDP to support a backlog. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Sean Hefty --- This is a slightly modified version of the patch. I passed the return code directly to the destroy function for future flexibility, and limited the behavior change to REQ processing only. I ran some basic tests to make sure that this didn't break anything. If this looks okay to you, I can commit this to SVN. Index: cm.c =================================================================== --- cm.c (revision 8224) +++ cm.c (working copy) @@ -702,7 +702,7 @@ static void cm_reset_to_idle(struct cm_i } } -void ib_destroy_cm_id(struct ib_cm_id *cm_id) +static void cm_destroy_id(struct ib_cm_id *cm_id, int err) { struct cm_id_private *cm_id_priv; struct cm_work *work; @@ -736,12 +736,22 @@ retest: sizeof cm_id_priv->av.port->cm_dev->ca_guid, NULL, 0); break; + case IB_CM_REQ_RCVD: + if (err == -ENOMEM) { + /* Do not reject to allow future retries. */ + cm_reset_to_idle(cm_id_priv); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + } else { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + } + break; case IB_CM_MRA_REQ_RCVD: case IB_CM_REP_SENT: case IB_CM_MRA_REP_RCVD: ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); /* Fall through */ - case IB_CM_REQ_RCVD: case IB_CM_MRA_REQ_SENT: case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: @@ -776,6 +786,11 @@ retest: kfree(cm_id_priv->private_data); kfree(cm_id_priv); } + +void ib_destroy_cm_id(struct ib_cm_id *cm_id) +{ + cm_destroy_id(cm_id, 0); +} EXPORT_SYMBOL(ib_destroy_cm_id); int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, @@ -1164,7 +1179,7 @@ static void cm_process_work(struct cm_id } cm_deref_id(cm_id_priv); if (ret) - ib_destroy_cm_id(&cm_id_priv->id); + cm_destroy_id(&cm_id_priv->id, ret); } static void cm_format_mra(struct cm_mra_msg *mra_msg, From mst at mellanox.co.il Tue Jun 27 15:38:26 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 28 Jun 2006 01:38:26 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627202103.GA10737@osc.edu> References: <20060627202103.GA10737@osc.edu> Message-ID: <20060627223826.GC5398@mellanox.co.il> Quoting r. Pete Wyckoff : > Subject: Re: max_send_sge < max_sge > > mst at mellanox.co.il wrote on Tue, 27 Jun 2006 09:42 +0300: > > Quoting r. Pete Wyckoff : > > > Is this a known issue? > > > > Yes. The fact that ibv_query_device returns some value in hca_cap can not > > guarantee that ibv_create_qp with these parameters will succeed. For > > example, system administrator might have imposed a limit on the amount of > > memory you can pin down, and you will get ENOMEM. > > I was hoping to get a guaranteed maximum number from ibv_query_device so that > I would know that calls to ibv_create_qp would not fail due to my asking for > too many CQ entries. My code has some idea of how many it wants (16), and > compares that to the hca_cap values to settle for what it can get. I only > happened to notice that 30 wouldn't work even though it was so claimed when > debugging. Ah. I see. Unfortunately I don't think ibv_query_device currently provides this guarantee, and it's not something easy to fix. What are you doing if the hca cap is below the values you want? Also, please see below for ideas about extending the API in a way that might be useful to you. > > > Should I always subtract 1 from the reported max on the send side? Just > > > for this hardware? > > > > Unless you use it, passing the absolute maximum value supported by hardware > > does not seem, to me, to make sense - it will just slow you down, and waste > > resources. Is there a protocol out there that actually has a use for 30 > > sge? > > Perhaps I don't understand what is more resource-costly about using > 29 sge when they are supported by the hardware. Well, more SGEs per WR does mean more resources are consumed for the same number of WRs per QP. OK?
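To make that cost concrete: the QP must reserve room in every send WQE for max_send_sge gather entries, so the per-WR footprint grows with the SGE limit and is typically rounded up to a power-of-two stride. The numbers below are purely illustrative assumptions (a 16-byte gather segment and a 64-byte control header), not Mellanox specifics:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: assume each SGE occupies a 16-byte data segment in
 * the WQE and the control header costs 64 bytes; real sizes are
 * hardware-specific. WQEs are assumed rounded up to a power-of-two
 * stride, as is common for queue layouts. */
size_t wqe_stride(int max_sge)
{
    size_t sz = 64 + (size_t)max_sge * 16;
    size_t stride = 64;
    while (stride < sz)
        stride *= 2;
    return stride;
}
```

With these assumed numbers, 4 SGEs fit a 128-byte stride while 29 SGEs force 1024 bytes per WQE: an 8x memory cost on every posted WR, which is why sizing for the typical SGE count rather than the device maximum matters.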
> I'm using them on the send side to avoid having to either: > 1. memcpy 29 little buffers into one big buffer > or > 2. send 29 rdma writes instead of a single rdma write with 29 sges > The buffer on the receiver is contiguous and big enough to hold > everything. It's the same thing. Seems I'm not being clear. I was just saying that large SGE and WR values have cost so one should use the smallest SGE and WR numbers that still give good performance, not maximum thinkable values. But you probably know this :) > > In my opinion, for the application to be robust it has to either use small > > values that empirically work on most systems, or be able to scale down to > > require less resources if an allocation fails. > > Scale down? So if ibv_create_qp fails, you think I should look at > the return value (which is NULL, not ENOMEM or EINVAL or anything > informative), and then gradually reduce the values for max_recv_sge, > max_send_sge, max_recv_wr, max_send_wr, max_inline_data below the > reported HCA maximum until I find something that works? Well, if there's no bug I see no reason for ibv_create_qp to fail except that you are asking for too many WRs/SGEs. So yes, the trick you describe will work I think. At some point, I tried to think about extending the API in such a way that verbs like ibv_create_qp would round the parameters down to whatever does work. Would something like this be useful to you? Further, if the given SGE/WR pair can't be satisfied, will you want to scale down the number of SGEs or the number of WRs? > I'll subtract 1 from the hca_cap.max_sge for Mellanox hardware > before doing the comparison against how many SGEs I'd like to get. > Otherwise I can't see much alternative to trusting the hca_cap > values that are returned. If this works for you, great. I was just trying to point out query device can not guarantee that QP allocation will always succeed even if you stay within limits it reports.
For example, are you using a large number of WRs per QP as well? If so after alocating a couple of QPs you might run out of locked memory limit allowed per-user, depending on your system setup. QP allocation will then fail, even if you use the hcacap - 1 heuristic. -- MST From mshefty at ichips.intel.com Tue Jun 27 15:40:55 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 15:40:55 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A12CEA.6010204@mellanox.co.il> References: <44A00EEF.702@ichips.intel.com> <20060626174117.GA19929@mellanox.co.il> <44A12CEA.6010204@mellanox.co.il> Message-ID: <44A1B3F7.7090504@ichips.intel.com> Tziporet Koren wrote: > These features are needed for uDAPL and were requested by Woody and > Arlin for Intel MPI scalability. > Since in OFED 1.1 we are going to take CMA from kernel 2.6.18 we need > them upstream. > > Can you drive these enhancements only to 2.6.18. I would like these features in OFED 1.1 as well. However, there are no users of those new interfaces in 2.6.18 that would justify their inclusion. I can target userspace support of the RDMA CM for 2.6.19, but I don't think it makes sense to try for 2.6.18. - Sean From mst at mellanox.co.il Tue Jun 27 15:42:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 01:42:57 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <000101c69a37$fb0deb30$e598070a@amr.corp.intel.com> References: <000101c69a37$fb0deb30$e598070a@amr.corp.intel.com> Message-ID: <20060627224257.GD5398@mellanox.co.il> Quoting r. Sean Hefty : > This is a slightly modified version of the patch. I passed the return > code directly to the destroy function for future flexibility, and > limited the behavior change to REQ processing only. > > I ran some basic tests to make sure that this didn't break anything. > If this looks okay to you, I can commit this to SVN. Looks good to me. 
Please go ahead, then I'll use this in SDP and test this way. -- MST From mshefty at ichips.intel.com Tue Jun 27 15:51:56 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 15:51:56 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060627224257.GD5398@mellanox.co.il> References: <000101c69a37$fb0deb30$e598070a@amr.corp.intel.com> <20060627224257.GD5398@mellanox.co.il> Message-ID: <44A1B68C.9030806@ichips.intel.com> Michael S. Tsirkin wrote: > Looks good to me. Please go ahead, then I'll use this in SDP and test this way. Committed in 8261. From mst at mellanox.co.il Tue Jun 27 15:48:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 01:48:57 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A1B3F7.7090504@ichips.intel.com> References: <44A1B3F7.7090504@ichips.intel.com> Message-ID: <20060627224857.GE5398@mellanox.co.il> Quoting r. Sean Hefty : > > Can you drive these enhancements only to 2.6.18. > > I would like these features in OFED 1.1 as well. However, there are no users > of those new interfaces in 2.6.18 that would justify their inclusion. I think setting the number of retries and timeout in CMA might be useful for iSER as well. Or, what do you think? -- MST From mst at mellanox.co.il Tue Jun 27 16:04:20 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 02:04:20 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A1B3F7.7090504@ichips.intel.com> References: <44A1B3F7.7090504@ichips.intel.com> Message-ID: <20060627230420.GF5398@mellanox.co.il> Quoting r. Sean Hefty : > > Can you drive these enhancements only to 2.6.18. > > I would like these features in OFED 1.1 as well. Would you consider making a git repository available with just the CMA code appropriate for OFED 1.1? Mixing git and SVN code to build OFED is really painful for us. 
-- MST From mshefty at ichips.intel.com Tue Jun 27 16:20:00 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 16:20:00 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060627230420.GF5398@mellanox.co.il> References: <44A1B3F7.7090504@ichips.intel.com> <20060627230420.GF5398@mellanox.co.il> Message-ID: <44A1BD20.1090009@ichips.intel.com> Michael S. Tsirkin wrote: > Would you consider making a git repository available with just > the CMA code appropriate for OFED 1.1? Mixing git and SVN code > to build OFED is really painful for us. Sure, I can consider doing that. There would just be some logistics to work out, like the location of the git tree. Would a patch series in Roland's git tree work? Once he returns, we can start queuing up patches for 2.6.19, which could include any or all of the following: userspace support for the RDMA CM iWarp support latest changes for IB (UD QP and multicast) - Sean From mst at mellanox.co.il Tue Jun 27 16:45:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 02:45:31 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A1BD20.1090009@ichips.intel.com> References: <44A1BD20.1090009@ichips.intel.com> Message-ID: <20060627234531.GG5398@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: ucma into kernel.org > > Michael S. Tsirkin wrote: > > Would you consider making a git repository available with just > > the CMA code appropriate for OFED 1.1? Mixing git and SVN code > > to build OFED is really painful for us. > > Sure, I can consider doing that. There would just be some logistics to work > out, like the location of the git tree. Oh, there's no reason to decide this up front: as I learned, hosting a clone of a git tree is *really* trivial. For example, we can arrange to host a clone of your tree at mellanox.co.il if you like, and let you push there. And it's also trivial to clone and switch to another location whenever you like.
> Would a patch series in Roland's git tree work? You mean a head there, like for-ofed-1.1? Why not. But it does mean you'll need Roland to apply your patches to his tree. > Once he returns, we can start > queuing up patches for 2.6.19, which could include any or all of the following: > > userspace support for the RDMA CM > iWarp support > latest changes for IB (UD QP and multicast) And hopefully the retry/timeout options which started this discussion? :) It is probably best to take whatever is needed in OFED and have a branch with these things, separate from for-2.6.19. -- MST From sean.hefty at intel.com Tue Jun 27 17:21:53 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 17:21:53 -0700 Subject: [openib-general] [PATCH] ib_addr: fix get/set gid alignment issues Message-ID: <000001c69a48$db8e3290$e598070a@amr.corp.intel.com> The device address contains unsigned character arrays, which contain raw GID addresses. The GIDs may not be naturally aligned, so do not cast them to structures or unions. Signed-off-by: Sean Hefty --- This fixes an alignment issue pointed out by Michael when adding MGID support to the ib_addr module.
Index: include/rdma/ib_addr.h =================================================================== --- include/rdma/ib_addr.h (revision 8224) +++ include/rdma/ib_addr.h (working copy) @@ -89,14 +89,16 @@ static inline void ib_addr_set_pkey(stru dev_addr->broadcast[9] = (unsigned char) pkey; } -static inline union ib_gid *ib_addr_get_mgid(struct rdma_dev_addr *dev_addr) +static inline void ib_addr_get_mgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) { - return (union ib_gid *) (dev_addr->broadcast + 4); + memcpy(gid, dev_addr->broadcast + 4, sizeof *gid); } -static inline union ib_gid *ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) +static inline void ib_addr_get_sgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) { - return (union ib_gid *) (dev_addr->src_dev_addr + 4); + memcpy(gid, dev_addr->src_dev_addr + 4, sizeof *gid); } static inline void ib_addr_set_sgid(struct rdma_dev_addr *dev_addr, @@ -105,9 +107,10 @@ static inline void ib_addr_set_sgid(stru memcpy(dev_addr->src_dev_addr + 4, gid, sizeof *gid); } -static inline union ib_gid *ib_addr_get_dgid(struct rdma_dev_addr *dev_addr) +static inline void ib_addr_get_dgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) { - return (union ib_gid *) (dev_addr->dst_dev_addr + 4); + memcpy(gid, dev_addr->dst_dev_addr + 4, sizeof *gid); } static inline void ib_addr_set_dgid(struct rdma_dev_addr *dev_addr, Index: core/ucma_ib.c =================================================================== --- core/ucma_ib.c (revision 8224) +++ core/ucma_ib.c (working copy) @@ -40,27 +40,27 @@ static int ucma_get_paths(struct rdma_cm struct ib_sa_cursor *cursor; struct ib_sa_path_rec *path; struct ib_user_path_rec user_path; - union ib_gid *gid; + union ib_gid gid; int left, ret = 0; u16 pkey; if (!id->device) return -ENODEV; - gid = ib_addr_get_dgid(&id->route.addr.dev_addr); + ib_addr_get_dgid(&id->route.addr.dev_addr, &gid); pkey = ib_addr_get_pkey(&id->route.addr.dev_addr); - cursor = 
ib_create_path_cursor(id->device, id->port_num, gid); + cursor = ib_create_path_cursor(id->device, id->port_num, &gid); if (IS_ERR(cursor)) return PTR_ERR(cursor); - gid = ib_addr_get_sgid(&id->route.addr.dev_addr); + ib_addr_get_sgid(&id->route.addr.dev_addr, &gid); left = *len; *len = 0; for (path = ib_get_next_sa_attr(&cursor); path; path = ib_get_next_sa_attr(&cursor)) { if (pkey == path->pkey && - !memcmp(gid, path->sgid.raw, sizeof *gid)) { + !memcmp(&gid, path->sgid.raw, sizeof gid)) { if (paths) { ib_copy_path_rec_to_user(&user_path, path); if (copy_to_user(paths, &user_path, Index: core/cma.c =================================================================== --- core/cma.c (revision 8224) +++ core/cma.c (working copy) @@ -278,14 +278,14 @@ static void cma_detach_from_dev(struct r static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) { struct cma_device *cma_dev; - union ib_gid *gid; + union ib_gid gid; int ret = -ENODEV; - gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); mutex_lock(&lock); list_for_each_entry(cma_dev, &dev_list, list) { - ret = ib_find_cached_gid(cma_dev->device, gid, + ret = ib_find_cached_gid(cma_dev->device, &gid, &id_priv->id.port_num, NULL); if (!ret) { cma_attach_to_dev(id_priv, cma_dev); @@ -1266,8 +1266,8 @@ static int cma_query_ib_route(struct rdm struct ib_sa_path_rec path_rec; memset(&path_rec, 0, sizeof path_rec); - path_rec.sgid = *ib_addr_get_sgid(addr); - path_rec.dgid = *ib_addr_get_dgid(addr); + ib_addr_get_sgid(addr, &path_rec.sgid); + ib_addr_get_dgid(addr, &path_rec.dgid); path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; @@ -1326,8 +1326,10 @@ static int cma_resolve_ib_route(struct r goto err1; } + ib_addr_get_sgid(addr, &route->path_rec->sgid); + ib_addr_get_dgid(addr, &route->path_rec->dgid); ret = ib_get_path_rec(id_priv->id.device, id_priv->id.port_num, - ib_addr_get_sgid(addr), ib_addr_get_dgid(addr), +
&route->path_rec->sgid, &route->path_rec->dgid, ib_addr_get_pkey(addr), route->path_rec); if (!ret) { route->num_paths = 1; @@ -1463,7 +1465,7 @@ static int cma_bind_loopback(struct rdma { struct cma_device *cma_dev; struct ib_port_attr port_attr; - union ib_gid *gid; + union ib_gid gid; u16 pkey; int ret; u8 p; @@ -1484,8 +1486,7 @@ static int cma_bind_loopback(struct rdma } port_found: - gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); - ret = ib_get_cached_gid(cma_dev->device, p, 0, gid); + ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid); if (ret) goto out; @@ -1493,6 +1494,7 @@ port_found: if (ret) goto out; + ib_addr_set_sgid(&id_priv->id.route.addr.dev_addr, &gid); ib_addr_set_pkey(&id_priv->id.route.addr.dev_addr, pkey); id_priv->id.port_num = p; cma_attach_to_dev(id_priv, cma_dev); @@ -1539,6 +1541,7 @@ static int cma_resolve_loopback(struct r { struct cma_work *work; struct sockaddr_in *src_in, *dst_in; + union ib_gid gid; int ret; work = kzalloc(sizeof *work, GFP_KERNEL); @@ -1551,8 +1554,8 @@ static int cma_resolve_loopback(struct r goto err; } - ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, - ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr)); + ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); + ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, &gid); if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; @@ -2153,8 +2156,9 @@ static int cma_join_ib_multicast(struct ib_sa_comp_mask comp_mask; int ret; + ib_addr_get_mgid(dev_addr, &rec.mgid); ret = ib_get_mcmember_rec(id_priv->id.device, id_priv->id.port_num, - ib_addr_get_mgid(dev_addr), &rec); + &rec.mgid, &rec); if (ret) return ret; @@ -2163,8 +2167,8 @@ static int cma_join_ib_multicast(struct mc_map[8] = ib_addr_get_pkey(dev_addr) >> 8; mc_map[9] = (unsigned char) ib_addr_get_pkey(dev_addr); - rec.mgid = *(union ib_gid *) (mc_map + 4); - rec.port_gid = *ib_addr_get_sgid(dev_addr); + memcpy(&rec.mgid.raw, 
mc_map + 4, sizeof rec.mgid); + ib_addr_get_sgid(dev_addr, &rec.port_gid); rec.pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); rec.join_state = 1; rec.qkey = sin->sin_addr.s_addr; Index: core/ucma.c =================================================================== --- core/ucma.c (revision 8224) +++ core/ucma.c (working copy) @@ -453,10 +453,10 @@ static void ucma_copy_ib_route(struct rd switch (route->num_paths) { case 0: dev_addr = &route->addr.dev_addr; - memcpy(&resp->ib_route[0].dgid, ib_addr_get_dgid(dev_addr), - sizeof(union ib_gid)); - memcpy(&resp->ib_route[0].sgid, ib_addr_get_sgid(dev_addr), - sizeof(union ib_gid)); + ib_addr_get_dgid(dev_addr, + (union ib_gid *) &resp->ib_route[0].dgid); + ib_addr_get_sgid(dev_addr, + (union ib_gid *) &resp->ib_route[0].sgid); resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); break; case 2: From tziporet at mellanox.co.il Tue Jun 27 22:53:43 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 28 Jun 2006 08:53:43 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A1BD20.1090009@ichips.intel.com> References: <44A1B3F7.7090504@ichips.intel.com> <20060627230420.GF5398@mellanox.co.il> <44A1BD20.1090009@ichips.intel.com> Message-ID: <44A21967.7040907@mellanox.co.il> Sean Hefty wrote: > Sure, I can consider doing that. There would just be some logistics > to work out, like the location of the git tree. > > Would a patch series in Roland's git tree work? Once he returns, we > can start queuing up patches for 2.6.19, which could include any or > all of the following: > > userspace support for the RDMA CM > iWarp support > latest changes for IB (UD QP and multicast) > > - Sean > For OFED 1.1 we need only userspace support for the RDMA CM Tziporet From tziporet at mellanox.co.il Tue Jun 27 23:33:05 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 28 Jun 2006 09:33:05 +0300 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) 
In-Reply-To: <44A152B0.3000007@ichips.intel.com> References: <200606261051.12515.jackm@mellanox.co.il> <44A075C2.6060409@ichips.intel.com> <44A1270D.2070109@mellanox.co.il> <44A152B0.3000007@ichips.intel.com> Message-ID: <44A222A1.1020502@mellanox.co.il> Sean Hefty wrote: > > I am working on trying to resolve this as my top priority at the > moment, but I have not been able to reproduce this on my systems. I > want to understand why ib_sa was not unloaded as part of modprobe -r > ib_ipoib, but why ib_multicast apparently was. I will examine the > script that you mentioned, but I typically do not run the OFED release. > > - Sean > No need to run the OFED release, just take the openibd script from https://openib.org/svn/gen2/branches/1.0/ofed/openib/scripts/ and use it: openibd start and openibd stop. In order for it to load/unload modules you also need to have the file openib.conf under the /etc/infiniband directory with this content: # Start HCA driver upon boot ONBOOT=yes # Load MTHCA MTHCA_LOAD=yes # Load IPoIB IPOIB_LOAD=yes Tziporet From erezz at voltaire.com Wed Jun 28 04:41:35 2006 From: erezz at voltaire.com (Erez Zilber) Date: Wed, 28 Jun 2006 14:41:35 +0300 Subject: [openib-general] [PATCH] iser: fix iSER description in Kconfig Message-ID: <44A26AEF.6090204@voltaire.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: iser_description.diff URL: From Thomas.Talpey at netapp.com Wed Jun 28 05:36:51 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 Jun 2006 08:36:51 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627203433.GB10737@osc.edu> References: <20060626215319.GA9291@osc.edu> <20060627064234.GG19300@mellanox.co.il> <7.0.1.0.2.20060627090204.04471ba0@netapp.com> <20060627203433.GB10737@osc.edu> Message-ID: <7.0.1.0.2.20060628082216.04740028@netapp.com> Yep, you're confirming my comment that the sge size is dependent on the memory registration strategy (and not the protocol itself). Because you have a pool approach, you potentially have a lot of discontiguous regions. Therefore, you need more sge's. (You could have the same issue with large preregistrations, etc.) If it's just for RDMA Write, the penalty really isn't that high - you can easily break the i/o up into separate RDMA Write ops and pump them out in a sequence. The HCA streams them, and using unsignalled completion on the WRs means the host overhead can be low. For sends, it's more painful. You have to "pull them up". Do you really need send inlines to be that big? I guess if you're supporting a writev() api over inline you don't have much control, but even writev has a maxiov. The approach the NFS/RDMA client takes is basically to have a pool of dedicated buffers for headers, with a certain amount of space for "small" sends. This maximum inline size is typically 1K or maybe 4K (it's configurable), and it copies send data into them if it fits. All other operations are posted as "chunks", which are explicit protocol objects corresponding to { mr, offset, length } triplets. The protocol supports an arbitrary number of them, but typically 8 is plenty. Each chunk results in an RDMA op from the server. If the server is coded well, the RDMA streams beautifully and there is no bandwidth issue. Just some ideas. I feel your pain. Tom. 
At 04:34 PM 6/27/2006, Pete Wyckoff wrote: >Thomas.Talpey at netapp.com wrote on Tue, 27 Jun 2006 09:06 -0400: >> At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote: >> >Unless you use it, passing the absolute maximum value supported by >> >hardware does >> >not seem, to me, to make sense - it will just slow you down, and waste >> >resources. Is there a protocol out there that actually has a use >for 30 sge? >> >> It's not a protocol thing, it's a memory registration thing. But I agree, >> that's a huge number of segments for send and receive. 2-4 is more >> typical. I'd be interested to know what wants 30 as well... > >This is the OpenIB port of pvfs2: http://www.pvfs.org/pvfs2/download.html >See pvfs2/src/io/bmi/bmi_ib/openib.c for the bottom of the transport >stack. The max_sge-1 aspect I'm complaining about isn't checked in yet. > >It's a file system application. The MPI-IO interface provides >datatypes and file views that let a client write complex subsets of >the in-memory data to a file with a single call. One case that >happens is contiguous-in-file but discontiguous-in-memory, where the >file system client writes data from multiple addresses to a single >region in a file. The application calls MPI_File_write or a >variant, and this complex buffer description filters all the way >down to the OpenIB transport, which then has to figure out how to >get the data to the server. > >These separate data regions may have been allocated all at once >using MPI_Alloc_mem (rarely), or may have been used previously for >file system operations so are already pinned in the registration >cache. Are you implying there is more memory registration work that >has to happen beyond making sure each of the SGE buffers is pinned >and has a valid lkey? > >It would not be a major problem to avoid using more than a couple of >SGEs; however, I didn't see any reason to avoid them. Please let me >know if you see a problem with this approach. 
> > -- Pete From mst at mellanox.co.il Wed Jun 28 05:42:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 15:42:07 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <7.0.1.0.2.20060628082216.04740028@netapp.com> References: <7.0.1.0.2.20060628082216.04740028@netapp.com> Message-ID: <20060628124207.GZ19300@mellanox.co.il> Quoting r. Talpey, Thomas : > Just some ideas. I feel your pain. Is there something that would make life easier for you? -- MST From Thomas.Talpey at netapp.com Wed Jun 28 05:51:57 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 Jun 2006 08:51:57 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060628124207.GZ19300@mellanox.co.il> References: <7.0.1.0.2.20060628082216.04740028@netapp.com> <20060628124207.GZ19300@mellanox.co.il> Message-ID: <7.0.1.0.2.20060628084952.042a2d90@netapp.com> At 08:42 AM 6/28/2006, Michael S. Tsirkin wrote: >Quoting r. Talpey, Thomas : >> Just some ideas. I feel your pain. > >Is there something that would make life easier for you? A work-request-based IBTA1.2/iWARP-compliant FMR implementation. Please. :-) Tom. From pw at osc.edu Wed Jun 28 07:21:21 2006 From: pw at osc.edu (Pete Wyckoff) Date: Wed, 28 Jun 2006 10:21:21 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627223826.GC5398@mellanox.co.il> References: <20060627202103.GA10737@osc.edu> <20060627223826.GC5398@mellanox.co.il> Message-ID: <20060628142121.GA11906@osc.edu> mst at mellanox.co.il wrote on Wed, 28 Jun 2006 01:38 +0300: > If this works for you, great. I was just trying to point out query device can > not guarantee that QP allocation will always succeed even if you stay within > limits it reports. > > For example, are you using a large number of WRs per QP as well? If so after > allocating a couple of QPs you might run out of locked memory limit allowed > per-user, depending on your system setup.
QP allocation will then fail, even if > you use the hcacap - 1 heuristic. Thanks for all the comments. I'm not specifically trying to be a pain here. The bit I was failing to notice was that when considering many QP allocations, the resource demands add up faster when using more SGEs each. Still find it odd that the very first QP created can not achieve the maximum-reported values, but understand your general argument. Regarding the API, some interfaces I've seen will do the equivalent of putting the "max currently available" values in ibv_qp_init_attr so userspace can reconsider and try again. I never liked that very much, and it doesn't help much in this multi-dimensional space where WRs and SGEs apparently share the same overall constraints. Plus the returned values aren't guaranteed to be valid next time an attempt is made anyway, so don't do that. :) It may make people realize what's going on faster to get an actual return value somewhere. Right now many failure conditions are lumped into the returned NULL pointer: attr->cap values are bigger than HCA max, a library malloc failed, the HCA is out of new QP resources, the HCA is on fire. That said, an API that returns an explicit error code is clumsy: int ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr, struct ibv_qp **newqp); struct ibv_qp *qp; int ret = ibv_create_qp(pd, &attr, &qp); if (ret < 0) printf("create qp failed: %s", strerror(-ret)); So I'll have to vote against that bad idea too. It would be possible but odd to store the return code in errno. I.e., use the current API, but augmented to stick the return value in the (thread-private) errno. I'm not sure if I've seen anything outside of libc use errno. Having "ibv_errno" would be icky. Thanks, -- Pete From mst at mellanox.co.il Wed Jun 28 07:37:56 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 28 Jun 2006 17:37:56 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <7.0.1.0.2.20060628084952.042a2d90@netapp.com> References: <7.0.1.0.2.20060628084952.042a2d90@netapp.com> Message-ID: <20060628143755.GC19300@mellanox.co.il> Quoting r. Talpey, Thomas : > Subject: Re: max_send_sge < max_sge > > At 08:42 AM 6/28/2006, Michael S. Tsirkin wrote: > >Quoting r. Talpey, Thomas : > >> Just some ideas. I feel your pain. > > > >Is there something that would make life easier for you? > > A work-request-based IBTA1.2/iWARP-compliant FMR implementation. Hmm. Not an easy one :) Just to clarify: what feature exactly do you want to use? The spec has 3 relevant compliance statements as far as I can see. With respect to fast registration: o10-37.2.6: If the HCA supports the Base Memory Management Extensions, the Fast Registration must take place before any subsequent Work Request on the same Send Queue is started. It looks like Fast Registration can bypass previous work requests, so the existing FMR implementation already has this property, I think. So I am guessing that what you want is one of the o10-37.2.19: Relaxed ordered, o10-37.2.20: Local Invalidate Fencing or o10-37.2.21: Send with Invalidate. Which one is it then? -- MST From mst at mellanox.co.il Wed Jun 28 07:51:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 17:51:02 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060628142121.GA11906@osc.edu> References: <20060628142121.GA11906@osc.edu> Message-ID: <20060628145102.GD19300@mellanox.co.il> Quoting r. Pete Wyckoff : > Subject: Re: max_send_sge < max_sge > > mst at mellanox.co.il wrote on Wed, 28 Jun 2006 01:38 +0300: > > If this works for you, great. I was just trying to point out query device > > can not guarantee that QP allocation will always succeed even if you stay > > within limits it reports. > > > > For example, are you using a large number of WRs per QP as well?
If so > > after alocating a couple of QPs you might run out of locked memory limit > > allowed per-user, depending on your system setup. QP allocation will then > > fail, even if you use the hcacap - 1 heuristic. > > Thanks for all the comments. I'm not specifically trying to be a > pain here. The bit I was failing to notice was that when > considering many QP allocations, the resource demands add up faster > when using more SGEs each. Still find it odd that the very first > QP created can not achieve the maximum-reported values, but > understand your general argument. Yea, that's because the API only can report 1 max value. But when this was considered the concensus was its not worth extending the API because of the other issues you mention. > Regarding the API, some interfaces I've seen will do the equivalent > of putting the "max currently available" values in ibv_qp_init_attr > so userspace can reconsider and try again. I never liked that very > much, and it doesn't help much in this multi-dimensional space where > WRs and SGEs apparently share the same overall constraints. Plus > the returned values aren't guaranteed to be valid next time an > attempt is made anyway, so don't do that. :) Yep. We could have an option to have the stack scale the requested values down to some legal set instead of failing an allocation. But we couldn't come up with a clean way to tell the stack e.g. what should it round down: the SGE or WR value. Do you think selecting something arbitrarily might still be a good idea? So in the end we are back to either using low numbers that just work empirically, or starting with some value and going down till it succeeds. -- MST From caitlinb at broadcom.com Wed Jun 28 09:52:35 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 28 Jun 2006 09:52:35 -0700 Subject: [openib-general] max_send_sge < max_sge Message-ID: <54AD0F12E08D1541B826BE97C98F99F15F5B4D@NT-SJCA-0751.brcm.ad.broadcom.com> > > Yep. 
We could have an option to have the stack scale the > requested values down to some legal set instead of failing an > allocation. But we couldn't come up with a clean way to tell > the stack e.g. what should it round down: the SGE or WR > value. Do you think selecting something arbitrarily might still be a > good idea? > Having a "query only" option might help here. With this size SGE, what is the largest number of SGEs that I could currently get? (but don't actually allocate that yet) From mst at mellanox.co.il Wed Jun 28 10:14:28 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 20:14:28 +0300 Subject: [openib-general] [PATCH -stable] IB/mthca: restore missing PCI registers after reset Message-ID: <20060628171428.GF19300@mellanox.co.il> Hello, stable team! The pull of the following fix was requested by Roland Dreier just a couple of days before 2.6.17 came out, and so it seems it missed 2.6.17 by a narrow margin: http://lkml.org/lkml/2006/6/13/164 It is now upstream: http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=13aa6ecb47990cfc78e20e347fdd3f1df6189426 As I hear from users about systems where mthca does not work at all without this patch, please consider it for -stable. Note: Roland Dreier is currently unavailable, and said he will be for a while. I am assuming since he ACKed this for 2.6.17 it's good for -stable as well as far as he's concerned. --- mthca does not restore the following PCI-X/PCI Express registers after reset: PCI-X device: PCI-X command register PCI-X bridge: upstream and downstream split transaction registers PCI Express : PCI Express device control and link control registers This causes instability and/or bad performance on systems where one of these registers is set to a non-default value by BIOS. Signed-off-by: Michael S.
Tsirkin diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c index df5e494..f4fddd5 100644 --- a/drivers/infiniband/hw/mthca/mthca_reset.c +++ b/drivers/infiniband/hw/mthca/mthca_reset.c @@ -49,6 +49,12 @@ int mthca_reset(struct mthca_dev *mdev) u32 *hca_header = NULL; u32 *bridge_header = NULL; struct pci_dev *bridge = NULL; + int bridge_pcix_cap = 0; + int hca_pcie_cap = 0; + int hca_pcix_cap = 0; + + u16 devctl; + u16 linkctl; #define MTHCA_RESET_OFFSET 0xf0010 #define MTHCA_RESET_VALUE swab32(1) @@ -110,6 +116,9 @@ #define MTHCA_RESET_VALUE swab32(1) } } + hca_pcix_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + hca_pcie_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (bridge) { bridge_header = kmalloc(256, GFP_KERNEL); if (!bridge_header) { @@ -129,6 +138,13 @@ #define MTHCA_RESET_VALUE swab32(1) goto out; } } + bridge_pcix_cap = pci_find_capability(bridge, PCI_CAP_ID_PCIX); + if (!bridge_pcix_cap) { + err = -ENODEV; + mthca_err(mdev, "Couldn't locate HCA bridge " + "PCI-X capability, aborting.\n"); + goto out; + } } /* actually hit reset */ @@ -178,6 +194,20 @@ #define MTHCA_RESET_VALUE swab32(1) good: /* Now restore the PCI headers */ if (bridge) { + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0x8, + bridge_header[(bridge_pcix_cap + 0x8) / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Upstream " + "split transaction control, aborting.\n"); + goto out; + } + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0xc, + bridge_header[(bridge_pcix_cap + 0xc) / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Downstream " + "split transaction control, aborting.\n"); + goto out; + } /* * Bridge control register is at 0x3e, so we'll * naturally restore it last in this loop. 
@@ -203,6 +233,35 @@ good: } } + if (hca_pcix_cap) { + if (pci_write_config_dword(mdev->pdev, hca_pcix_cap, + hca_header[hca_pcix_cap / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-X " + "command register, aborting.\n"); + goto out; + } + } + + if (hca_pcie_cap) { + devctl = hca_header[(hca_pcie_cap + PCI_EXP_DEVCTL) / 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_DEVCTL, + devctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI Express " + "Device Control register, aborting.\n"); + goto out; + } + linkctl = hca_header[(hca_pcie_cap + PCI_EXP_LNKCTL) / 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_LNKCTL, + linkctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI Express " + "Link control register, aborting.\n"); + goto out; + } + } + for (i = 0; i < 16; ++i) { if (i * 4 == PCI_COMMAND) continue; -- MST From bugzilla-daemon at openib.org Wed Jun 28 12:17:33 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 28 Jun 2006 12:17:33 -0700 (PDT) Subject: [openib-general] [Bug 159] New: OFED1.0: Missing interfaces Message-ID: <20060628191733.B7E3922873F@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=159 Summary: OFED1.0: Missing interfaces Product: OpenFabrics Linux Version: gen2 Platform: Other OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Verbs AssignedTo: bugzilla at openib.org ReportedBy: venkatesh.babu at 3leafnetworks.com I was looking for the Gen2 equivalent of the Gen1 Access Layer interfaces tsIbInServiceNoticeHandler() and ib_cm_path_migrate(), but I could not find any that provide similar functionality. I was wondering if you can give any comments on why it was omitted in Gen2 and/or if there are any plans of implementing it in future releases. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From pw at osc.edu Wed Jun 28 12:29:39 2006 From: pw at osc.edu (Pete Wyckoff) Date: Wed, 28 Jun 2006 15:29:39 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060628145102.GD19300@mellanox.co.il> References: <20060628142121.GA11906@osc.edu> <20060628145102.GD19300@mellanox.co.il> Message-ID: <20060628192939.GA12298@osc.edu> mst at mellanox.co.il wrote on Wed, 28 Jun 2006 17:51 +0300: > Yea, that's because the API only can report 1 max value. But when > this was considered the concensus was its not worth extending the API > because of the other issues you mention. Maybe you should report min(max_recv_sge, max_send_sge) instead of max(). In this case I don't care because I currently need fewer SGEs than either limit. I'm just worried you're going to get the same complaint by newbie IB users later. > Yep. We could have an option to have the stack scale the requested values down > to some legal set instead of failing an allocation. But we couldn't come up > with a clean way to tell the stack e.g. what should it round down: the SGE or > WR value. Do you think selecting something arbitrarily might still be a good > idea? No. If I get fewer WRs than requested, the app would break. If I get fewer SGs, things would work for this particular app with some more infrastructure to check for that, but I don't see how that could be a general rule. I like the model where the app can provision itself by querying the NIC before opening any QPs, then get the same settings for every QP, until the maximum number of QPs is reached. We already have a way in PVFS2 to close "idle" connections, but it isn't hooked up into QP allocation failure yet. I prefer to do that than to limp along on certain connections with fewer WRs or SGs, along with all the code that would have to be added to handle that situation. > So in the end we are back to either using low numbers that just work > empirically, or starting with some value and going down till it succeeds. Yep. 
Thanks for the insight. It'll be fun when I try to get this to work on amso with only 4 SGEs per QP. -- Pete From Thomas.Talpey at netapp.com Wed Jun 28 12:31:34 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 Jun 2006 15:31:34 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060628145102.GD19300@mellanox.co.il> References: <20060628142121.GA11906@osc.edu> <20060628145102.GD19300@mellanox.co.il> Message-ID: <7.0.1.0.2.20060628152709.04471ce8@netapp.com> At 10:51 AM 6/28/2006, Michael S. Tsirkin wrote: >Yep. We could have an option to have the stack scale the requested values down >to some legal set instead of failing an allocation. But we couldn't come up >with a clean way to tell the stack e.g. what should it round down: the SGE or >WR value. Do you think selecting something arbitrarily might still be a good >idea? No! Well, not as the default. Otherwise, the consumer has to go back and check what happened even on success, which is a royal pain and highly inefficient. Maybe we should pass in an optional attribute structure, that is returned with the granted attributes on success, or the would-have-been attributes on failure? Tom. From bugzilla-daemon at openib.org Wed Jun 28 12:55:00 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 28 Jun 2006 12:55:00 -0700 (PDT) Subject: [openib-general] [Bug 160] New: OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL Message-ID: <20060628195500.DE33A22873F@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=160 Summary: OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL Product: OpenFabrics Linux Version: gen2 Platform: All OS/Version: All Status: NEW Severity: major Priority: P2 Component: Verbs AssignedTo: bugzilla at openib.org ReportedBy: venkatesh.babu at 3leafnetworks.com I have created an RC QP and am establishing a connection with a remote RC QP using the interfaces defined in ib_cm.h.
I was loading the alternate_path before calling ib_send_cm_req(). To transition the RC QP to IB_QPS_RTR state, I was calling ib_cm_init_qp_attr() to initialize the struct ib_qp_attr and calling ib_modify_qp(). It failed with -EINVAL. I found that this problem is due to a bug in ib_cm_init_qp_attr() which was not initializing the struct ib_qp_attr fields correctly. I made the following changes in cm_init_qp_rtr_attr() of openib-1.0/src/linux-kernel/infiniband/core/cm.c if (cm_id_priv->alt_av.ah_attr.dlid) { *qp_attr_mask |= IB_QP_ALT_PATH; + qp_attr->alt_port_num = + cm_id_priv->alt_av.port->port_num; qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr; } With this patch ib_modify_qp() worked fine and I was able to establish the connection with the remote RC QP. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Wed Jun 28 15:52:59 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 28 Jun 2006 15:52:59 -0700 (PDT) Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL Message-ID: <20060628225259.3BFC922873F@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=160 ------- Comment #1 from sean.hefty at intel.com 2006-06-28 15:52 ------- Thanks for the info. I have committed the fix to SVN revision 8267. The OFED release will need to be updated separately. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
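The effect of the committed fix above can be seen in a reduced model of the structures involved. This is a sketch with abbreviated, hypothetical type names (ah_attr, cm_av, qp_attr, and QP_ALT_PATH stand in for the kernel's ib_ah_attr, cm_av, ib_qp_attr, and IB_QP_ALT_PATH), and it flattens the real alt_av.port->port_num indirection into a plain field; it is not the cm.c code itself.

```c
#include <assert.h>
#include <stdint.h>

#define QP_ALT_PATH 0x1  /* hypothetical stand-in for IB_QP_ALT_PATH */

struct ah_attr { uint16_t dlid; };
struct cm_av   { struct ah_attr ah_attr; uint8_t port_num; };
struct qp_attr { uint8_t alt_port_num; struct ah_attr alt_ah_attr; };

/* Mirrors the fixed branch of cm_init_qp_rtr_attr(): when an alternate
 * path has been loaded (dlid != 0), the alternate port number must be
 * filled in along with the address handle attributes. */
static void init_alt_path(const struct cm_av *alt_av,
                          struct qp_attr *qp_attr, int *qp_attr_mask)
{
    if (alt_av->ah_attr.dlid) {
        *qp_attr_mask |= QP_ALT_PATH;
        qp_attr->alt_port_num = alt_av->port_num;  /* the added line */
        qp_attr->alt_ah_attr = alt_av->ah_attr;
    }
}
```

Without the marked line, the mask still requests the alternate path while alt_port_num is left uninitialized — per the report above, exactly the inconsistency ib_modify_qp() rejected with -EINVAL.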
From mshefty at ichips.intel.com Wed Jun 28 16:24:05 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 28 Jun 2006 16:24:05 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: <44A30F95.2050408@ichips.intel.com> Roland Dreier wrote: >>I suggest the following design: the CMA would replace the event handler >>provided with the qp_init_attr struct with a callback of its own and >>keep the original handler/context on a private structure. > > > This is probably fine. There is one further situation where the > connection needs to be established, beyond RTU and the communication > established async event. Namely, if a receive completion is polled. > Since async events are, well, asynchronous, there's no guarantee that > the communication established event will be reported any time soon... This brings up a good point. Even if a user gets a communication established event, the IB CM could have already timed out and failed the connection. I don't think that we can do anything about this. I should also point out that the proposed design will not work for userspace. I'm hesitant to make this change until a solution for userspace can also be found, in the hope that a common fix can be shared. - Sean From bos at pathscale.com Wed Jun 28 16:54:53 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Jun 2006 16:54:53 -0700 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them Message-ID: <1151538893.13430.43.camel@obsidian> Hi, Andrew - I have a pile of patches for the ipath driver that I'd like to get in during the "open season" window. Roland has his hands full with diapers and other sprog paraphernalia as of a few days ago, so I doubt he'll see this message soon, much less care about the patches. 
Given Roland's presumed unavailability, would the appropriate thing be to drop the patches into -mm and then push them along to Linus, or what? From akpm at osdl.org Wed Jun 28 17:13:18 2006 From: akpm at osdl.org (Andrew Morton) Date: Wed, 28 Jun 2006 17:13:18 -0700 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them In-Reply-To: <1151538893.13430.43.camel@obsidian> References: <1151538893.13430.43.camel@obsidian> Message-ID: <20060628171318.7d97d617.akpm@osdl.org> "Bryan O'Sullivan" wrote: > > Hi, Andrew - > > I have a pile of patches for the ipath driver that I'd like to get in > during the "open season" window. Roland has his hands full with diapers > and other sprog paraphernalia as of a few days ago, so I doubt he'll see > this message soon, much less care about the patches. > > Given Roland's presumed unavailability, would the appropriate thing be > to drop the patches into -mm and then push them along to Linus, or what? > We can do that, sure. Please cc openib and lkml and netdev and whatever-else-you-can-think of when you send them over. From mst at mellanox.co.il Wed Jun 28 22:33:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 29 Jun 2006 08:33:01 +0300 Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL In-Reply-To: <20060628225259.3BFC922873F@openib.ca.sandia.gov> References: <20060628225259.3BFC922873F@openib.ca.sandia.gov> Message-ID: <20060629053301.GA5127@mellanox.co.il> Quoting r. bugzilla-daemon at openib.org : > Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL > > http://openib.org/bugzilla/show_bug.cgi?id=160 > > > > > > ------- Comment #1 from sean.hefty at intel.com 2006-06-28 15:52 ------- > Thanks for the info. I have committed the fix to SVN revision 8267. The OFED > release will need to be updated separately. OFED is tracking 2.6.18 so to get things there they need to be submitted to Roland's for-2.6.18 tree. -- MST From mst at mellanox.co.il Wed Jun 28 22:45:24 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 29 Jun 2006 08:45:24 +0300 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: <20060629054524.GC5127@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: design for communication established affiliated asynchronous event handling > > >I suggest the following design: the CMA would replace the event handler > >provided with the qp_init_attr struct with a callback of its own and > >keep the original handler/context on a private structure. > > This is probably fine. There is one further situation where the > connection needs to be established, beyond RTU and the communication > established async event. Namely, if a receive completion is polled. > Since async events are, well, asynchronous, there's no guarantee that > the communication established event will be reported any time soon... How about user taking this into account and not arming the CQ / not polling it until the established event? -- MST From sean.hefty at intel.com Wed Jun 28 22:50:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 Jun 2006 22:50:38 -0700 Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL In-Reply-To: <20060629053301.GA5127@mellanox.co.il> Message-ID: <000001c69b3f$f3045470$e7d8180a@amr.corp.intel.com> >OFED is tracking 2.6.18 so to get things there they need to be submitted to >Roland's for-2.6.18 tree. I downloaded Linus' latest tree today, and will submit a patch tomorrow. 
- Sean From sean.hefty at intel.com Wed Jun 28 22:52:28 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 Jun 2006 22:52:28 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <20060629054524.GC5127@mellanox.co.il> Message-ID: <000101c69b40$3463b640$e7d8180a@amr.corp.intel.com> >How about user taking this into account and not arming the CQ / >not polling it until the established event? The CQ could be in use by other QPs. - Sean From halr at voltaire.com Thu Jun 29 04:10:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 07:10:53 -0400 Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_inform.c: In __dump_all_informs, don't scan inform list unless logging has the debug level turned on Message-ID: <1151579451.4541.55584.camel@hal.voltaire.com> OpenSM/osm_inform.c: In __dump_all_informs, don't scan inform list unless logging has the debug level turned on Signed-off-by: Hal Rosenstock Index: opensm/osm_inform.c =================================================================== --- opensm/osm_inform.c (revision 8274) +++ opensm/osm_inform.c (working copy) @@ -179,6 +179,9 @@ __dump_all_informs( OSM_LOG_ENTER( p_log, __dump_all_informs ); + if( !
osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) + goto Exit; + p_list_item = cl_qlist_head( &p_subn->sa_infr_list ); while (p_list_item != cl_qlist_end( &p_subn->sa_infr_list )) { @@ -188,6 +191,7 @@ __dump_all_informs( p_list_item = cl_qlist_next( p_list_item ); } + Exit: OSM_LOG_EXIT( p_log ); } From bugzilla-daemon at openib.org Thu Jun 29 07:34:34 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Thu, 29 Jun 2006 07:34:34 -0700 (PDT) Subject: [openib-general] [Bug 163] New: ibv_ack_async_event seg-fault when requested event is SRQ limit Message-ID: <20060629143434.B011022873F@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=163 Summary: ibv_ack_async_event seg-fault when requested event is SRQ limit Product: OpenFabrics Linux Version: gen2 Platform: Other OS/Version: Other Status: NEW Severity: blocker Priority: P2 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: amip at mellanox.co.il CC: ziv at mellanox.co.il the event->element.srq returned from read() in ibv_get_async_event() is NULL ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sean.hefty at intel.com Thu Jun 29 10:07:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 29 Jun 2006 10:07:38 -0700 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them In-Reply-To: <20060629163857.GT19300@mellanox.co.il> Message-ID: <000001c69b9e$86268fd0$8698070a@amr.corp.intel.com> >This currently includes a single patch from Venkatesh Babu: > IB/core: Set alternate port number when initializing QP attributes. > >that has been checked into openib svn by Sean. Thanks Michael. I will assume that you will push this change in through Roland when he's back. 
- Sean From mshefty at ichips.intel.com Thu Jun 29 09:48:40 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 29 Jun 2006 09:48:40 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: Message-ID: <44A40468.9070600@ichips.intel.com> Rimmer, Todd wrote: > The CM would open the CA, provide its async event callback routine and > perform a special register_cm() verbs call. Of course most CM traffic > would occur on the GSI QP, so this open CA instance was only for this > purpose. This special verb was only available in kernel space (avoiding > security issue of application stealing CM interface and because our CM > was in the kernel anyway). Thanks for the info. I'm considering this sort of approach. - Sean From mst at mellanox.co.il Thu Jun 29 09:38:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 29 Jun 2006 19:38:57 +0300 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them In-Reply-To: <20060628171318.7d97d617.akpm@osdl.org> References: <20060628171318.7d97d617.akpm@osdl.org> Message-ID: <20060629163857.GT19300@mellanox.co.il> Quoting r. Andrew Morton : > > Hi, Andrew - > > > > I have a pile of patches for the ipath driver that I'd like to get in > > during the "open season" window. Roland has his hands full with diapers > > and other sprog paraphernalia as of a few days ago, so I doubt he'll see > > this message soon, much less care about the patches. > > > > Given Roland's presumed unavailability, would the appropriate thing be > > to drop the patches into -mm and then push them along to Linus, or what? > > > > We can do that, sure. Please cc openib and lkml and netdev and > whatever-else-you-can-think of when you send them over. Yes, -mm seems like a good way to get more review. 
Further, in the hope that this will help keep things reasonably stable till Roland comes back, and help everyone see what's being merged, I have created a git branch for all things infiniband going into 2.6.18. You can get at it here: git://www.mellanox.co.il/~git/infiniband mst-for-2.6.18 This currently includes a single patch from Venkatesh Babu: IB/core: Set alternate port number when initializing QP attributes. that has been checked into openib svn by Sean. Please Cc me on infiniband patches that are going to be merged and I'll do my best to compile, test and if it works put them there. If everyone does this, I also hope this will help Roland when he's back to figure out where things stand. Thanks, -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Thu Jun 29 08:19:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 11:19:03 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM/Remote SM: Eliminate some unneeded status checking Message-ID: <1151593785.4541.65570.camel@hal.voltaire.com> OpenSM/Remote SM: Eliminate some unneeded status checking Since osm_remote_sm.c:osm_remote_sm_init cannot fail, don't return any status and don't check it in any consumers of this API Signed-off-by: Hal Rosenstock Index: include/opensm/osm_remote_sm.h =================================================================== --- include/opensm/osm_remote_sm.h (revision 8288) +++ include/opensm/osm_remote_sm.h (working copy) @@ -188,7 +188,7 @@ osm_remote_sm_destroy( * * SYNOPSIS */ -ib_api_status_t +void osm_remote_sm_init( IN osm_remote_sm_t* const p_sm, IN const osm_port_t* const p_port, @@ -205,7 +205,7 @@ osm_remote_sm_init( * [in] Pointer to the SMInfo attribute for this SM. * * RETURN VALUES -* IB_SUCCESS if the SM object was initialized successfully. +* This function does not return a value. * * NOTES * Allows calling other Remote SM methods.
Index: opensm/osm_remote_sm.c =================================================================== --- opensm/osm_remote_sm.c (revision 8277) +++ opensm/osm_remote_sm.c (working copy) @@ -74,7 +74,7 @@ osm_remote_sm_destroy( /********************************************************************** **********************************************************************/ -ib_api_status_t +void osm_remote_sm_init( IN osm_remote_sm_t* const p_sm, IN const osm_port_t* const p_port, @@ -87,5 +87,5 @@ osm_remote_sm_init( p_sm->p_port = p_port; p_sm->smi = *p_smi; - return( IB_SUCCESS ); + return; } Index: opensm/osm_sminfo_rcv.c =================================================================== --- opensm/osm_sminfo_rcv.c (revision 8287) +++ opensm/osm_sminfo_rcv.c (working copy) @@ -568,7 +568,6 @@ __osm_sminfo_rcv_process_get_response( osm_port_t* p_port; ib_net64_t port_guid; osm_remote_sm_t* p_sm; - ib_api_status_t status; osm_signal_t process_get_sm_ret_val = OSM_SIGNAL_NONE; OSM_LOG_ENTER( p_rcv->p_log, __osm_sminfo_rcv_process_get_response ); @@ -647,15 +646,7 @@ __osm_sminfo_rcv_process_get_response( goto Exit; } - status = osm_remote_sm_init( p_sm, p_port, p_smi ); - if( status != IB_SUCCESS ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_sminfo_rcv_process_get_response: ERR 2F15: " - "Other SM object initialization failed (%s)\n", - ib_get_err_str( status ) ); - goto Exit; - } + osm_remote_sm_init( p_sm, p_port, p_smi ); cl_qmap_insert( p_sm_tbl, port_guid, &p_sm->map_item ); } From trimmer at silverstorm.com Thu Jun 29 05:48:25 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Thu, 29 Jun 2006 08:48:25 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <44A30F95.2050408@ichips.intel.com> Message-ID: > -----Original Message----- > From: openib Sean Hefty > Sent: Wednesday, June 28, 2006 7:24 PM > > Roland Dreier wrote: > >>I suggest the following design: the CMA would replace the 
event handler > >>provided with the qp_init_attr struct with a callback of its own and > >>keep the original handler/context on a private structure. > > I should also point out that the proposed design will not work for > userspace. > I'm hesitant to make this change until a solution for userspace can also > be > found, in the hope that a common fix can be shared. > > - Sean The approach we took in our proprietary stack was to provide a verbs driver interface for the CM to register itself with the verbs driver. The CM would open the CA, provide its async event callback routine and perform a special register_cm() verbs call. Of course most CM traffic would occur on the GSI QP, so this open CA instance was only for this purpose. This special verb was only available in kernel space (avoiding security issue of application stealing CM interface and because our CM was in the kernel anyway). When the CA got an Async Event for a Communication Established event, it would deliver it to both the CM (regardless of which QP it was for) and to the open instance owning the QP. All other async events were only delivered to the appropriate open instance. This put the handling in the kernel and at a low level where it would not impact handling of other async events and avoided complications of user vs kernel async event filters. Depending on the design of APM, the CM might also be interested in APM related Async Events (in our design the application had an opportunity to select a new alternate path, so it was more appropriate to let the ULP handle these events directly). 
Todd Rimmer From halr at voltaire.com Thu Jun 29 08:09:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 11:09:33 -0400 Subject: [openib-general] [PATCHv2] OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <1151593772.4541.65566.camel@hal.voltaire.com> OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Base port 0 is constrained to have LMC of 0 whereas enhanced switch port 0 is not. Enhanced switch port 0 is handled more like CA and router ports in terms of LMC. Signed-off-by: Hal Rosenstock Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 8277) +++ opensm/osm_lid_mgr.c (working copy) @@ -94,6 +94,7 @@ #include #include #include +#include #include #include #include @@ -351,6 +352,8 @@ __osm_lid_mgr_init_sweep( osm_lid_mgr_range_t *p_range = NULL; osm_port_t *p_port; cl_qmap_t *p_port_guid_tbl; + osm_switch_t *p_sw; + ib_switch_info_t *p_si; uint8_t lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc); uint16_t lmc_mask; uint16_t req_lid, num_lids; @@ -436,7 +439,20 @@ __osm_lid_mgr_init_sweep( IB_NODE_TYPE_SWITCH ) num_lids = lmc_num_lids; else - num_lids = 1; + { + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + ib_switch_info_is_enhanced_port0(p_si)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } + } if ((num_lids != 1) && (((db_min_lid & lmc_mask) != db_min_lid) || @@ -539,7 +555,18 @@ __osm_lid_mgr_init_sweep( } else { - num_lids = 1; + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + ib_switch_info_is_enhanced_port0(p_si)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } }
/* Make sure the lid is aligned */ @@ -798,6 +825,8 @@ __osm_lid_mgr_get_port_lid( uint8_t num_lids = (1 << p_mgr->p_subn->opt.lmc); int lid_changed = 0; uint16_t lmc_mask; + osm_switch_t *p_sw; + ib_switch_info_t *p_si; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_get_port_lid ); @@ -809,10 +838,19 @@ __osm_lid_mgr_get_port_lid( /* get the lid from the guid2lid */ guid = cl_ntoh64( osm_port_get_guid( p_port ) ); - /* if the port is a switch then we only need one lid */ + /* if the port is a switch with base switch port 0 then we only need one lid */ if( osm_node_get_type( osm_port_get_parent_node( p_port ) ) == IB_NODE_TYPE_SWITCH ) - num_lids = 1; + { + /* Determine if base switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + !ib_switch_info_is_enhanced_port0(p_si)) + { + num_lids = 1; + } + } /* if the port matches the guid2lid */ if (!osm_db_guid2lid_get( p_mgr->p_g2l, guid, &min_lid, &max_lid)) From halr at voltaire.com Thu Jun 29 08:15:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 11:15:49 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_sa_portinfo_record.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <1151593780.4541.65568.camel@hal.voltaire.com> OpenSM/osm_sa_portinfo_record.c: Support enhanced switch port 0 for LMC > 0 In __osm_sa_pir_create, handle enhanced switch port 0 (and the possibility that it's LMC > 0) Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 8277) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -60,6 +60,7 @@ #include #include #include +#include #include #include #include @@ -197,24 +198,34 @@ __osm_sa_pir_create( uint16_t max_lid_ho; uint16_t base_lid_ho; uint16_t match_lid_ho; + osm_physp_t *p_node_physp; + osm_switch_t *p_sw; + 
ib_switch_info_t *p_si; OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_pir_create ); - if(p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) + if (p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) { - lmc = 0; - base_lid_ho = cl_ntoh16( - osm_physp_get_base_lid( - osm_node_get_physp_ptr(p_physp->p_node, 0)) - ); - max_lid_ho = base_lid_ho; + p_node_physp = osm_node_get_physp_ptr( p_physp->p_node, 0 ); + base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_node_physp ) ); + p_sw = osm_get_switch_by_guid( p_rcv->p_subn, + osm_physp_get_port_guid( p_node_physp ) ); + if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || + !ib_switch_info_is_enhanced_port0( p_si )) + { + lmc = 0; + } + else + { + lmc = osm_physp_get_lmc( p_node_physp ); + } } else { lmc = osm_physp_get_lmc( p_physp ); base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); - max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); } + max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); if( p_ctxt->comp_mask & IB_PIR_COMPMASK_LID ) { From trimmer at silverstorm.com Thu Jun 29 05:12:45 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Thu, 29 Jun 2006 08:12:45 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <20060629054524.GC5127@mellanox.co.il> Message-ID: > -----Original Message----- > From: Michael S. Tsirkin > Sent: Thursday, June 29, 2006 1:45 AM > > Quoting r. Roland Dreier : > > Subject: Re: design for communication established affiliated > asynchronous event handling > > > > >I suggest the following design: the CMA would replace the event handler > > >provided with the qp_init_attr struct with a callback of its own and > > >keep the original handler/context on a private structure. > > > > This is probably fine. There is one further situation where the > > connection needs to be established, beyond RTU and the communication > > established async event. Namely, if a receive completion is polled. 
> > Since async events are, well, asynchronous, there's no guarantee that > > the communication established event will be reported any time soon... > > How about user taking this into account and not arming the CQ / > not polling it until the established event? If the ULP is properly designed, the asynchronous-ness of the event (or RTU for that matter) should not be an issue. Per the IBTA CM state machine, the passive side upon sending the REP should move its endpoint (the QP and the ULPs state machine) state to Ready to Receive. QPs in RTR can have send WQEs posted to them, however they will not be sent until the QP is moved to RTS. This means the ULP while in RTR can perform its normal receive completion handling and even build and post send requests in response to such received messages. Such sends will be queued until the QP later moves to RTS. Most ULPs have some sort of application level flow control. This may be simply RNR NAK or it could be a credit system (such as SRP) or an additional application initialization protocol (such as SDP). Hence the active side will generally perform limited sends (typically one) to the passive side until it gets a response from the passive side (which won't happen until the QP is in RTS). Hence for a good ULP protocol, there is no risk of overflowing the send Q while waiting to move to RTS. The only thing the passive side ULP should not do until in RTS is any sort of "periodic status messages which don't require active side acknowledgement". Since the RTS state could be delayed, the ULP should not risk overflowing its send Q with such messages. Most of the standard ULP protocols (SDP, etc) do not have such messages or they require ULP level protocol negotiation before they are activated. Hence if this is all properly handled, the passive side's RTU/Async Event handling sequence will merely move the QP to RTS and notify the ULP. 
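The RTR-versus-RTS queuing behavior described above can be modeled in a few lines of C. This is a toy model of the rule — send WQEs may be posted while the QP is still in RTR, but none are processed until the transition to RTS — with invented names (toy_qp, post_send, move_to_rts), not the verbs API:

```c
#include <assert.h>

enum qp_state { RTR, RTS };

/* Toy model of the queuing rule: work requests posted in RTR are held
 * on the send queue; moving to RTS starts draining them. */
struct toy_qp {
    enum qp_state state;
    int queued;  /* posted but not yet sent (QP still in RTR) */
    int sent;    /* processed once the QP is in RTS */
};

static void post_send(struct toy_qp *qp)
{
    if (qp->state == RTR)
        qp->queued++;          /* held back until RTS */
    else
        qp->sent++;            /* RTS: goes out immediately */
}

static void move_to_rts(struct toy_qp *qp)
{
    qp->state = RTS;
    qp->sent += qp->queued;    /* send queue starts draining */
    qp->queued = 0;
}
```

The point of the model is the one made above: nothing is lost by posting early, so the passive side can do all its real work before sending the REP and treat the RTU/async event as little more than a state transition.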
The ULP will likely do very limited work for this notification (perhaps just a state transition) as all the real work should have been done before sending the REP. The movement to RTS will enable the QP to start processing its Send Q and everything will be good. Taking this approach keeps the CM/CMA and ULP simpler in design and merely allows the RTS/RTU/Async Event handling to be another event in a state machine. Todd Rimmer From eitan at mellanox.co.il Thu Jun 29 12:51:12 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 29 Jun 2006 22:51:12 +0300 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM/Remote SM: Eliminate some unneeded status checking Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891B@mtlexch01.mtl.com> sure > OpenSM/Remote SM: Eliminate some unneeded status checking > > Since osm_remote_sm.c:osm_remote_sm_init cannot fail, don't return any > status and don't check it in any consumers of this API > > Signed-off-by: Hal Rosenstock > From eitan at mellanox.co.il Thu Jun 29 12:52:18 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 29 Jun 2006 22:52:18 +0300 Subject: [openib-general] [PATCH] OpenSM/osm_sa_portinfo_record.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891C@mtlexch01.mtl.com> Right !
> > OpenSM/osm_sa_portinfo_record.c: Support enhanced switch port 0 for LMC > > 0 > > In __osm_sa_pir_create, handle enhanced switch port 0 (and the > possibility that it's LMC > 0) > > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_sa_portinfo_record.c > =================================================================== > --- opensm/osm_sa_portinfo_record.c (revision 8277) > +++ opensm/osm_sa_portinfo_record.c (working copy) > @@ -60,6 +60,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -197,24 +198,34 @@ __osm_sa_pir_create( > uint16_t max_lid_ho; > uint16_t base_lid_ho; > uint16_t match_lid_ho; > + osm_physp_t *p_node_physp; > + osm_switch_t *p_sw; > + ib_switch_info_t *p_si; > > OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_pir_create ); > > - if(p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) > + if (p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) > { > - lmc = 0; > - base_lid_ho = cl_ntoh16( > - osm_physp_get_base_lid( > - osm_node_get_physp_ptr(p_physp->p_node, 0)) > - ); > - max_lid_ho = base_lid_ho; > + p_node_physp = osm_node_get_physp_ptr( p_physp->p_node, 0 ); > + base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_node_physp ) ); > + p_sw = osm_get_switch_by_guid( p_rcv->p_subn, > + osm_physp_get_port_guid( p_node_physp ) ); > + if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || > + !ib_switch_info_is_enhanced_port0( p_si )) > + { > + lmc = 0; > + } > + else > + { > + lmc = osm_physp_get_lmc( p_node_physp ); > + } > } > else > { > lmc = osm_physp_get_lmc( p_physp ); > base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); > - max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); > } > + max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); > > if( p_ctxt->comp_mask & IB_PIR_COMPMASK_LID ) > { > From eitan at mellanox.co.il Thu Jun 29 12:54:33 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 29 Jun 2006 22:54:33 +0300 Subject: [openib-general] [PATCHv2] 
OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891D@mtlexch01.mtl.com> Hi Hal, I think the check for num lids is so similar it deserves an inline function. What do you say? I refer to: > + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && > + ib_switch_info_is_enhanced_port0(p_si)) > + { > + num_lids = lmc_num_lids; > + } > + else > + { > + num_lids = 1; > + } > + } > From bos at pathscale.com Thu Jun 29 14:40:51 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:40:51 -0700 Subject: [openib-general] [PATCH 0 of 39] ipath - bug fixes, performance enhancements, and portability improvements Message-ID: Hi, Andrew - These patches bring the ipath driver up to date with a number of bug fixes, performance improvements, and better PowerPC support. There are a few whitespace and formatting patches in the series, but they're all self-contained. The patches have been tested internally, and shouldn't contain anything controversial. My hope is that they'll sit in -mm for a little bit, and make it into an early 2.6.18 -rc kernel. Thanks, Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r ebf646d10db0 -r c93c2b42d279 drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 @@ -1053,32 +1053,32 @@ static inline void ipath_rc_rcv_resp(str goto ack_done; } rdma_read: - if (unlikely(qp->s_state != OP(RDMA_READ_REQUEST))) - goto ack_done; - if (unlikely(tlen != (hdrsize + pmtu + 4))) - goto ack_done; - if (unlikely(pmtu >= qp->s_len)) - goto ack_done; - /* We got a response so update the timeout.
*/ - if (unlikely(qp->s_last == qp->s_tail || - get_swqe_ptr(qp, qp->s_last)->wr.opcode != - IB_WR_RDMA_READ)) - goto ack_done; - spin_lock(&dev->pending_lock); - if (qp->s_rnr_timeout == 0 && !list_empty(&qp->timerwait)) - list_move_tail(&qp->timerwait, - &dev->pending[dev->pending_index]); - spin_unlock(&dev->pending_lock); - /* - * Update the RDMA receive state but do the copy w/o holding the - * locks and blocking interrupts. XXX Yet another place that - * affects relaxed RDMA order since we don't want s_sge modified. - */ - qp->s_len -= pmtu; - qp->s_last_psn = psn; - spin_unlock_irqrestore(&qp->s_lock, flags); - ipath_copy_sge(&qp->s_sge, data, pmtu); - goto bail; + if (unlikely(qp->s_state != OP(RDMA_READ_REQUEST))) + goto ack_done; + if (unlikely(tlen != (hdrsize + pmtu + 4))) + goto ack_done; + if (unlikely(pmtu >= qp->s_len)) + goto ack_done; + /* We got a response so update the timeout. */ + if (unlikely(qp->s_last == qp->s_tail || + get_swqe_ptr(qp, qp->s_last)->wr.opcode != + IB_WR_RDMA_READ)) + goto ack_done; + spin_lock(&dev->pending_lock); + if (qp->s_rnr_timeout == 0 && !list_empty(&qp->timerwait)) + list_move_tail(&qp->timerwait, + &dev->pending[dev->pending_index]); + spin_unlock(&dev->pending_lock); + /* + * Update the RDMA receive state but do the copy w/o holding the + * locks and blocking interrupts. XXX Yet another place that + * affects relaxed RDMA order since we don't want s_sge modified. + */ + qp->s_len -= pmtu; + qp->s_last_psn = psn; + spin_unlock_irqrestore(&qp->s_lock, flags); + ipath_copy_sge(&qp->s_sge, data, pmtu); + goto bail; case OP(RDMA_READ_RESPONSE_LAST): /* ACKs READ req. 
*/ From bos at pathscale.com Thu Jun 29 14:40:52 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:40:52 -0700 Subject: [openib-general] [PATCH 1 of 39] IB/ipath - Name zero counter offsets so it's clear they aren't counters In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 28e3d8204fdb -r addf90abc724 drivers/infiniband/hw/ipath/ipath_mad.c --- a/drivers/infiniband/hw/ipath/ipath_mad.c Fri Jun 23 22:47:27 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:25 2006 -0700 @@ -215,7 +215,7 @@ static int recv_subn_get_portinfo(struct /* P_KeyViolations are counted by hardware. */ pip->pkey_violations = cpu_to_be16((ipath_layer_get_cr_errpkey(dev->dd) - - dev->n_pkey_violations) & 0xFFFF); + dev->z_pkey_violations) & 0xFFFF); pip->qkey_violations = cpu_to_be16(dev->qkey_violations); /* Only the hardware GUID is supported for now */ pip->guid_cap = 1; @@ -389,7 +389,7 @@ static int recv_subn_set_portinfo(struct * later. */ if (pip->pkey_violations == 0) - dev->n_pkey_violations = + dev->z_pkey_violations = ipath_layer_get_cr_errpkey(dev->dd); if (pip->qkey_violations == 0) @@ -844,18 +844,18 @@ static int recv_pma_get_portcounters(str ipath_layer_get_counters(dev->dd, &cntrs); /* Adjust counters for any resets done. 
*/ - cntrs.symbol_error_counter -= dev->n_symbol_error_counter; + cntrs.symbol_error_counter -= dev->z_symbol_error_counter; cntrs.link_error_recovery_counter -= - dev->n_link_error_recovery_counter; - cntrs.link_downed_counter -= dev->n_link_downed_counter; + dev->z_link_error_recovery_counter; + cntrs.link_downed_counter -= dev->z_link_downed_counter; cntrs.port_rcv_errors += dev->rcv_errors; - cntrs.port_rcv_errors -= dev->n_port_rcv_errors; - cntrs.port_rcv_remphys_errors -= dev->n_port_rcv_remphys_errors; - cntrs.port_xmit_discards -= dev->n_port_xmit_discards; - cntrs.port_xmit_data -= dev->n_port_xmit_data; - cntrs.port_rcv_data -= dev->n_port_rcv_data; - cntrs.port_xmit_packets -= dev->n_port_xmit_packets; - cntrs.port_rcv_packets -= dev->n_port_rcv_packets; + cntrs.port_rcv_errors -= dev->z_port_rcv_errors; + cntrs.port_rcv_remphys_errors -= dev->z_port_rcv_remphys_errors; + cntrs.port_xmit_discards -= dev->z_port_xmit_discards; + cntrs.port_xmit_data -= dev->z_port_xmit_data; + cntrs.port_rcv_data -= dev->z_port_rcv_data; + cntrs.port_xmit_packets -= dev->z_port_xmit_packets; + cntrs.port_rcv_packets -= dev->z_port_rcv_packets; memset(pmp->data, 0, sizeof(pmp->data)); @@ -928,10 +928,10 @@ static int recv_pma_get_portcounters_ext &rpkts, &xwait); /* Adjust counters for any resets done. 
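For readers following the rename in this patch: the chip's counters are monotonic and cannot be reset, so the driver emulates a PMA "set to zero" by snapshotting the current hardware value into a z_*-prefixed baseline field and reporting the difference on every read. That is why the fields are offsets, not counters. A minimal standalone sketch of the pattern (the struct and function names here are invented for illustration and are not the driver's API):

```c
#include <stdint.h>

/* A monotonic hardware counter paired with its "zero" baseline.
 * The z_ prefix marks a snapshot taken at the last emulated reset. */
struct port_counters {
	uint64_t hw_port_xmit_packets;	/* only ever increases */
	uint64_t z_port_xmit_packets;	/* baseline at last reset */
};

/* Emulate a PMA reset: remember where the hardware counter was. */
static void reset_xmit_packets(struct port_counters *c)
{
	c->z_port_xmit_packets = c->hw_port_xmit_packets;
}

/* Value reported to the agent: growth since the last reset. */
static uint64_t read_xmit_packets(const struct port_counters *c)
{
	return c->hw_port_xmit_packets - c->z_port_xmit_packets;
}
```

The subtraction is exactly what the recv_pma_get_portcounters() hunks above do for each counter the PMA exposes.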
*/ - swords -= dev->n_port_xmit_data; - rwords -= dev->n_port_rcv_data; - spkts -= dev->n_port_xmit_packets; - rpkts -= dev->n_port_rcv_packets; + swords -= dev->z_port_xmit_data; + rwords -= dev->z_port_rcv_data; + spkts -= dev->z_port_xmit_packets; + rpkts -= dev->z_port_rcv_packets; memset(pmp->data, 0, sizeof(pmp->data)); @@ -967,37 +967,37 @@ static int recv_pma_set_portcounters(str ipath_layer_get_counters(dev->dd, &cntrs); if (p->counter_select & IB_PMA_SEL_SYMBOL_ERROR) - dev->n_symbol_error_counter = cntrs.symbol_error_counter; + dev->z_symbol_error_counter = cntrs.symbol_error_counter; if (p->counter_select & IB_PMA_SEL_LINK_ERROR_RECOVERY) - dev->n_link_error_recovery_counter = + dev->z_link_error_recovery_counter = cntrs.link_error_recovery_counter; if (p->counter_select & IB_PMA_SEL_LINK_DOWNED) - dev->n_link_downed_counter = cntrs.link_downed_counter; + dev->z_link_downed_counter = cntrs.link_downed_counter; if (p->counter_select & IB_PMA_SEL_PORT_RCV_ERRORS) - dev->n_port_rcv_errors = + dev->z_port_rcv_errors = cntrs.port_rcv_errors + dev->rcv_errors; if (p->counter_select & IB_PMA_SEL_PORT_RCV_REMPHYS_ERRORS) - dev->n_port_rcv_remphys_errors = + dev->z_port_rcv_remphys_errors = cntrs.port_rcv_remphys_errors; if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DISCARDS) - dev->n_port_xmit_discards = cntrs.port_xmit_discards; + dev->z_port_xmit_discards = cntrs.port_xmit_discards; if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DATA) - dev->n_port_xmit_data = cntrs.port_xmit_data; + dev->z_port_xmit_data = cntrs.port_xmit_data; if (p->counter_select & IB_PMA_SEL_PORT_RCV_DATA) - dev->n_port_rcv_data = cntrs.port_rcv_data; + dev->z_port_rcv_data = cntrs.port_rcv_data; if (p->counter_select & IB_PMA_SEL_PORT_XMIT_PACKETS) - dev->n_port_xmit_packets = cntrs.port_xmit_packets; + dev->z_port_xmit_packets = cntrs.port_xmit_packets; if (p->counter_select & IB_PMA_SEL_PORT_RCV_PACKETS) - dev->n_port_rcv_packets = cntrs.port_rcv_packets; + dev->z_port_rcv_packets = 
cntrs.port_rcv_packets; return recv_pma_get_portcounters(pmp, ibdev, port); } @@ -1014,16 +1014,16 @@ static int recv_pma_set_portcounters_ext &rpkts, &xwait); if (p->counter_select & IB_PMA_SELX_PORT_XMIT_DATA) - dev->n_port_xmit_data = swords; + dev->z_port_xmit_data = swords; if (p->counter_select & IB_PMA_SELX_PORT_RCV_DATA) - dev->n_port_rcv_data = rwords; + dev->z_port_rcv_data = rwords; if (p->counter_select & IB_PMA_SELX_PORT_XMIT_PACKETS) - dev->n_port_xmit_packets = spkts; + dev->z_port_xmit_packets = spkts; if (p->counter_select & IB_PMA_SELX_PORT_RCV_PACKETS) - dev->n_port_rcv_packets = rpkts; + dev->z_port_rcv_packets = rpkts; if (p->counter_select & IB_PMA_SELX_PORT_UNI_XMIT_PACKETS) dev->n_unicast_xmit = 0; @@ -1285,18 +1285,18 @@ int ipath_process_mad(struct ib_device * ipath_layer_get_counters(to_idev(ibdev)->dd, &cntrs); dev->rcv_errors++; - dev->n_symbol_error_counter = cntrs.symbol_error_counter; - dev->n_link_error_recovery_counter = + dev->z_symbol_error_counter = cntrs.symbol_error_counter; + dev->z_link_error_recovery_counter = cntrs.link_error_recovery_counter; - dev->n_link_downed_counter = cntrs.link_downed_counter; - dev->n_port_rcv_errors = cntrs.port_rcv_errors + 1; - dev->n_port_rcv_remphys_errors = + dev->z_link_downed_counter = cntrs.link_downed_counter; + dev->z_port_rcv_errors = cntrs.port_rcv_errors + 1; + dev->z_port_rcv_remphys_errors = cntrs.port_rcv_remphys_errors; - dev->n_port_xmit_discards = cntrs.port_xmit_discards; - dev->n_port_xmit_data = cntrs.port_xmit_data; - dev->n_port_rcv_data = cntrs.port_rcv_data; - dev->n_port_xmit_packets = cntrs.port_xmit_packets; - dev->n_port_rcv_packets = cntrs.port_rcv_packets; + dev->z_port_xmit_discards = cntrs.port_xmit_discards; + dev->z_port_xmit_data = cntrs.port_xmit_data; + dev->z_port_rcv_data = cntrs.port_rcv_data; + dev->z_port_xmit_packets = cntrs.port_xmit_packets; + dev->z_port_rcv_packets = cntrs.port_rcv_packets; } switch (in_mad->mad_hdr.mgmt_class) { case 
IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: diff -r 28e3d8204fdb -r addf90abc724 drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Jun 23 22:47:27 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:25 2006 -0700 @@ -646,7 +646,7 @@ static int ipath_query_port(struct ib_de props->max_msg_sz = 4096; props->pkey_tbl_len = ipath_layer_get_npkeys(dev->dd); props->bad_pkey_cntr = ipath_layer_get_cr_errpkey(dev->dd) - - dev->n_pkey_violations; + dev->z_pkey_violations; props->qkey_viol_cntr = dev->qkey_violations; props->active_width = IB_WIDTH_4X; /* See rate_show() */ diff -r 28e3d8204fdb -r addf90abc724 drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Jun 23 22:47:27 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:25 2006 -0700 @@ -442,17 +442,17 @@ struct ipath_ibdev { u64 n_unicast_rcv; /* total unicast packets received */ u64 n_multicast_xmit; /* total multicast packets sent */ u64 n_multicast_rcv; /* total multicast packets received */ - u64 n_symbol_error_counter; /* starting count for PMA */ - u64 n_link_error_recovery_counter; /* starting count for PMA */ - u64 n_link_downed_counter; /* starting count for PMA */ - u64 n_port_rcv_errors; /* starting count for PMA */ - u64 n_port_rcv_remphys_errors; /* starting count for PMA */ - u64 n_port_xmit_discards; /* starting count for PMA */ - u64 n_port_xmit_data; /* starting count for PMA */ - u64 n_port_rcv_data; /* starting count for PMA */ - u64 n_port_xmit_packets; /* starting count for PMA */ - u64 n_port_rcv_packets; /* starting count for PMA */ - u32 n_pkey_violations; /* starting count for PMA */ + u64 z_symbol_error_counter; /* starting count for PMA */ + u64 z_link_error_recovery_counter; /* starting count for PMA */ + u64 z_link_downed_counter; /* starting count for PMA */ + u64 z_port_rcv_errors; /* starting count for PMA */ + u64 z_port_rcv_remphys_errors; /* starting count 
for PMA */ + u64 z_port_xmit_discards; /* starting count for PMA */ + u64 z_port_xmit_data; /* starting count for PMA */ + u64 z_port_rcv_data; /* starting count for PMA */ + u64 z_port_xmit_packets; /* starting count for PMA */ + u64 z_port_rcv_packets; /* starting count for PMA */ + u32 z_pkey_violations; /* starting count for PMA */ u32 n_rc_resends; u32 n_rc_acks; u32 n_rc_qacks; From bos at pathscale.com Thu Jun 29 14:40:54 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:40:54 -0700 Subject: [openib-general] [PATCH 3 of 39] IB/ipath - Share more common code between RC and UC protocols In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:25 2006 -0700 @@ -709,9 +709,7 @@ struct ib_qp *ipath_create_qp(struct ib_ spin_lock_init(&qp->r_rq.lock); atomic_set(&qp->refcount, 0); init_waitqueue_head(&qp->wait); - tasklet_init(&qp->s_task, - init_attr->qp_type == IB_QPT_RC ? - ipath_do_rc_send : ipath_do_uc_send, + tasklet_init(&qp->s_task, ipath_do_ruc_send, (unsigned long)qp); INIT_LIST_HEAD(&qp->piowait); INIT_LIST_HEAD(&qp->timerwait); @@ -896,9 +894,9 @@ void ipath_get_credit(struct ipath_qp *q * as many packets as we like. Otherwise, we have to * honor the credit field. 
*/ - if (credit == IPS_AETH_CREDIT_INVAL) { + if (credit == IPS_AETH_CREDIT_INVAL) qp->s_lsn = (u32) -1; - } else if (qp->s_lsn != (u32) -1) { + else if (qp->s_lsn != (u32) -1) { /* Compute new LSN (i.e., MSN + credit) */ credit = (aeth + credit_table[credit]) & IPS_MSN_MASK; if (ipath_cmp24(credit, qp->s_lsn) > 0) diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 @@ -73,9 +73,9 @@ static void ipath_init_restart(struct ip * Return bth0 if constructed; otherwise, return 0. * Note the QP s_lock must be held. */ -static inline u32 ipath_make_rc_ack(struct ipath_qp *qp, - struct ipath_other_headers *ohdr, - u32 pmtu) +u32 ipath_make_rc_ack(struct ipath_qp *qp, + struct ipath_other_headers *ohdr, + u32 pmtu) { struct ipath_sge_state *ss; u32 hwords; @@ -96,8 +96,7 @@ static inline u32 ipath_make_rc_ack(stru if (len > pmtu) { len = pmtu; qp->s_ack_state = OP(RDMA_READ_RESPONSE_FIRST); - } - else + } else qp->s_ack_state = OP(RDMA_READ_RESPONSE_ONLY); qp->s_rdma_len -= len; bth0 = qp->s_ack_state << 24; @@ -177,9 +176,9 @@ static inline u32 ipath_make_rc_ack(stru * Return 1 if constructed; otherwise, return 0. * Note the QP s_lock must be held. */ -static inline int ipath_make_rc_req(struct ipath_qp *qp, - struct ipath_other_headers *ohdr, - u32 pmtu, u32 *bth0p, u32 *bth2p) +int ipath_make_rc_req(struct ipath_qp *qp, + struct ipath_other_headers *ohdr, + u32 pmtu, u32 *bth0p, u32 *bth2p) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ipath_sge_state *ss; @@ -497,160 +496,33 @@ done: return 0; } -static inline void ipath_make_rc_grh(struct ipath_qp *qp, - struct ib_global_route *grh, - u32 nwords) -{ - struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - - /* GRH header size in 32-bit words. 
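The credit handling in ipath_get_credit() above leans on ipath_cmp24() to order 24-bit PSN/MSN values despite wraparound. A hedged sketch of how such a comparison is commonly implemented (an assumption based on standard serial-number arithmetic, not necessarily the driver's exact code): shift the 24-bit difference up and arithmetically back down so it is sign-extended, making values that wrapped past 0xFFFFFF still compare correctly.

```c
#include <stdint.h>

/* Compare two 24-bit serial numbers a and b, ignoring upper bits.
 * Returns <0 if a precedes b, 0 if equal, >0 if a follows b, even
 * across the 0xFFFFFF -> 0x000000 wrap.  Relies on arithmetic right
 * shift of signed values, as kernel code conventionally does. */
static int cmp24(uint32_t a, uint32_t b)
{
	return ((int32_t)((a - b) << 8)) >> 8;
}
```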
*/ - qp->s_hdrwords += 10; - qp->s_hdr.u.l.grh.version_tclass_flow = - cpu_to_be32((6 << 28) | - (grh->traffic_class << 20) | - grh->flow_label); - qp->s_hdr.u.l.grh.paylen = - cpu_to_be16(((qp->s_hdrwords - 12) + nwords + - SIZE_OF_CRC) << 2); - /* next_hdr is defined by C8-7 in ch. 8.4.1 */ - qp->s_hdr.u.l.grh.next_hdr = 0x1B; - qp->s_hdr.u.l.grh.hop_limit = grh->hop_limit; - /* The SGID is 32-bit aligned. */ - qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; - qp->s_hdr.u.l.grh.sgid.global.interface_id = - ipath_layer_get_guid(dev->dd); - qp->s_hdr.u.l.grh.dgid = grh->dgid; -} - /** - * ipath_do_rc_send - perform a send on an RC QP - * @data: contains a pointer to the QP + * send_rc_ack - Construct an ACK packet and send it + * @qp: a pointer to the QP * - * Process entries in the send work queue until credit or queue is - * exhausted. Only allow one CPU to send a packet per QP (tasklet). - * Otherwise, after we drop the QP s_lock, two threads could send - * packets out of order. + * This is called from ipath_rc_rcv() and only uses the receive + * side QP state. + * Note that RDMA reads are handled in the send side QP state and tasklet. */ -void ipath_do_rc_send(unsigned long data) -{ - struct ipath_qp *qp = (struct ipath_qp *)data; - struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - unsigned long flags; - u16 lrh0; - u32 nwords; - u32 extra_bytes; - u32 bth0; - u32 bth2; - u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); - struct ipath_other_headers *ohdr; - - if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) - goto bail; - - if (unlikely(qp->remote_ah_attr.dlid == - ipath_layer_get_lid(dev->dd))) { - struct ib_wc wc; - - /* - * Pass in an uninitialized ib_wc to be consistent with - * other places where ipath_ruc_loopback() is called. - */ - ipath_ruc_loopback(qp, &wc); - goto clear; - } - - ohdr = &qp->s_hdr.u.oth; - if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) - ohdr = &qp->s_hdr.u.l.oth; - -again: - /* Check for a constructed packet to be sent. 
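The GRH construction being deleted here (and re-added as the shared ipath_make_grh() later in this patch) packs the IPv6-style version/traffic-class/flow-label word and computes paylen as the byte count of everything after the GRH: the header words minus the two LRH words, plus the payload words and the one-word ICRC, times four bytes per word. A small sketch of that arithmetic (helper names are hypothetical; SIZE_OF_CRC of one 32-bit word matches the driver's usage):

```c
#include <stdint.h>

#define SIZE_OF_CRC 1	/* ICRC occupies one 32-bit word */

/* IB GRH first word: IP version 6, traffic class, 20-bit flow label. */
static uint32_t grh_version_tclass_flow(uint32_t tclass, uint32_t flow)
{
	return (6u << 28) | (tclass << 20) | flow;
}

/* Bytes following the GRH: header words excluding the 2 LRH words,
 * plus payload words and the ICRC word, scaled to bytes. */
static uint32_t grh_paylen_bytes(uint32_t hwords, uint32_t nwords)
{
	return (hwords - 2 + nwords + SIZE_OF_CRC) << 2;
}
```

For the ACK case in send_rc_ack(), hwords is 6 (LRH+BTH+AETH = 24 bytes) with no payload, so paylen covers BTH, AETH, and the ICRC.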
*/ - if (qp->s_hdrwords != 0) { - /* - * If no PIO bufs are available, return. An interrupt will - * call ipath_ib_piobufavail() when one is available. - */ - _VERBS_INFO("h %u %p\n", qp->s_hdrwords, &qp->s_hdr); - _VERBS_INFO("d %u %p %u %p %u %u %u %u\n", qp->s_cur_size, - qp->s_cur_sge->sg_list, - qp->s_cur_sge->num_sge, - qp->s_cur_sge->sge.vaddr, - qp->s_cur_sge->sge.sge_length, - qp->s_cur_sge->sge.length, - qp->s_cur_sge->sge.m, - qp->s_cur_sge->sge.n); - if (ipath_verbs_send(dev->dd, qp->s_hdrwords, - (u32 *) &qp->s_hdr, qp->s_cur_size, - qp->s_cur_sge)) { - ipath_no_bufs_available(qp, dev); - goto bail; - } - dev->n_unicast_xmit++; - /* Record that we sent the packet and s_hdr is empty. */ - qp->s_hdrwords = 0; - } - - /* - * The lock is needed to synchronize between setting - * qp->s_ack_state, resend timer, and post_send(). - */ - spin_lock_irqsave(&qp->s_lock, flags); - - /* Sending responses has higher priority over sending requests. */ - if (qp->s_ack_state != OP(ACKNOWLEDGE) && - (bth0 = ipath_make_rc_ack(qp, ohdr, pmtu)) != 0) - bth2 = qp->s_ack_psn++ & IPS_PSN_MASK; - else if (!ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2)) - goto done; - - spin_unlock_irqrestore(&qp->s_lock, flags); - - /* Construct the header. 
*/ - extra_bytes = (4 - qp->s_cur_size) & 3; - nwords = (qp->s_cur_size + extra_bytes) >> 2; - lrh0 = IPS_LRH_BTH; - if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { - ipath_make_rc_grh(qp, &qp->remote_ah_attr.grh, nwords); - lrh0 = IPS_LRH_GRH; - } - lrh0 |= qp->remote_ah_attr.sl << 4; - qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); - qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); - qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + nwords + - SIZE_OF_CRC); - qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); - bth0 |= ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); - bth0 |= extra_bytes << 20; - ohdr->bth[0] = cpu_to_be32(bth0); - ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); - ohdr->bth[2] = cpu_to_be32(bth2); - - /* Check for more work to do. */ - goto again; - -done: - spin_unlock_irqrestore(&qp->s_lock, flags); -clear: - clear_bit(IPATH_S_BUSY, &qp->s_flags); -bail: - return; -} - static void send_rc_ack(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); u16 lrh0; u32 bth0; + u32 hwords; + struct ipath_ib_header hdr; struct ipath_other_headers *ohdr; /* Construct the header. */ - ohdr = &qp->s_hdr.u.oth; + ohdr = &hdr.u.oth; lrh0 = IPS_LRH_BTH; /* header size in 32-bit words LRH+BTH+AETH = (8+12+4)/4. 
*/ - qp->s_hdrwords = 6; + hwords = 6; if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { - ipath_make_rc_grh(qp, &qp->remote_ah_attr.grh, 0); - ohdr = &qp->s_hdr.u.l.oth; + hwords += ipath_make_grh(dev, &hdr.u.l.grh, + &qp->remote_ah_attr.grh, + hwords, 0); + ohdr = &hdr.u.l.oth; lrh0 = IPS_LRH_GRH; } bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); @@ -658,15 +530,14 @@ static void send_rc_ack(struct ipath_qp if (qp->s_ack_state >= OP(COMPARE_SWAP)) { bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); - qp->s_hdrwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; - } - else + hwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; + } else bth0 |= OP(ACKNOWLEDGE) << 24; lrh0 |= qp->remote_ah_attr.sl << 4; - qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); - qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); - qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + SIZE_OF_CRC); - qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); + hdr.lrh[0] = cpu_to_be16(lrh0); + hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + hdr.lrh[2] = cpu_to_be16(hwords + SIZE_OF_CRC); + hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); ohdr->bth[0] = cpu_to_be32(bth0); ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); ohdr->bth[2] = cpu_to_be32(qp->s_ack_psn & IPS_PSN_MASK); @@ -674,8 +545,7 @@ static void send_rc_ack(struct ipath_qp /* * If we can send the ACK, clear the ACK state. */ - if (ipath_verbs_send(dev->dd, qp->s_hdrwords, (u32 *) &qp->s_hdr, - 0, NULL) == 0) { + if (ipath_verbs_send(dev->dd, hwords, (u32 *) &hdr, 0, NULL) == 0) { qp->s_ack_state = OP(ACKNOWLEDGE); dev->n_rc_qacks++; dev->n_unicast_xmit++; @@ -805,7 +675,7 @@ bail: * @qp: the QP * @psn: the packet sequence number to restart at * - * This is called from ipath_rc_rcv() to process an incoming RC ACK + * This is called from ipath_rc_rcv_resp() to process an incoming RC ACK * for the given QP. * Called at interrupt level with the QP s_lock held. 
*/ @@ -1231,18 +1101,12 @@ static inline void ipath_rc_rcv_resp(str * ICRC (4). */ if (unlikely(tlen <= (hdrsize + pad + 8))) { - /* - * XXX Need to generate an error CQ - * entry. - */ + /* XXX Need to generate an error CQ entry. */ goto ack_done; } tlen -= hdrsize + pad + 8; if (unlikely(tlen != qp->s_len)) { - /* - * XXX Need to generate an error CQ - * entry. - */ + /* XXX Need to generate an error CQ entry. */ goto ack_done; } if (!header_in_data) @@ -1384,7 +1248,7 @@ static inline int ipath_rc_rcv_error(str case OP(COMPARE_SWAP): case OP(FETCH_ADD): /* - * Check for the PSN of the last atomic operations + * Check for the PSN of the last atomic operation * performed and resend the result if found. */ if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) { @@ -1454,11 +1318,6 @@ void ipath_rc_rcv(struct ipath_ibdev *de } else psn = be32_to_cpu(ohdr->bth[2]); } - /* - * The opcode is in the low byte when its in network order - * (top byte when in host order). - */ - opcode = be32_to_cpu(ohdr->bth[0]) >> 24; /* * Process responses (ACKs) before anything else. Note that the @@ -1466,6 +1325,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de * queue rather than the expected receive packet sequence number. * In other words, this QP is the requester. */ + opcode = be32_to_cpu(ohdr->bth[0]) >> 24; if (opcode >= OP(RDMA_READ_RESPONSE_FIRST) && opcode <= OP(ATOMIC_ACKNOWLEDGE)) { ipath_rc_rcv_resp(dev, ohdr, data, tlen, qp, opcode, psn, diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:25 2006 -0700 @@ -32,6 +32,7 @@ */ #include "ipath_verbs.h" +#include "ips_common.h" /* * Convert the AETH RNR timeout code into the number of milliseconds. 
@@ -188,7 +189,6 @@ bail: /** * ipath_ruc_loopback - handle UC and RC lookback requests * @sqp: the loopback QP - * @wc: the work completion entry * * This is called from ipath_do_uc_send() or ipath_do_rc_send() to * forward a WQE addressed to the same HCA. @@ -197,13 +197,14 @@ bail: * receive interrupts since this is a connected protocol and all packets * will pass through here. */ -void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc) +static void ipath_ruc_loopback(struct ipath_qp *sqp) { struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); struct ipath_qp *qp; struct ipath_swqe *wqe; struct ipath_sge *sge; unsigned long flags; + struct ib_wc wc; u64 sdata; qp = ipath_lookup_qpn(&dev->qp_table, sqp->remote_qpn); @@ -234,8 +235,8 @@ again: wqe = get_swqe_ptr(sqp, sqp->s_last); spin_unlock_irqrestore(&sqp->s_lock, flags); - wc->wc_flags = 0; - wc->imm_data = 0; + wc.wc_flags = 0; + wc.imm_data = 0; sqp->s_sge.sge = wqe->sg_list[0]; sqp->s_sge.sg_list = wqe->sg_list + 1; @@ -243,8 +244,8 @@ again: sqp->s_len = wqe->length; switch (wqe->wr.opcode) { case IB_WR_SEND_WITH_IMM: - wc->wc_flags = IB_WC_WITH_IMM; - wc->imm_data = wqe->wr.imm_data; + wc.wc_flags = IB_WC_WITH_IMM; + wc.imm_data = wqe->wr.imm_data; /* FALLTHROUGH */ case IB_WR_SEND: spin_lock_irqsave(&qp->r_rq.lock, flags); @@ -255,7 +256,7 @@ again: if (qp->ibqp.qp_type == IB_QPT_UC) goto send_comp; if (sqp->s_rnr_retry == 0) { - wc->status = IB_WC_RNR_RETRY_EXC_ERR; + wc.status = IB_WC_RNR_RETRY_EXC_ERR; goto err; } if (sqp->s_rnr_retry_cnt < 7) @@ -270,8 +271,8 @@ again: break; case IB_WR_RDMA_WRITE_WITH_IMM: - wc->wc_flags = IB_WC_WITH_IMM; - wc->imm_data = wqe->wr.imm_data; + wc.wc_flags = IB_WC_WITH_IMM; + wc.imm_data = wqe->wr.imm_data; spin_lock_irqsave(&qp->r_rq.lock, flags); if (!ipath_get_rwqe(qp, 1)) goto rnr_nak; @@ -285,20 +286,20 @@ again: wqe->wr.wr.rdma.rkey, IB_ACCESS_REMOTE_WRITE))) { acc_err: - wc->status = IB_WC_REM_ACCESS_ERR; + wc.status = IB_WC_REM_ACCESS_ERR; err: - 
wc->wr_id = wqe->wr.wr_id; - wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc->vendor_err = 0; - wc->byte_len = 0; - wc->qp_num = sqp->ibqp.qp_num; - wc->src_qp = sqp->remote_qpn; - wc->pkey_index = 0; - wc->slid = sqp->remote_ah_attr.dlid; - wc->sl = sqp->remote_ah_attr.sl; - wc->dlid_path_bits = 0; - wc->port_num = 0; - ipath_sqerror_qp(sqp, wc); + wc.wr_id = wqe->wr.wr_id; + wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.qp_num = sqp->ibqp.qp_num; + wc.src_qp = sqp->remote_qpn; + wc.pkey_index = 0; + wc.slid = sqp->remote_ah_attr.dlid; + wc.sl = sqp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_sqerror_qp(sqp, &wc); goto done; } break; @@ -374,22 +375,22 @@ again: goto send_comp; if (wqe->wr.opcode == IB_WR_RDMA_WRITE_WITH_IMM) - wc->opcode = IB_WC_RECV_RDMA_WITH_IMM; + wc.opcode = IB_WC_RECV_RDMA_WITH_IMM; else - wc->opcode = IB_WC_RECV; - wc->wr_id = qp->r_wr_id; - wc->status = IB_WC_SUCCESS; - wc->vendor_err = 0; - wc->byte_len = wqe->length; - wc->qp_num = qp->ibqp.qp_num; - wc->src_qp = qp->remote_qpn; + wc.opcode = IB_WC_RECV; + wc.wr_id = qp->r_wr_id; + wc.status = IB_WC_SUCCESS; + wc.vendor_err = 0; + wc.byte_len = wqe->length; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; /* XXX do we know which pkey matched? Only needed for GSI. */ - wc->pkey_index = 0; - wc->slid = qp->remote_ah_attr.dlid; - wc->sl = qp->remote_ah_attr.sl; - wc->dlid_path_bits = 0; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; /* Signal completion event if the solicited bit is set. 
*/ - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, wqe->wr.send_flags & IB_SEND_SOLICITED); send_comp: @@ -397,19 +398,19 @@ send_comp: if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &sqp->s_flags) || (wqe->wr.send_flags & IB_SEND_SIGNALED)) { - wc->wr_id = wqe->wr.wr_id; - wc->status = IB_WC_SUCCESS; - wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc->vendor_err = 0; - wc->byte_len = wqe->length; - wc->qp_num = sqp->ibqp.qp_num; - wc->src_qp = 0; - wc->pkey_index = 0; - wc->slid = 0; - wc->sl = 0; - wc->dlid_path_bits = 0; - wc->port_num = 0; - ipath_cq_enter(to_icq(sqp->ibqp.send_cq), wc, 0); + wc.wr_id = wqe->wr.wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = wqe->length; + wc.qp_num = sqp->ibqp.qp_num; + wc.src_qp = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_cq_enter(to_icq(sqp->ibqp.send_cq), &wc, 0); } /* Update s_last now that we are finished with the SWQE */ @@ -455,11 +456,11 @@ void ipath_no_bufs_available(struct ipat } /** - * ipath_post_rc_send - post RC and UC sends + * ipath_post_ruc_send - post RC and UC sends * @qp: the QP to post on * @wr: the work request to send */ -int ipath_post_rc_send(struct ipath_qp *qp, struct ib_send_wr *wr) +int ipath_post_ruc_send(struct ipath_qp *qp, struct ib_send_wr *wr) { struct ipath_swqe *wqe; unsigned long flags; @@ -534,13 +535,149 @@ int ipath_post_rc_send(struct ipath_qp * qp->s_head = next; spin_unlock_irqrestore(&qp->s_lock, flags); - if (qp->ibqp.qp_type == IB_QPT_UC) - ipath_do_uc_send((unsigned long) qp); - else - ipath_do_rc_send((unsigned long) qp); + ipath_do_ruc_send((unsigned long) qp); ret = 0; bail: return ret; } + +/** + * ipath_make_grh - construct a GRH header + * @dev: a pointer to the ipath device + * @hdr: a pointer to the GRH header being constructed + * @grh: the global route address to send to + * @hwords: the 
number of 32 bit words of header being sent + * @nwords: the number of 32 bit words of data being sent + * + * Return the size of the header in 32 bit words. + */ +u32 ipath_make_grh(struct ipath_ibdev *dev, struct ib_grh *hdr, + struct ib_global_route *grh, u32 hwords, u32 nwords) +{ + hdr->version_tclass_flow = + cpu_to_be32((6 << 28) | + (grh->traffic_class << 20) | + grh->flow_label); + hdr->paylen = cpu_to_be16((hwords - 2 + nwords + SIZE_OF_CRC) << 2); + /* next_hdr is defined by C8-7 in ch. 8.4.1 */ + hdr->next_hdr = 0x1B; + hdr->hop_limit = grh->hop_limit; + /* The SGID is 32-bit aligned. */ + hdr->sgid.global.subnet_prefix = dev->gid_prefix; + hdr->sgid.global.interface_id = ipath_layer_get_guid(dev->dd); + hdr->dgid = grh->dgid; + + /* GRH header size in 32-bit words. */ + return sizeof(struct ib_grh) / sizeof(u32); +} + +/** + * ipath_do_ruc_send - perform a send on an RC or UC QP + * @data: contains a pointer to the QP + * + * Process entries in the send work queue until credit or queue is + * exhausted. Only allow one CPU to send a packet per QP (tasklet). + * Otherwise, after we drop the QP s_lock, two threads could send + * packets out of order. + */ +void ipath_do_ruc_send(unsigned long data) +{ + struct ipath_qp *qp = (struct ipath_qp *)data; + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + unsigned long flags; + u16 lrh0; + u32 nwords; + u32 extra_bytes; + u32 bth0; + u32 bth2; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + struct ipath_other_headers *ohdr; + + if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) + goto bail; + + if (unlikely(qp->remote_ah_attr.dlid == + ipath_layer_get_lid(dev->dd))) { + ipath_ruc_loopback(qp); + goto clear; + } + + ohdr = &qp->s_hdr.u.oth; + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) + ohdr = &qp->s_hdr.u.l.oth; + +again: + /* Check for a constructed packet to be sent. */ + if (qp->s_hdrwords != 0) { + /* + * If no PIO bufs are available, return. 
An interrupt will + * call ipath_ib_piobufavail() when one is available. + */ + if (ipath_verbs_send(dev->dd, qp->s_hdrwords, + (u32 *) &qp->s_hdr, qp->s_cur_size, + qp->s_cur_sge)) { + ipath_no_bufs_available(qp, dev); + goto bail; + } + dev->n_unicast_xmit++; + /* Record that we sent the packet and s_hdr is empty. */ + qp->s_hdrwords = 0; + } + + /* + * The lock is needed to synchronize between setting + * qp->s_ack_state, resend timer, and post_send(). + */ + spin_lock_irqsave(&qp->s_lock, flags); + + /* Sending responses has higher priority over sending requests. */ + if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE && + (bth0 = ipath_make_rc_ack(qp, ohdr, pmtu)) != 0) + bth2 = qp->s_ack_psn++ & IPS_PSN_MASK; + else if (!((qp->ibqp.qp_type == IB_QPT_RC) ? + ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2) : + ipath_make_uc_req(qp, ohdr, pmtu, &bth0, &bth2))) { + /* + * Clear the busy bit before unlocking to avoid races with + * adding new work queue items and then failing to process + * them. + */ + clear_bit(IPATH_S_BUSY, &qp->s_flags); + spin_unlock_irqrestore(&qp->s_lock, flags); + goto bail; + } + + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Construct the header. 
*/ + extra_bytes = (4 - qp->s_cur_size) & 3; + nwords = (qp->s_cur_size + extra_bytes) >> 2; + lrh0 = IPS_LRH_BTH; + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { + qp->s_hdrwords += ipath_make_grh(dev, &qp->s_hdr.u.l.grh, + &qp->remote_ah_attr.grh, + qp->s_hdrwords, nwords); + lrh0 = IPS_LRH_GRH; + } + lrh0 |= qp->remote_ah_attr.sl << 4; + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + nwords + + SIZE_OF_CRC); + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); + bth0 |= ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); + bth0 |= extra_bytes << 20; + ohdr->bth[0] = cpu_to_be32(bth0); + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); + ohdr->bth[2] = cpu_to_be32(bth2); + + /* Check for more work to do. */ + goto again; + +clear: + clear_bit(IPATH_S_BUSY, &qp->s_flags); +bail: + return; +} diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_uc.c --- a/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:25 2006 -0700 @@ -62,90 +62,40 @@ static void complete_last_send(struct ip } /** - * ipath_do_uc_send - do a send on a UC queue - * @data: contains a pointer to the QP to send on - * - * Process entries in the send work queue until the queue is exhausted. - * Only allow one CPU to send a packet per QP (tasklet). - * Otherwise, after we drop the QP lock, two threads could send - * packets out of order. - * This is similar to ipath_do_rc_send() below except we don't have - * timeouts or resends. + * ipath_make_uc_req - construct a request packet (SEND, RDMA write) + * @qp: a pointer to the QP + * @ohdr: a pointer to the IB header being constructed + * @pmtu: the path MTU + * @bth0p: pointer to the BTH opcode word + * @bth2p: pointer to the BTH PSN word + * + * Return 1 if constructed; otherwise, return 0. 
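The IPATH_S_BUSY test_and_set_bit() guard that both old send routines used, and that ipath_do_ruc_send() keeps, is a single-sender gate: whichever context wins the bit gets to build and post packets for the QP; every other context backs off, so packets cannot go out of order once s_lock is dropped. A user-space sketch of the idea (fake_qp and try_send are invented stand-ins, with a C11 atomic_flag in place of the kernel's bit operations):

```c
#include <stdatomic.h>

struct fake_qp {
	atomic_flag busy;	/* stands in for IPATH_S_BUSY */
	int packets_sent;
};

/* Returns 1 if this caller won the right to send, 0 if another
 * context already holds the busy bit and is sending. */
static int try_send(struct fake_qp *qp)
{
	if (atomic_flag_test_and_set(&qp->busy))
		return 0;		/* someone else is sending */
	qp->packets_sent++;		/* ...build and post packets... */
	atomic_flag_clear(&qp->busy);
	return 1;
}
```

Note how the patch also clears the bit before unlocking in the no-work path, to avoid racing with post_send() adding new work that would then never be processed.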
+ * Note the QP s_lock must be held and interrupts disabled.
  */
-void ipath_do_uc_send(unsigned long data)
+int ipath_make_uc_req(struct ipath_qp *qp,
+		      struct ipath_other_headers *ohdr,
+		      u32 pmtu, u32 *bth0p, u32 *bth2p)
 {
-	struct ipath_qp *qp = (struct ipath_qp *)data;
-	struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
 	struct ipath_swqe *wqe;
-	unsigned long flags;
-	u16 lrh0;
 	u32 hwords;
-	u32 nwords;
-	u32 extra_bytes;
 	u32 bth0;
-	u32 bth2;
-	u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu);
 	u32 len;
-	struct ipath_other_headers *ohdr;
 	struct ib_wc wc;

-	if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags))
-		goto bail;
-
-	if (unlikely(qp->remote_ah_attr.dlid ==
-		     ipath_layer_get_lid(dev->dd))) {
-		/* Pass in an uninitialized ib_wc to save stack space. */
-		ipath_ruc_loopback(qp, &wc);
-		clear_bit(IPATH_S_BUSY, &qp->s_flags);
-		goto bail;
-	}
-
-	ohdr = &qp->s_hdr.u.oth;
-	if (qp->remote_ah_attr.ah_flags & IB_AH_GRH)
-		ohdr = &qp->s_hdr.u.l.oth;
-
-again:
-	/* Check for a constructed packet to be sent. */
-	if (qp->s_hdrwords != 0) {
-		/*
-		 * If no PIO bufs are available, return.
-		 * An interrupt will call ipath_ib_piobufavail()
-		 * when one is available.
-		 */
-		if (ipath_verbs_send(dev->dd, qp->s_hdrwords,
-				     (u32 *) &qp->s_hdr,
-				     qp->s_cur_size,
-				     qp->s_cur_sge)) {
-			ipath_no_bufs_available(qp, dev);
-			goto bail;
-		}
-		dev->n_unicast_xmit++;
-		/* Record that we sent the packet and s_hdr is empty. */
-		qp->s_hdrwords = 0;
-	}
-
-	lrh0 = IPS_LRH_BTH;
+	if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK))
+		goto done;
+
 	/* header size in 32-bit words LRH+BTH = (8+12)/4. */
 	hwords = 5;
-
-	/*
-	 * The lock is needed to synchronize between
-	 * setting qp->s_ack_state and post_send().
-	 */
-	spin_lock_irqsave(&qp->s_lock, flags);
-
-	if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK))
-		goto done;
-
-	bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index);
-
-	/* Send a request. */
+	bth0 = 0;
+
+	/* Get the next send request. */
 	wqe = get_swqe_ptr(qp, qp->s_last);
 	switch (qp->s_state) {
 	default:
 		/*
-		 * Signal the completion of the last send (if there is
-		 * one).
+		 * Signal the completion of the last send
+		 * (if there is one).
 		 */
 		if (qp->s_last != qp->s_tail)
 			complete_last_send(qp, wqe, &wc);
@@ -258,61 +208,16 @@ again:
 		}
 		break;
 	}
-	bth2 = qp->s_next_psn++ & IPS_PSN_MASK;
 	qp->s_len -= len;
-	bth0 |= qp->s_state << 24;
-
-	spin_unlock_irqrestore(&qp->s_lock, flags);
-
-	/* Construct the header. */
-	extra_bytes = (4 - len) & 3;
-	nwords = (len + extra_bytes) >> 2;
-	if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) {
-		/* Header size in 32-bit words. */
-		hwords += 10;
-		lrh0 = IPS_LRH_GRH;
-		qp->s_hdr.u.l.grh.version_tclass_flow =
-			cpu_to_be32((6 << 28) |
-				    (qp->remote_ah_attr.grh.traffic_class
-				     << 20) |
-				    qp->remote_ah_attr.grh.flow_label);
-		qp->s_hdr.u.l.grh.paylen =
-			cpu_to_be16(((hwords - 12) + nwords +
-				     SIZE_OF_CRC) << 2);
-		/* next_hdr is defined by C8-7 in ch. 8.4.1 */
-		qp->s_hdr.u.l.grh.next_hdr = 0x1B;
-		qp->s_hdr.u.l.grh.hop_limit =
-			qp->remote_ah_attr.grh.hop_limit;
-		/* The SGID is 32-bit aligned. */
-		qp->s_hdr.u.l.grh.sgid.global.subnet_prefix =
-			dev->gid_prefix;
-		qp->s_hdr.u.l.grh.sgid.global.interface_id =
-			ipath_layer_get_guid(dev->dd);
-		qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid;
-	}
 	qp->s_hdrwords = hwords;
 	qp->s_cur_sge = &qp->s_sge;
 	qp->s_cur_size = len;
-	lrh0 |= qp->remote_ah_attr.sl << 4;
-	qp->s_hdr.lrh[0] = cpu_to_be16(lrh0);
-	/* DEST LID */
-	qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid);
-	qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC);
-	qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd));
-	bth0 |= extra_bytes << 20;
-	ohdr->bth[0] = cpu_to_be32(bth0);
-	ohdr->bth[1] = cpu_to_be32(qp->remote_qpn);
-	ohdr->bth[2] = cpu_to_be32(bth2);
-
-	/* Check for more work to do. */
-	goto again;
+	*bth0p = bth0 | (qp->s_state << 24);
+	*bth2p = qp->s_next_psn++ & IPS_PSN_MASK;
+	return 1;

 done:
-	spin_unlock_irqrestore(&qp->s_lock, flags);
-	clear_bit(IPATH_S_BUSY, &qp->s_flags);
-
-bail:
-	return;
+	return 0;
 }

 /**
@@ -536,12 +441,13 @@ void ipath_uc_rcv(struct ipath_ibdev *de
 		if (qp->r_len != 0) {
 			u32 rkey = be32_to_cpu(reth->rkey);
 			u64 vaddr = be64_to_cpu(reth->vaddr);
+			int ok;

 			/* Check rkey */
-			if (unlikely(!ipath_rkey_ok(
-					     dev, &qp->r_sge, qp->r_len,
-					     vaddr, rkey,
-					     IB_ACCESS_REMOTE_WRITE))) {
+			ok = ipath_rkey_ok(dev, &qp->r_sge, qp->r_len,
+					   vaddr, rkey,
+					   IB_ACCESS_REMOTE_WRITE);
+			if (unlikely(!ok)) {
 				dev->n_pkt_drops++;
 				goto done;
 			}
@@ -559,8 +465,7 @@ void ipath_uc_rcv(struct ipath_ibdev *de
 		}
 		if (opcode == OP(RDMA_WRITE_ONLY))
 			goto rdma_last;
-		else if (opcode ==
-			 OP(RDMA_WRITE_ONLY_WITH_IMMEDIATE))
+		else if (opcode == OP(RDMA_WRITE_ONLY_WITH_IMMEDIATE))
 			goto rdma_last_imm;
 		/* FALLTHROUGH */
 	case OP(RDMA_WRITE_MIDDLE):
@@ -593,9 +498,9 @@ void ipath_uc_rcv(struct ipath_ibdev *de
 			dev->n_pkt_drops++;
 			goto done;
 		}
-		if (qp->r_reuse_sge) {
+		if (qp->r_reuse_sge)
 			qp->r_reuse_sge = 0;
-		} else if (!ipath_get_rwqe(qp, 1)) {
+		else if (!ipath_get_rwqe(qp, 1)) {
 			dev->n_pkt_drops++;
 			goto done;
 		}
diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -194,7 +194,7 @@ static int ipath_post_send(struct ib_qp
 	switch (qp->ibqp.qp_type) {
 	case IB_QPT_UC:
 	case IB_QPT_RC:
-		err = ipath_post_rc_send(qp, wr);
+		err = ipath_post_ruc_send(qp, wr);
 		break;

 	case IB_QPT_SMI:
diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_verbs.h
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h	Thu Jun 29 14:33:25 2006 -0700
@@ -581,10 +581,6 @@ void ipath_sqerror_qp(struct ipath_qp *q
 void ipath_get_credit(struct ipath_qp *qp, u32 aeth);

-void ipath_do_rc_send(unsigned long data);
-
-void ipath_do_uc_send(unsigned long data);
-
 void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int sig);

 int ipath_rkey_ok(struct ipath_ibdev *dev, struct ipath_sge_state *ss,
@@ -597,7 +593,7 @@ void ipath_copy_sge(struct ipath_sge_sta
 void ipath_skip_sge(struct ipath_sge_state *ss, u32 length);

-int ipath_post_rc_send(struct ipath_qp *qp, struct ib_send_wr *wr);
+int ipath_post_ruc_send(struct ipath_qp *qp, struct ib_send_wr *wr);

 void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 		  int has_grh, void *data, u32 tlen, struct ipath_qp *qp);
@@ -679,7 +675,19 @@ void ipath_insert_rnr_queue(struct ipath
 int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only);

-void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc);
+u32 ipath_make_grh(struct ipath_ibdev *dev, struct ib_grh *hdr,
+		   struct ib_global_route *grh, u32 hwords, u32 nwords);
+
+void ipath_do_ruc_send(unsigned long data);
+
+u32 ipath_make_rc_ack(struct ipath_qp *qp, struct ipath_other_headers *ohdr,
+		      u32 pmtu);
+
+int ipath_make_rc_req(struct ipath_qp *qp, struct ipath_other_headers *ohdr,
+		      u32 pmtu, u32 *bth0p, u32 *bth2p);
+
+int ipath_make_uc_req(struct ipath_qp *qp, struct ipath_other_headers *ohdr,
+		      u32 pmtu, u32 *bth0p, u32 *bth2p);

 extern const enum ib_wc_opcode ib_ipath_wc_opcode[];

From bos at pathscale.com  Thu Jun 29 14:40:53 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:40:53 -0700
Subject: [openib-general] [PATCH 2 of 39] IB/ipath - update copyrights and other strings to reflect new company name
In-Reply-To: 
Message-ID: 

Signed-off-by: Bryan O'Sullivan 

diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/Kconfig
--- a/drivers/infiniband/hw/ipath/Kconfig	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/Kconfig	Thu Jun 29 14:33:25 2006 -0700
@@ -1,16 +1,16 @@ config IPATH_CORE
 config IPATH_CORE
-	tristate "PathScale InfiniPath Driver"
+	tristate "QLogic InfiniPath Driver"
 	depends on 64BIT && PCI_MSI && NET
 	---help---
-	This is a low-level driver for PathScale InfiniPath host channel
+	This is a low-level driver for QLogic InfiniPath host channel
 	adapters (HCAs) based on the HT-400 and PE-800 chips.

 config INFINIBAND_IPATH
-	tristate "PathScale InfiniPath Verbs Driver"
+	tristate "QLogic InfiniPath Verbs Driver"
 	depends on IPATH_CORE && INFINIBAND
 	---help---
 	This is a driver that provides InfiniBand verbs support for
-	PathScale InfiniPath host channel adapters (HCAs). This
+	QLogic InfiniPath host channel adapters (HCAs). This
 	allows these devices to be used with both kernel upper level
 	protocols such as IP-over-InfiniBand as well as with userspace
 	applications (in conjunction with InfiniBand userspace access).
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/Makefile
--- a/drivers/infiniband/hw/ipath/Makefile	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/Makefile	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,4 @@ EXTRA_CFLAGS += -DIPATH_IDSTR='"PathScal
-EXTRA_CFLAGS += -DIPATH_IDSTR='"PathScale kernel.org driver"' \
+EXTRA_CFLAGS += -DIPATH_IDSTR='"QLogic kernel.org driver"' \
 	-DIPATH_KERN_TYPE=0

 obj-$(CONFIG_IPATH_CORE) += ipath_core.o
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_common.h
--- a/drivers/infiniband/hw/ipath/ipath_common.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_common.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
@@ -38,7 +39,7 @@
 * to communicate between kernel and user code.
 */

-/* This is the IEEE-assigned OUI for PathScale, Inc. */
+/* This is the IEEE-assigned OUI for QLogic, Inc. InfiniPath */
 #define IPATH_SRC_OUI_1 0x00
 #define IPATH_SRC_OUI_2 0x11
 #define IPATH_SRC_OUI_3 0x75
@@ -342,9 +343,9 @@ struct ipath_base_info {
 /*
 * Similarly, this is the kernel version going back to the user. It's
 * slightly different, in that we want to tell if the driver was built as
- * part of a PathScale release, or from the driver from OpenIB, kernel.org,
+ * part of a QLogic release, or from the driver from OpenIB, kernel.org,
 * or a standard distribution, for support reasons. The high bit is 0 for
- * non-PathScale, and 1 for PathScale-built/supplied.
+ * non-QLogic, and 1 for QLogic-built/supplied.
 *
 * It's returned by the driver to the user code during initialization in the
 * spi_sw_version field of ipath_base_info, so the user code can in turn
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_cq.c
--- a/drivers/infiniband/hw/ipath/ipath_cq.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_debug.h
--- a/drivers/infiniband/hw/ipath/ipath_debug.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_debug.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_diag.c
--- a/drivers/infiniband/hw/ipath/ipath_diag.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_diag.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_driver.c
--- a/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
@@ -52,7 +53,7 @@ const char *ipath_get_unit_name(int unit
 EXPORT_SYMBOL_GPL(ipath_get_unit_name);

-#define DRIVER_LOAD_MSG "PathScale " IPATH_DRV_NAME " loaded: "
+#define DRIVER_LOAD_MSG "QLogic " IPATH_DRV_NAME " loaded: "
 #define PFX IPATH_DRV_NAME ": "

 /*
@@ -74,8 +75,8 @@ EXPORT_SYMBOL_GPL(ipath_debug);
 EXPORT_SYMBOL_GPL(ipath_debug);

 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("PathScale ");
-MODULE_DESCRIPTION("Pathscale InfiniPath driver");
+MODULE_AUTHOR("QLogic ");
+MODULE_DESCRIPTION("QLogic InfiniPath driver");

 const char *ipath_ibcstatus_str[] = {
 	"Disabled",
@@ -452,7 +453,7 @@ static int __devinit ipath_init_one(stru
 		ipath_init_pe800_funcs(dd);
 		break;
 	default:
-		ipath_dev_err(dd, "Found unknown PathScale deviceid 0x%x, "
+		ipath_dev_err(dd, "Found unknown QLogic deviceid 0x%x, "
 			      "failing\n", ent->device);
 		return -ENODEV;
 	}
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_eeprom.c
--- a/drivers/infiniband/hw/ipath/ipath_eeprom.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_file_ops.c
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_fs.c
--- a/drivers/infiniband/hw/ipath/ipath_fs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_fs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_ht400.c
--- a/drivers/infiniband/hw/ipath/ipath_ht400.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_ht400.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_init_chip.c
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_intr.c
--- a/drivers/infiniband/hw/ipath/ipath_intr.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_kernel.h
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,6 +1,7 @@
 #ifndef _IPATH_KERNEL_H
 #define _IPATH_KERNEL_H
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_keys.c
--- a/drivers/infiniband/hw/ipath/ipath_keys.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_keys.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_layer.c
--- a/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_layer.h
--- a/drivers/infiniband/hw/ipath/ipath_layer.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_mad.c
--- a/drivers/infiniband/hw/ipath/ipath_mad.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_mad.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_mr.c
--- a/drivers/infiniband/hw/ipath/ipath_mr.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_pe800.c
--- a/drivers/infiniband/hw/ipath/ipath_pe800.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_pe800.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
@@ -44,7 +45,7 @@
 /*
 * This file contains all the chip-specific register information and
- * access functions for the PathScale PE800, the PCI-Express chip.
+ * access functions for the QLogic InfiniPath PE800, the PCI-Express chip.
 *
 * This lists the InfiniPath PE800 registers, in the actual chip layout.
 * This structure should never be directly accessed.
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_rc.c
--- a/drivers/infiniband/hw/ipath/ipath_rc.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_registers.h
--- a/drivers/infiniband/hw/ipath/ipath_registers.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_registers.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_ruc.c
--- a/drivers/infiniband/hw/ipath/ipath_ruc.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_ruc.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_srq.c
--- a/drivers/infiniband/hw/ipath/ipath_srq.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_srq.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_stats.c
--- a/drivers/infiniband/hw/ipath/ipath_stats.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_stats.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_sysfs.c
--- a/drivers/infiniband/hw/ipath/ipath_sysfs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_uc.c
--- a/drivers/infiniband/hw/ipath/ipath_uc.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_uc.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_ud.c
--- a/drivers/infiniband/hw/ipath/ipath_ud.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_ud.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_user_pages.c
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
@@ -56,8 +57,8 @@ MODULE_PARM_DESC(debug, "Verbs debug mas
 MODULE_PARM_DESC(debug, "Verbs debug mask");

 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("PathScale ");
-MODULE_DESCRIPTION("Pathscale InfiniPath driver");
+MODULE_AUTHOR("QLogic ");
+MODULE_DESCRIPTION("QLogic InfiniPath driver");

 const int ib_ipath_state_ops[IB_QPS_ERR + 1] = {
 	[IB_QPS_RESET] = 0,
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_verbs.h
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_verbs_mcast.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_wc_x86_64.c
--- a/drivers/infiniband/hw/ipath/ipath_wc_x86_64.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_wc_x86_64.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ips_common.h
--- a/drivers/infiniband/hw/ipath/ips_common.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ips_common.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,6 +1,7 @@
 #ifndef IPS_COMMON_H
 #define IPS_COMMON_H
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/verbs_debug.h
--- a/drivers/infiniband/hw/ipath/verbs_debug.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/verbs_debug.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two

From bos at pathscale.com  Thu Jun 29 14:41:00 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:00 -0700
Subject: [openib-general] [PATCH 9 of 39] IB/ipath - don't allow resources to be created with illegal values
In-Reply-To: 
Message-ID: 

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 081142011371 -r ac81d2563bba drivers/infiniband/hw/ipath/ipath_mr.c
--- a/drivers/infiniband/hw/ipath/ipath_mr.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c	Thu Jun 29 14:33:25 2006 -0700
@@ -169,6 +169,11 @@ struct ib_mr *ipath_reg_user_mr(struct i
 	struct ib_umem_chunk *chunk;
 	int n, m, i;
 	struct ib_mr *ret;
+
+	if (region->length == 0) {
+		ret = ERR_PTR(-EINVAL);
+		goto bail;
+	}

 	n = 0;
 	list_for_each_entry(chunk, &region->chunk_list, list)
diff -r 081142011371 -r ac81d2563bba drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
@@ -667,6 +667,14 @@ struct ib_qp *ipath_create_qp(struct ib_
 		goto bail;
 	}
+
+	if (init_attr->cap.max_send_sge +
+	    init_attr->cap.max_recv_sge +
+	    init_attr->cap.max_send_wr +
+	    init_attr->cap.max_recv_wr == 0) {
+		ret = ERR_PTR(-EINVAL);
+		goto bail;
+	}

 	switch (init_attr->qp_type) {
 	case IB_QPT_UC:
 	case IB_QPT_RC:
diff -r 081142011371 -r ac81d2563bba drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -788,6 +788,17 @@ static struct ib_ah *ipath_create_ah(str
 	if (ah_attr->dlid >= IPS_MULTICAST_LID_BASE &&
 	    ah_attr->dlid != IPS_PERMISSIVE_LID &&
 	    !(ah_attr->ah_flags & IB_AH_GRH)) {
+		ret = ERR_PTR(-EINVAL);
+		goto bail;
+	}
+
+	if (ah_attr->dlid == 0) {
+		ret = ERR_PTR(-EINVAL);
+		goto bail;
+	}
+
+	if (ah_attr->port_num != 1 ||
+	    ah_attr->port_num > pd->device->phys_port_cnt) {
 		ret = ERR_PTR(-EINVAL);
 		goto bail;
 	}

From bos at pathscale.com  Thu Jun 29 14:40:59 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:40:59 -0700
Subject: [openib-general] [PATCH 8 of 39] IB/ipath - remove some duplicate code
In-Reply-To: 
Message-ID: <08114201137114764a83.1151617259@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 8f08597cacd2 -r 081142011371 drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
@@ -511,9 +511,6 @@ int ipath_modify_qp(struct ib_qp *ibqp,
 	if (attr_mask & IB_QP_QKEY)
 		qp->qkey = attr->qkey;

-	if (attr_mask & IB_QP_PKEY_INDEX)
-		qp->s_pkey_index = attr->pkey_index;
-
 	qp->state = new_state;
 	spin_unlock(&qp->s_lock);
 	spin_unlock_irqrestore(&qp->r_rq.lock, flags);

From bos at pathscale.com  Thu Jun 29 14:41:01 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:01 -0700
Subject: [openib-general] [PATCH 10 of 39] IB/ipath - fix some memory leaks on failure paths
In-Reply-To: 
Message-ID: <160e5cf91761a2daf6db.1151617261@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r ac81d2563bba -r 160e5cf91761 drivers/infiniband/hw/ipath/ipath_init_chip.c
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c	Thu Jun 29 14:33:25 2006 -0700
@@ -115,6 +115,7 @@ static int create_port0_egr(struct ipath
 			  "eager TID %u\n", e);
 		while (e != 0)
 			dev_kfree_skb(skbs[--e]);
+		vfree(skbs);
 		ret = -ENOMEM;
 		goto bail;
 	}
diff -r ac81d2563bba -r 160e5cf91761 drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
@@ -692,6 +692,7 @@ struct ib_qp *ipath_create_qp(struct ib_
 	case IB_QPT_GSI:
 		qp = kmalloc(sizeof(*qp), GFP_KERNEL);
 		if (!qp) {
+			vfree(swq);
 			ret = ERR_PTR(-ENOMEM);
 			goto bail;
 		}
@@ -702,6 +703,7 @@ struct ib_qp *ipath_create_qp(struct ib_
 		qp->r_rq.wq = vmalloc(qp->r_rq.size * sz);
 		if (!qp->r_rq.wq) {
 			kfree(qp);
+			vfree(swq);
 			ret = ERR_PTR(-ENOMEM);
 			goto bail;
 		}

From bos at pathscale.com  Thu Jun 29 14:41:05 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:05 -0700
Subject: [openib-general] [PATCH 14 of 39] IB/ipath - removed unused field ipath_kregvirt from struct ipath_devdata
In-Reply-To: 
Message-ID: 

Signed-off-by: Dave Olson 
Signed-off-by: Bryan O'Sullivan 

diff -r a94e9f9c9c23 -r e43b4df874a9 drivers/infiniband/hw/ipath/ipath_driver.c
--- a/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Jun 29 14:33:25 2006 -0700
@@ -496,10 +496,8 @@ static int __devinit ipath_init_one(stru
 		((void __iomem *)dd->ipath_kregbase + len);
 	dd->ipath_physaddr = addr; /* used for io_remap, etc. */
 	/* for user mmap */
-	dd->ipath_kregvirt = (u64 __iomem *) phys_to_virt(addr);
-	ipath_cdbg(VERBOSE, "mapped io addr %llx to kregbase %p "
-		   "kregvirt %p\n", addr, dd->ipath_kregbase,
-		   dd->ipath_kregvirt);
+	ipath_cdbg(VERBOSE, "mapped io addr %llx to kregbase %p\n",
+		   addr, dd->ipath_kregbase);

 	/*
 	 * clear ipath_flags here instead of in ipath_init_chip as it is set
@@ -1809,7 +1807,6 @@ static void cleanup_device(struct ipath_
 	 * re-init
 	 */
 	dd->ipath_kregbase = NULL;
-	dd->ipath_kregvirt = NULL;
 	dd->ipath_uregbase = 0;
 	dd->ipath_sregbase = 0;
 	dd->ipath_cregbase = 0;
diff -r a94e9f9c9c23 -r e43b4df874a9 drivers/infiniband/hw/ipath/ipath_kernel.h
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
@@ -158,11 +158,6 @@ struct ipath_devdata {
 	unsigned long ipath_physaddr;
 	/* base of memory alloced for ipath_kregbase, for free */
 	u64 *ipath_kregalloc;
-	/*
-	 * version of kregbase that doesn't have high bits set (for 32 bit
-	 * programs, so mmap64 44 bit works)
-	 */
-	u64 __iomem *ipath_kregvirt;
 	/*
 	 * virtual address where port0 rcvhdrqtail updated for this unit.
 	 * only written to by the chip, not the driver.

From bos at pathscale.com  Thu Jun 29 14:40:58 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:40:58 -0700
Subject: [openib-general] [PATCH 7 of 39] IB/ipath - update some comments and fix typos
In-Reply-To: 
Message-ID: <8f08597cacd2a9dcea28.1151617258@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 600ceb6aeb8c -r 8f08597cacd2 drivers/infiniband/hw/ipath/ipath_kernel.h
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
@@ -723,13 +723,8 @@ u64 ipath_read_kreg64_port(const struct
 * @port: port number
 *
 * Return the contents of a register that is virtualized to be per port.
- * Prints a debug message and returns -1 on errors (not distinguishable from
- * valid contents at runtime; we may add a separate error variable at some
- * point).
- *
- * This is normally not used by the kernel, but may be for debugging, and
- * has a different implementation than user mode, which is why it's not in
- * _common.h.
+ * Returns -1 on errors (not distinguishable from valid contents at
+ * runtime; we may add a separate error variable at some point).
 */
 static inline u32 ipath_read_ureg32(const struct ipath_devdata *dd,
 				    ipath_ureg regno, int port)
diff -r 600ceb6aeb8c -r 8f08597cacd2 drivers/infiniband/hw/ipath/ipath_layer.c
--- a/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
@@ -885,7 +885,7 @@ static void copy_io(u32 __iomem *piobuf,
 /**
 * ipath_verbs_send - send a packet from the verbs layer
 * @dd: the infinipath device
- * @hdrwords: the number of works in the header
+ * @hdrwords: the number of words in the header
 * @hdr: the packet header
 * @len: the length of the packet in bytes
 * @ss: the SGE to send

From bos at pathscale.com  Thu Jun 29 14:41:02 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:02 -0700
Subject: [openib-general] [PATCH 11 of 39] IB/ipath - return an error for unknown multicast GID
In-Reply-To: 
Message-ID: <1e1f3da0e78d32f2a733.1151617262@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 160e5cf91761 -r 1e1f3da0e78d drivers/infiniband/hw/ipath/ipath_verbs_mcast.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c	Thu Jun 29 14:33:25 2006 -0700
@@ -273,7 +273,7 @@ int ipath_multicast_detach(struct ib_qp
 	while (1) {
 		if (n == NULL) {
 			spin_unlock_irqrestore(&mcast_lock, flags);
-			ret = 0;
+			ret = -EINVAL;
 			goto bail;
 		}

From bos at pathscale.com  Thu Jun 29 14:41:03 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:03 -0700
Subject: [openib-general] [PATCH 12 of 39] IB/ipath - report correct device identification information in /sys
In-Reply-To: 
Message-ID: <21d5d64750acfd45f537.1151617263@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 1e1f3da0e78d -r 21d5d64750ac drivers/infiniband/hw/ipath/ipath_layer.c
--- a/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
@@ -341,18 +341,26 @@ u32 ipath_layer_get_nguid(struct ipath_d
 EXPORT_SYMBOL_GPL(ipath_layer_get_nguid);

-int ipath_layer_query_device(struct ipath_devdata *dd, u32 * vendor,
-			     u32 * boardrev, u32 * majrev, u32 * minrev)
-{
-	*vendor = dd->ipath_vendorid;
-	*boardrev = dd->ipath_boardrev;
-	*majrev = dd->ipath_majrev;
-	*minrev = dd->ipath_minrev;
-
-	return 0;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_query_device);
+u32 ipath_layer_get_majrev(struct ipath_devdata *dd)
+{
+	return dd->ipath_majrev;
+}
+
+EXPORT_SYMBOL_GPL(ipath_layer_get_majrev);
+
+u32 ipath_layer_get_minrev(struct ipath_devdata *dd)
+{
+	return dd->ipath_minrev;
+}
+
+EXPORT_SYMBOL_GPL(ipath_layer_get_minrev);
+
+u32 ipath_layer_get_pcirev(struct ipath_devdata *dd)
+{
+	return dd->ipath_pcirev;
+}
+
+EXPORT_SYMBOL_GPL(ipath_layer_get_pcirev);

 u32 ipath_layer_get_flags(struct ipath_devdata *dd)
 {
@@ -374,6 +382,13 @@ u16 ipath_layer_get_deviceid(struct ipat
 }

 EXPORT_SYMBOL_GPL(ipath_layer_get_deviceid);
+
+u32 ipath_layer_get_vendorid(struct ipath_devdata *dd)
+{
+	return dd->ipath_vendorid;
+}
+
+EXPORT_SYMBOL_GPL(ipath_layer_get_vendorid);

 u64 ipath_layer_get_lastibcstat(struct ipath_devdata *dd)
 {
diff -r 1e1f3da0e78d -r 21d5d64750ac drivers/infiniband/hw/ipath/ipath_layer.h
--- a/drivers/infiniband/hw/ipath/ipath_layer.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.h	Thu Jun 29 14:33:25 2006 -0700
@@ -144,11 +144,13 @@ int ipath_layer_set_guid(struct ipath_de
 int ipath_layer_set_guid(struct ipath_devdata *, __be64 guid);
 __be64 ipath_layer_get_guid(struct ipath_devdata *);
 u32 ipath_layer_get_nguid(struct ipath_devdata *);
-int ipath_layer_query_device(struct ipath_devdata *, u32 * vendor,
-			     u32 * boardrev, u32 * majrev, u32 * minrev);
+u32 ipath_layer_get_majrev(struct ipath_devdata *);
+u32 ipath_layer_get_minrev(struct ipath_devdata *);
+u32 ipath_layer_get_pcirev(struct ipath_devdata *);
 u32 ipath_layer_get_flags(struct ipath_devdata *dd);
 struct device *ipath_layer_get_device(struct ipath_devdata *dd);
 u16 ipath_layer_get_deviceid(struct ipath_devdata *dd);
+u32 ipath_layer_get_vendorid(struct ipath_devdata *);
 u64 ipath_layer_get_lastibcstat(struct ipath_devdata *dd);
 u32 ipath_layer_get_ibmtu(struct ipath_devdata *dd);
 int ipath_layer_enable_timer(struct ipath_devdata *dd);
diff -r 1e1f3da0e78d -r 21d5d64750ac drivers/infiniband/hw/ipath/ipath_mad.c
--- a/drivers/infiniband/hw/ipath/ipath_mad.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_mad.c	Thu Jun 29 14:33:25 2006 -0700
@@ -85,7 +85,7 @@ static int recv_subn_get_nodeinfo(struct
 {
 	struct nodeinfo *nip = (struct nodeinfo *)&smp->data;
 	struct ipath_devdata *dd = to_idev(ibdev)->dd;
-	u32 vendor, boardid, majrev, minrev;
+	u32 vendor, majrev, minrev;

 	if (smp->attr_mod)
 		smp->status |= IB_SMP_INVALID_FIELD;
@@ -105,9 +105,11 @@ static int recv_subn_get_nodeinfo(struct
 	nip->port_guid = nip->sys_guid;
 	nip->partition_cap = cpu_to_be16(ipath_layer_get_npkeys(dd));
 	nip->device_id = cpu_to_be16(ipath_layer_get_deviceid(dd));
-	ipath_layer_query_device(dd, &vendor, &boardid, &majrev, &minrev);
+	majrev = ipath_layer_get_majrev(dd);
+	minrev = ipath_layer_get_minrev(dd);
 	nip->revision = cpu_to_be32((majrev << 16) | minrev);
 	nip->local_port_num = port;
+	vendor = ipath_layer_get_vendorid(dd);
 	nip->vendor_id[0] = 0;
 	nip->vendor_id[1] = vendor >> 8;
 	nip->vendor_id[2] = vendor;
diff -r 1e1f3da0e78d -r 21d5d64750ac drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -568,18 +568,15 @@ static int ipath_query_device(struct ib_
 			      struct ib_device_attr *props)
 {
 	struct ipath_ibdev *dev = to_idev(ibdev);
-	u32 vendor, boardrev, majrev, minrev;

 	memset(props, 0, sizeof(*props));
props->device_cap_flags = IB_DEVICE_BAD_PKEY_CNTR | IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT | IB_DEVICE_SYS_IMAGE_GUID; - ipath_layer_query_device(dev->dd, &vendor, &boardrev, - &majrev, &minrev); - props->vendor_id = vendor; - props->vendor_part_id = boardrev; - props->hw_ver = boardrev << 16 | majrev << 8 | minrev; + props->vendor_id = ipath_layer_get_vendorid(dev->dd); + props->vendor_part_id = ipath_layer_get_deviceid(dev->dd); + props->hw_ver = ipath_layer_get_pcirev(dev->dd); props->sys_image_guid = dev->sys_image_guid; @@ -1121,11 +1118,8 @@ static ssize_t show_rev(struct class_dev { struct ipath_ibdev *dev = container_of(cdev, struct ipath_ibdev, ibdev.class_dev); - int vendor, boardrev, majrev, minrev; - - ipath_layer_query_device(dev->dd, &vendor, &boardrev, - &majrev, &minrev); - return sprintf(buf, "%d.%d\n", majrev, minrev); + + return sprintf(buf, "%x\n", ipath_layer_get_pcirev(dev->dd)); } static ssize_t show_hca(struct class_device *cdev, char *buf) From bos at pathscale.com Thu Jun 29 14:41:04 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:04 -0700 Subject: [openib-general] [PATCH 13 of 39] IB/ipath - enforce device resource limits In-Reply-To: Message-ID: These limits are somewhat artificial in that we don't actually have any device limits. However, the verbs layer expects that such limits exist and are enforced, so we make up arbitrary (but sensible) limits. 
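The limit-enforcement pattern this patch introduces — a per-device counter checked against a configurable ceiling, with allocation failing as -ENOMEM once the advertised maximum is reached — can be sketched in userspace C. The struct and function names below are illustrative stand-ins, not the driver's actual types:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace sketch of the pattern added by this patch: count live
 * objects per device and refuse new allocations at a configurable
 * maximum, the way ipath_alloc_pd() checks n_pds_allocated against
 * ib_ipath_max_pds.  All names here are illustrative. */

static unsigned int max_pds = 4;        /* stands in for the module param */

struct fake_dev {
        unsigned int n_pds_allocated;
};

struct fake_pd {
        struct fake_dev *dev;
};

static struct fake_pd *alloc_pd(struct fake_dev *dev)
{
        /* Fail once the advertised maximum is reached, even though
         * nothing "physical" actually runs out. */
        if (dev->n_pds_allocated == max_pds) {
                errno = ENOMEM;
                return NULL;
        }
        struct fake_pd *pd = malloc(sizeof(*pd));
        if (!pd)
                return NULL;
        dev->n_pds_allocated++;
        pd->dev = dev;
        return pd;
}

static void dealloc_pd(struct fake_pd *pd)
{
        pd->dev->n_pds_allocated--;     /* mirror the dealloc path */
        free(pd);
}
```

The same counter/ceiling shape recurs throughout the patch for AHs, CQs, SRQs, and multicast groups; only the counter field and module parameter differ.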
Signed-off-by: Robert Walsh Signed-off-by: Bryan O'Sullivan diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_cq.c --- a/drivers/infiniband/hw/ipath/ipath_cq.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_cq.c Thu Jun 29 14:33:25 2006 -0700 @@ -158,9 +158,20 @@ struct ib_cq *ipath_create_cq(struct ib_ struct ib_ucontext *context, struct ib_udata *udata) { + struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_cq *cq; struct ib_wc *wc; struct ib_cq *ret; + + if (entries > ib_ipath_max_cqes) { + ret = ERR_PTR(-EINVAL); + goto bail; + } + + if (dev->n_cqs_allocated == ib_ipath_max_cqs) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } /* * Need to use vmalloc() if we want to support large #s of @@ -197,6 +208,8 @@ struct ib_cq *ipath_create_cq(struct ib_ ret = &cq->ibcq; + dev->n_cqs_allocated++; + bail: return ret; } @@ -211,9 +224,11 @@ bail: */ int ipath_destroy_cq(struct ib_cq *ibcq) { + struct ipath_ibdev *dev = to_idev(ibcq->device); struct ipath_cq *cq = to_icq(ibcq); tasklet_kill(&cq->comptask); + dev->n_cqs_allocated--; vfree(cq->queue); kfree(cq); diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:25 2006 -0700 @@ -661,8 +661,10 @@ struct ib_qp *ipath_create_qp(struct ib_ size_t sz; struct ib_qp *ret; - if (init_attr->cap.max_send_sge > 255 || - init_attr->cap.max_recv_sge > 255) { + if (init_attr->cap.max_send_sge > ib_ipath_max_sges || + init_attr->cap.max_recv_sge > ib_ipath_max_sges || + init_attr->cap.max_send_wr > ib_ipath_max_qp_wrs || + init_attr->cap.max_recv_wr > ib_ipath_max_qp_wrs) { ret = ERR_PTR(-ENOMEM); goto bail; } diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_srq.c --- a/drivers/infiniband/hw/ipath/ipath_srq.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_srq.c Thu Jun 29 14:33:25 2006 
-0700 @@ -126,11 +126,23 @@ struct ib_srq *ipath_create_srq(struct i struct ib_srq_init_attr *srq_init_attr, struct ib_udata *udata) { + struct ipath_ibdev *dev = to_idev(ibpd->device); struct ipath_srq *srq; u32 sz; struct ib_srq *ret; - if (srq_init_attr->attr.max_sge < 1) { + if (dev->n_srqs_allocated == ib_ipath_max_srqs) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } + + if (srq_init_attr->attr.max_wr == 0) { + ret = ERR_PTR(-EINVAL); + goto bail; + } + + if ((srq_init_attr->attr.max_sge > ib_ipath_max_srq_sges) || + (srq_init_attr->attr.max_wr > ib_ipath_max_srq_wrs)) { ret = ERR_PTR(-EINVAL); goto bail; } @@ -165,6 +177,8 @@ struct ib_srq *ipath_create_srq(struct i ret = &srq->ibsrq; + dev->n_srqs_allocated++; + bail: return ret; } @@ -182,24 +196,26 @@ int ipath_modify_srq(struct ib_srq *ibsr unsigned long flags; int ret; - if (attr_mask & IB_SRQ_LIMIT) { - spin_lock_irqsave(&srq->rq.lock, flags); - srq->limit = attr->srq_limit; - spin_unlock_irqrestore(&srq->rq.lock, flags); - } + if (attr_mask & IB_SRQ_MAX_WR) + if ((attr->max_wr > ib_ipath_max_srq_wrs) || + (attr->max_sge > srq->rq.max_sge)) { + ret = -EINVAL; + goto bail; + } + + if (attr_mask & IB_SRQ_LIMIT) + if (attr->srq_limit >= srq->rq.size) { + ret = -EINVAL; + goto bail; + } + if (attr_mask & IB_SRQ_MAX_WR) { - u32 size = attr->max_wr + 1; struct ipath_rwqe *wq, *p; - u32 n; - u32 sz; - - if (attr->max_sge < srq->rq.max_sge) { - ret = -EINVAL; - goto bail; - } + u32 sz, size, n; sz = sizeof(struct ipath_rwqe) + attr->max_sge * sizeof(struct ipath_sge); + size = attr->max_wr + 1; wq = vmalloc(size * sz); if (!wq) { ret = -ENOMEM; @@ -243,6 +259,11 @@ int ipath_modify_srq(struct ib_srq *ibsr spin_unlock_irqrestore(&srq->rq.lock, flags); } + if (attr_mask & IB_SRQ_LIMIT) { + spin_lock_irqsave(&srq->rq.lock, flags); + srq->limit = attr->srq_limit; + spin_unlock_irqrestore(&srq->rq.lock, flags); + } ret = 0; bail: @@ -266,7 +287,9 @@ int ipath_destroy_srq(struct ib_srq *ibs int ipath_destroy_srq(struct 
ib_srq *ibsrq) { struct ipath_srq *srq = to_isrq(ibsrq); - + struct ipath_ibdev *dev = to_idev(ibsrq->device); + + dev->n_srqs_allocated--; vfree(srq->rq.wq); kfree(srq); diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:25 2006 -0700 @@ -55,6 +55,59 @@ unsigned int ib_ipath_debug; /* debug ma unsigned int ib_ipath_debug; /* debug mask */ module_param_named(debug, ib_ipath_debug, uint, S_IWUSR | S_IRUGO); MODULE_PARM_DESC(debug, "Verbs debug mask"); + +static unsigned int ib_ipath_max_pds = 0xFFFF; +module_param_named(max_pds, ib_ipath_max_pds, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_pds, + "Maximum number of protection domains to support"); + +static unsigned int ib_ipath_max_ahs = 0xFFFF; +module_param_named(max_ahs, ib_ipath_max_ahs, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_ahs, "Maximum number of address handles to support"); + +unsigned int ib_ipath_max_cqes = 0x2FFFF; +module_param_named(max_cqes, ib_ipath_max_cqes, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_cqes, + "Maximum number of completion queue entries to support"); + +unsigned int ib_ipath_max_cqs = 0x1FFFF; +module_param_named(max_cqs, ib_ipath_max_cqs, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_cqs, "Maximum number of completion queues to support"); + +unsigned int ib_ipath_max_qp_wrs = 0x3FFF; +module_param_named(max_qp_wrs, ib_ipath_max_qp_wrs, uint, + S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_qp_wrs, "Maximum number of QP WRs to support"); + +unsigned int ib_ipath_max_sges = 0x60; +module_param_named(max_sges, ib_ipath_max_sges, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_sges, "Maximum number of SGEs to support"); + +unsigned int ib_ipath_max_mcast_grps = 16384; +module_param_named(max_mcast_grps, ib_ipath_max_mcast_grps, uint, + S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_mcast_grps, + "Maximum 
number of multicast groups to support"); + +unsigned int ib_ipath_max_mcast_qp_attached = 16; +module_param_named(max_mcast_qp_attached, ib_ipath_max_mcast_qp_attached, + uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_mcast_qp_attached, + "Maximum number of attached QPs to support"); + +unsigned int ib_ipath_max_srqs = 1024; +module_param_named(max_srqs, ib_ipath_max_srqs, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_srqs, "Maximum number of SRQs to support"); + +unsigned int ib_ipath_max_srq_sges = 128; +module_param_named(max_srq_sges, ib_ipath_max_srq_sges, + uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_srq_sges, "Maximum number of SRQ SGEs to support"); + +unsigned int ib_ipath_max_srq_wrs = 0x1FFFF; +module_param_named(max_srq_wrs, ib_ipath_max_srq_wrs, + uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_srq_wrs, "Maximum number of SRQ WRs support"); MODULE_LICENSE("GPL"); MODULE_AUTHOR("QLogic "); @@ -581,24 +634,25 @@ static int ipath_query_device(struct ib_ props->sys_image_guid = dev->sys_image_guid; props->max_mr_size = ~0ull; - props->max_qp = 0xffff; - props->max_qp_wr = 0xffff; - props->max_sge = 255; - props->max_cq = 0xffff; - props->max_cqe = 0xffff; - props->max_mr = 0xffff; - props->max_pd = 0xffff; + props->max_qp = dev->qp_table.max; + props->max_qp_wr = ib_ipath_max_qp_wrs; + props->max_sge = ib_ipath_max_sges; + props->max_cq = ib_ipath_max_cqs; + props->max_ah = ib_ipath_max_ahs; + props->max_cqe = ib_ipath_max_cqes; + props->max_mr = dev->lk_table.max; + props->max_pd = ib_ipath_max_pds; props->max_qp_rd_atom = 1; props->max_qp_init_rd_atom = 1; /* props->max_res_rd_atom */ - props->max_srq = 0xffff; - props->max_srq_wr = 0xffff; - props->max_srq_sge = 255; + props->max_srq = ib_ipath_max_srqs; + props->max_srq_wr = ib_ipath_max_srq_wrs; + props->max_srq_sge = ib_ipath_max_srq_sges; /* props->local_ca_ack_delay */ props->atomic_cap = IB_ATOMIC_HCA; props->max_pkeys = ipath_layer_get_npkeys(dev->dd); - props->max_mcast_grp = 0xffff; - 
props->max_mcast_qp_attach = 0xffff; + props->max_mcast_grp = ib_ipath_max_mcast_grps; + props->max_mcast_qp_attach = ib_ipath_max_mcast_qp_attached; props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * props->max_mcast_grp; @@ -741,8 +795,21 @@ static struct ib_pd *ipath_alloc_pd(stru struct ib_ucontext *context, struct ib_udata *udata) { + struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_pd *pd; struct ib_pd *ret; + + /* + * This is actually totally arbitrary. Some correctness tests + * assume there's a maximum number of PDs that can be allocated. + * We don't actually have this limit, but we fail the test if + * we allow allocations of more than we report for this value. + */ + + if (dev->n_pds_allocated == ib_ipath_max_pds) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } pd = kmalloc(sizeof *pd, GFP_KERNEL); if (!pd) { @@ -750,6 +817,8 @@ static struct ib_pd *ipath_alloc_pd(stru goto bail; } + dev->n_pds_allocated++; + /* ib_alloc_pd() will initialize pd->ibpd. */ pd->user = udata != NULL; @@ -762,6 +831,9 @@ static int ipath_dealloc_pd(struct ib_pd static int ipath_dealloc_pd(struct ib_pd *ibpd) { struct ipath_pd *pd = to_ipd(ibpd); + struct ipath_ibdev *dev = to_idev(ibpd->device); + + dev->n_pds_allocated--; kfree(pd); @@ -780,6 +852,12 @@ static struct ib_ah *ipath_create_ah(str { struct ipath_ah *ah; struct ib_ah *ret; + struct ipath_ibdev *dev = to_idev(pd->device); + + if (dev->n_ahs_allocated == ib_ipath_max_ahs) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } /* A multicast address requires a GRH (see ch. 8.4.1). */ if (ah_attr->dlid >= IPS_MULTICAST_LID_BASE && @@ -794,7 +872,7 @@ static struct ib_ah *ipath_create_ah(str goto bail; } - if (ah_attr->port_num != 1 || + if (ah_attr->port_num < 1 || ah_attr->port_num > pd->device->phys_port_cnt) { ret = ERR_PTR(-EINVAL); goto bail; @@ -806,6 +884,8 @@ static struct ib_ah *ipath_create_ah(str goto bail; } + dev->n_ahs_allocated++; + /* ib_create_ah() will initialize ah->ibah. 
*/ ah->attr = *ah_attr; @@ -823,7 +903,10 @@ bail: */ static int ipath_destroy_ah(struct ib_ah *ibah) { + struct ipath_ibdev *dev = to_idev(ibah->device); struct ipath_ah *ah = to_iah(ibah); + + dev->n_ahs_allocated--; kfree(ah); diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:25 2006 -0700 @@ -149,6 +149,7 @@ struct ipath_mcast { struct list_head qp_list; wait_queue_head_t wait; atomic_t refcount; + int n_attached; }; /* Memory region */ @@ -432,6 +433,11 @@ struct ipath_ibdev { __be64 sys_image_guid; /* in network order */ __be64 gid_prefix; /* in network order */ __be64 mkey; + u32 n_pds_allocated; /* number of PDs allocated for device */ + u32 n_ahs_allocated; /* number of AHs allocated for device */ + u32 n_cqs_allocated; /* number of CQs allocated for device */ + u32 n_srqs_allocated; /* number of SRQs allocated for device */ + u32 n_mcast_grps_allocated; /* number of mcast groups allocated */ u64 ipath_sword; /* total dwords sent (sample result) */ u64 ipath_rword; /* total dwords received (sample result) */ u64 ipath_spkts; /* total packets sent (sample result) */ @@ -697,6 +703,24 @@ extern const int ib_ipath_state_ops[]; extern unsigned int ib_ipath_lkey_table_size; +extern unsigned int ib_ipath_max_cqes; + +extern unsigned int ib_ipath_max_cqs; + +extern unsigned int ib_ipath_max_qp_wrs; + +extern unsigned int ib_ipath_max_sges; + +extern unsigned int ib_ipath_max_mcast_grps; + +extern unsigned int ib_ipath_max_mcast_qp_attached; + +extern unsigned int ib_ipath_max_srqs; + +extern unsigned int ib_ipath_max_srq_sges; + +extern unsigned int ib_ipath_max_srq_wrs; + extern const u32 ib_ipath_rnr_table[]; #endif /* IPATH_VERBS_H */ diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_verbs_mcast.c --- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c Thu Jun 29 14:33:25 
2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c Thu Jun 29 14:33:25 2006 -0700 @@ -93,6 +93,7 @@ static struct ipath_mcast *ipath_mcast_a INIT_LIST_HEAD(&mcast->qp_list); init_waitqueue_head(&mcast->wait); atomic_set(&mcast->refcount, 0); + mcast->n_attached = 0; bail: return mcast; @@ -158,7 +159,8 @@ bail: * the table but the QP was added. Return ESRCH if the QP was already * attached and neither structure was added. */ -static int ipath_mcast_add(struct ipath_mcast *mcast, +static int ipath_mcast_add(struct ipath_ibdev *dev, + struct ipath_mcast *mcast, struct ipath_mcast_qp *mqp) { struct rb_node **n = &mcast_tree.rb_node; @@ -189,16 +191,28 @@ static int ipath_mcast_add(struct ipath_ /* Search the QP list to see if this is already there. */ list_for_each_entry_rcu(p, &tmcast->qp_list, list) { if (p->qp == mqp->qp) { - spin_unlock_irqrestore(&mcast_lock, flags); ret = ESRCH; goto bail; } } + if (tmcast->n_attached == ib_ipath_max_mcast_qp_attached) { + ret = ENOMEM; + goto bail; + } + + tmcast->n_attached++; + list_add_tail_rcu(&mqp->list, &tmcast->qp_list); - spin_unlock_irqrestore(&mcast_lock, flags); ret = EEXIST; goto bail; } + + if (dev->n_mcast_grps_allocated == ib_ipath_max_mcast_grps) { + ret = ENOMEM; + goto bail; + } + + dev->n_mcast_grps_allocated++; list_add_tail_rcu(&mqp->list, &mcast->qp_list); @@ -206,17 +220,18 @@ static int ipath_mcast_add(struct ipath_ rb_link_node(&mcast->rb_node, pn, n); rb_insert_color(&mcast->rb_node, &mcast_tree); + ret = 0; + +bail: spin_unlock_irqrestore(&mcast_lock, flags); - ret = 0; - -bail: return ret; } int ipath_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_mcast *mcast; struct ipath_mcast_qp *mqp; int ret; @@ -236,7 +251,7 @@ int ipath_multicast_attach(struct ib_qp ret = -ENOMEM; goto bail; } - switch (ipath_mcast_add(mcast, mqp)) { + switch (ipath_mcast_add(dev, mcast, mqp)) { 
case ESRCH: /* Neither was used: can't attach the same QP twice. */ ipath_mcast_qp_free(mqp); @@ -246,6 +261,12 @@ int ipath_multicast_attach(struct ib_qp case EEXIST: /* The mcast wasn't used */ ipath_mcast_free(mcast); break; + case ENOMEM: + /* Exceeded the maximum number of mcast groups. */ + ipath_mcast_qp_free(mqp); + ipath_mcast_free(mcast); + ret = -ENOMEM; + goto bail; default: break; } @@ -259,6 +280,7 @@ int ipath_multicast_detach(struct ib_qp int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_mcast *mcast = NULL; struct ipath_mcast_qp *p, *tmp; struct rb_node *n; @@ -297,6 +319,7 @@ int ipath_multicast_detach(struct ib_qp * link until we are sure there are no list walkers. */ list_del_rcu(&p->list); + mcast->n_attached--; /* If this was the last attached QP, remove the GID too. */ if (list_empty(&mcast->qp_list)) { @@ -320,6 +343,7 @@ int ipath_multicast_detach(struct ib_qp atomic_dec(&mcast->refcount); wait_event(mcast->wait, !atomic_read(&mcast->refcount)); ipath_mcast_free(mcast); + dev->n_mcast_grps_allocated--; } ret = 0; From bos at pathscale.com Thu Jun 29 14:40:56 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:40:56 -0700 Subject: [openib-general] [PATCH 5 of 39] IB/ipath - fix shared receive queues for RC In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r c93c2b42d279 -r e4f29a4e0c0f drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 @@ -257,7 +257,7 @@ int ipath_make_rc_req(struct ipath_qp *q break; case IB_WR_RDMA_WRITE: - if (newreq) + if (newreq && qp->s_lsn != (u32) -1) qp->s_lsn++; /* FALLTHROUGH */ case IB_WR_RDMA_WRITE_WITH_IMM: @@ -283,8 +283,7 @@ int ipath_make_rc_req(struct ipath_qp *q else { 
qp->s_state = OP(RDMA_WRITE_ONLY_WITH_IMMEDIATE); - /* Immediate data comes - * after RETH */ + /* Immediate data comes after RETH */ ohdr->u.rc.imm_data = wqe->wr.imm_data; hwords += 1; if (wqe->wr.send_flags & IB_SEND_SOLICITED) @@ -304,7 +303,8 @@ int ipath_make_rc_req(struct ipath_qp *q qp->s_state = OP(RDMA_READ_REQUEST); hwords += sizeof(ohdr->u.rc.reth) / 4; if (newreq) { - qp->s_lsn++; + if (qp->s_lsn != (u32) -1) + qp->s_lsn++; /* * Adjust s_next_psn to count the * expected number of responses. @@ -335,7 +335,8 @@ int ipath_make_rc_req(struct ipath_qp *q wqe->wr.wr.atomic.compare_add); hwords += sizeof(struct ib_atomic_eth) / 4; if (newreq) { - qp->s_lsn++; + if (qp->s_lsn != (u32) -1) + qp->s_lsn++; wqe->lpsn = wqe->psn; } if (++qp->s_cur == qp->s_size) @@ -553,6 +554,88 @@ static void send_rc_ack(struct ipath_qp } /** + * reset_psn - reset the QP state to send starting from PSN + * @qp: the QP + * @psn: the packet sequence number to restart at + * + * This is called from ipath_rc_rcv() to process an incoming RC ACK + * for the given QP. + * Called at interrupt level with the QP s_lock held. + */ +static void reset_psn(struct ipath_qp *qp, u32 psn) +{ + u32 n = qp->s_last; + struct ipath_swqe *wqe = get_swqe_ptr(qp, n); + u32 opcode; + + qp->s_cur = n; + + /* + * If we are starting the request from the beginning, + * let the normal send code handle initialization. + */ + if (ipath_cmp24(psn, wqe->psn) <= 0) { + qp->s_state = OP(SEND_LAST); + goto done; + } + + /* Find the work request opcode corresponding to the given PSN. */ + opcode = wqe->wr.opcode; + for (;;) { + int diff; + + if (++n == qp->s_size) + n = 0; + if (n == qp->s_tail) + break; + wqe = get_swqe_ptr(qp, n); + diff = ipath_cmp24(psn, wqe->psn); + if (diff < 0) + break; + qp->s_cur = n; + /* + * If we are starting the request from the beginning, + * let the normal send code handle initialization. 
+ */ + if (diff == 0) { + qp->s_state = OP(SEND_LAST); + goto done; + } + opcode = wqe->wr.opcode; + } + + /* + * Set the state to restart in the middle of a request. + * Don't change the s_sge, s_cur_sge, or s_cur_size. + * See ipath_do_rc_send(). + */ + switch (opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + qp->s_state = OP(RDMA_READ_RESPONSE_FIRST); + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + qp->s_state = OP(RDMA_READ_RESPONSE_LAST); + break; + + case IB_WR_RDMA_READ: + qp->s_state = OP(RDMA_READ_RESPONSE_MIDDLE); + break; + + default: + /* + * This case shouldn't happen since its only + * one PSN per req. + */ + qp->s_state = OP(SEND_LAST); + } +done: + qp->s_psn = psn; +} + +/** * ipath_restart_rc - back up requester to resend the last un-ACKed request * @qp: the QP to restart * @psn: packet sequence number for the request @@ -564,7 +647,6 @@ void ipath_restart_rc(struct ipath_qp *q { struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); struct ipath_ibdev *dev; - u32 n; /* * If there are no requests pending, we are done. @@ -606,130 +688,13 @@ void ipath_restart_rc(struct ipath_qp *q else dev->n_rc_resends += (int)qp->s_psn - (int)psn; - /* - * If we are starting the request from the beginning, let the normal - * send code handle initialization. - */ - qp->s_cur = qp->s_last; - if (ipath_cmp24(psn, wqe->psn) <= 0) { - qp->s_state = OP(SEND_LAST); - qp->s_psn = wqe->psn; - } else { - n = qp->s_cur; - for (;;) { - if (++n == qp->s_size) - n = 0; - if (n == qp->s_tail) { - if (ipath_cmp24(psn, qp->s_next_psn) >= 0) { - qp->s_cur = n; - wqe = get_swqe_ptr(qp, n); - } - break; - } - wqe = get_swqe_ptr(qp, n); - if (ipath_cmp24(psn, wqe->psn) < 0) - break; - qp->s_cur = n; - } - qp->s_psn = psn; - - /* - * Reset the state to restart in the middle of a request. - * Don't change the s_sge, s_cur_sge, or s_cur_size. - * See ipath_do_rc_send(). 
- */ - switch (wqe->wr.opcode) { - case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_FIRST); - break; - - case IB_WR_RDMA_WRITE: - case IB_WR_RDMA_WRITE_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_LAST); - break; - - case IB_WR_RDMA_READ: - qp->s_state = - OP(RDMA_READ_RESPONSE_MIDDLE); - break; - - default: - /* - * This case shouldn't happen since its only - * one PSN per req. - */ - qp->s_state = OP(SEND_LAST); - } - } + reset_psn(qp, psn); done: tasklet_hi_schedule(&qp->s_task); bail: return; -} - -/** - * reset_psn - reset the QP state to send starting from PSN - * @qp: the QP - * @psn: the packet sequence number to restart at - * - * This is called from ipath_rc_rcv_resp() to process an incoming RC ACK - * for the given QP. - * Called at interrupt level with the QP s_lock held. - */ -static void reset_psn(struct ipath_qp *qp, u32 psn) -{ - struct ipath_swqe *wqe; - u32 n; - - n = qp->s_cur; - wqe = get_swqe_ptr(qp, n); - for (;;) { - if (++n == qp->s_size) - n = 0; - if (n == qp->s_tail) { - if (ipath_cmp24(psn, qp->s_next_psn) >= 0) { - qp->s_cur = n; - wqe = get_swqe_ptr(qp, n); - } - break; - } - wqe = get_swqe_ptr(qp, n); - if (ipath_cmp24(psn, wqe->psn) < 0) - break; - qp->s_cur = n; - } - qp->s_psn = psn; - - /* - * Set the state to restart in the middle of a - * request. Don't change the s_sge, s_cur_sge, or - * s_cur_size. See ipath_do_rc_send(). - */ - switch (wqe->wr.opcode) { - case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_FIRST); - break; - - case IB_WR_RDMA_WRITE: - case IB_WR_RDMA_WRITE_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_LAST); - break; - - case IB_WR_RDMA_READ: - qp->s_state = OP(RDMA_READ_RESPONSE_MIDDLE); - break; - - default: - /* - * This case shouldn't happen since its only - * one PSN per req. 
- */ - qp->s_state = OP(SEND_LAST); - } } /** @@ -738,7 +703,7 @@ static void reset_psn(struct ipath_qp *q * @psn: the packet sequence number of the ACK * @opcode: the opcode of the request that resulted in the ACK * - * This is called from ipath_rc_rcv() to process an incoming RC ACK + * This is called from ipath_rc_rcv_resp() to process an incoming RC ACK * for the given QP. * Called at interrupt level with the QP s_lock held. * Returns 1 if OK, 0 if current operation should be aborted (NAK). @@ -877,22 +842,12 @@ static int do_rc_ack(struct ipath_qp *qp if (qp->s_last == qp->s_tail) goto bail; - /* The last valid PSN seen is the previous request's. */ - qp->s_last_psn = wqe->psn - 1; + /* The last valid PSN is the previous PSN. */ + qp->s_last_psn = psn - 1; dev->n_rc_resends += (int)qp->s_psn - (int)psn; - /* - * If we are starting the request from the beginning, let - * the normal send code handle initialization. - */ - qp->s_cur = qp->s_last; - wqe = get_swqe_ptr(qp, qp->s_cur); - if (ipath_cmp24(psn, wqe->psn) <= 0) { - qp->s_state = OP(SEND_LAST); - qp->s_psn = wqe->psn; - } else - reset_psn(qp, psn); + reset_psn(qp, psn); qp->s_rnr_timeout = ib_ipath_rnr_table[(aeth >> IPS_AETH_CREDIT_SHIFT) & @@ -1070,9 +1025,10 @@ static inline void ipath_rc_rcv_resp(str &dev->pending[dev->pending_index]); spin_unlock(&dev->pending_lock); /* - * Update the RDMA receive state but do the copy w/o holding the - * locks and blocking interrupts. XXX Yet another place that - * affects relaxed RDMA order since we don't want s_sge modified. + * Update the RDMA receive state but do the copy w/o + * holding the locks and blocking interrupts. + * XXX Yet another place that affects relaxed RDMA order + * since we don't want s_sge modified. */ qp->s_len -= pmtu; qp->s_last_psn = psn; @@ -1119,9 +1075,12 @@ static inline void ipath_rc_rcv_resp(str if (do_rc_ack(qp, aeth, psn, OP(RDMA_READ_RESPONSE_LAST))) { /* * Change the state so we contimue - * processing new requests. 
+ * processing new requests and wake up the + * tasklet if there are posted sends. */ qp->s_state = OP(SEND_LAST); + if (qp->s_tail != qp->s_head) + tasklet_hi_schedule(&qp->s_task); } goto ack_done; } From bos at pathscale.com Thu Jun 29 14:41:09 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:09 -0700 Subject: [openib-general] [PATCH 18 of 39] IB/ipath - use vmalloc to allocate struct ipath_devdata In-Reply-To: Message-ID: <9c072f8e7e68131f1c7e.1151617269@eng-12.pathscale.com> This is not a DMA target, so no need to use dma_alloc_coherent on it. Signed-off-by: Bryan O'Sullivan diff -r 9d943b828776 -r 9c072f8e7e68 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -171,14 +171,13 @@ static void ipath_free_devdata(struct pc list_del(&dd->ipath_list); spin_unlock_irqrestore(&ipath_devs_lock, flags); } - dma_free_coherent(&pdev->dev, sizeof(*dd), dd, dd->ipath_dma_addr); + vfree(dd); } static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev) { unsigned long flags; struct ipath_devdata *dd; - dma_addr_t dma_addr; int ret; if (!idr_pre_get(&unit_table, GFP_KERNEL)) { @@ -186,15 +185,12 @@ static struct ipath_devdata *ipath_alloc goto bail; } - dd = dma_alloc_coherent(&pdev->dev, sizeof(*dd), &dma_addr, - GFP_KERNEL); - + dd = vmalloc(sizeof(*dd)); if (!dd) { dd = ERR_PTR(-ENOMEM); goto bail; } - - dd->ipath_dma_addr = dma_addr; + memset(dd, 0, sizeof(*dd)); dd->ipath_unit = -1; spin_lock_irqsave(&ipath_devs_lock, flags); diff -r 9d943b828776 -r 9c072f8e7e68 drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:25 2006 -0700 @@ -163,7 +163,6 @@ struct ipath_devdata { * only written to by the chip, not the driver. 
*/ volatile __le64 *ipath_hdrqtailptr; - dma_addr_t ipath_dma_addr; /* ipath_cfgports pointers */ struct ipath_portdata **ipath_pd; /* sk_buffs used by port 0 eager receive queue */ From bos at pathscale.com Thu Jun 29 14:41:06 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:06 -0700 Subject: [openib-general] [PATCH 15 of 39] IB/ipath - print better debug info when handling 32/64-bit DMA mask problems In-Reply-To: Message-ID: <125471ee6c6863fbfa35.1151617266@eng-12.pathscale.com> Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r e43b4df874a9 -r 125471ee6c68 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -425,12 +425,29 @@ static int __devinit ipath_init_one(stru */ ret = pci_set_dma_mask(pdev, DMA_32BIT_MASK); if (ret) { - dev_info(&pdev->dev, "pci_set_dma_mask unit %u " - "fails: %d\n", dd->ipath_unit, ret); + dev_info(&pdev->dev, + "Unable to set DMA mask for unit %u: %d\n", + dd->ipath_unit, ret); goto bail_regions; } - else + else { ipath_dbg("No 64bit DMA mask, used 32 bit mask\n"); + ret = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (ret) + dev_info(&pdev->dev, + "Unable to set DMA consistent mask " + "for unit %u: %d\n", + dd->ipath_unit, ret); + + } + } + else { + ret = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (ret) + dev_info(&pdev->dev, + "Unable to set DMA consistent mask " + "for unit %u: %d\n", + dd->ipath_unit, ret); } pci_set_master(pdev); From bos at pathscale.com Thu Jun 29 14:41:14 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:14 -0700 Subject: [openib-general] [PATCH 23 of 39] IB/ipath - disallow send of invalid packet sizes over UD In-Reply-To: Message-ID: <8e39364c2402304872e6.1151617274@eng-12.pathscale.com> Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 
811021b6c112 -r 8e39364c2402 drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 @@ -274,6 +274,11 @@ int ipath_post_ud_send(struct ipath_qp * } len += wr->sg_list[i].length; ss.num_sge++; + } + /* Check for invalid packet size. */ + if (len > ipath_layer_get_ibmtu(dev->dd)) { + ret = -EINVAL; + goto bail; } extra_bytes = (4 - len) & 3; nwords = (len + extra_bytes) >> 2; From bos at pathscale.com Thu Jun 29 14:41:15 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:15 -0700 Subject: [openib-general] [PATCH 24 of 39] IB/ipath - don't confuse the max message size with the MTU In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 8e39364c2402 -r e952aedb0e94 drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 @@ -695,7 +695,7 @@ static int ipath_query_port(struct ib_de ipath_layer_get_lastibcstat(dev->dd) & 0xf]; props->port_cap_flags = dev->port_cap_flags; props->gid_tbl_len = 1; - props->max_msg_sz = 4096; + props->max_msg_sz = 0x80000000; props->pkey_tbl_len = ipath_layer_get_npkeys(dev->dd); props->bad_pkey_cntr = ipath_layer_get_cr_errpkey(dev->dd) - dev->z_pkey_violations; From bos at pathscale.com Thu Jun 29 14:41:19 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:19 -0700 Subject: [openib-general] [PATCH 28 of 39] IB/ipath - Fixes a bug where our delay for EEPROM no longer works due to compiler reordering In-Reply-To: Message-ID: <5f3c0b2d446d78e3327f.1151617279@eng-12.pathscale.com> The mb() prevents the compiler from reordering on this function, with some versions of gcc and -Os optimization. The result is random failures in the EEPROM read without this change. 
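The hazard this mb() closes can be sketched in userspace: without a barrier, the compiler is free to hoist the dummy flush-read above the writes that came before it. A minimal sketch, assuming a GCC-style full barrier; the register variables and function names here are illustrative stand-ins, not the driver's real API.

```c
#include <stdint.h>

/* Stand-ins for chip registers; illustrative only. */
static volatile uint32_t gpio_out;
static volatile uint32_t scratch;

/* Full memory barrier, roughly what mb() expands to. */
#define mb() __sync_synchronize()

static void i2c_gpio_set(uint32_t v)
{
	gpio_out = v;		/* write that must reach the chip first */
}

static uint32_t i2c_wait_for_writes(void)
{
	/*
	 * The mb() keeps the compiler (and CPU) from reordering the
	 * dummy read above earlier writes; the read itself flushes
	 * posted writes out to the device.
	 */
	mb();
	return scratch;
}
```

With -Os, some gcc versions would otherwise move the read, which matches the random EEPROM read failures described above.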
Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 7d22a8963bda -r 5f3c0b2d446d drivers/infiniband/hw/ipath/ipath_eeprom.c --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 @@ -186,6 +186,7 @@ bail: */ static void i2c_wait_for_writes(struct ipath_devdata *dd) { + mb(); (void)ipath_read_kreg32(dd, dd->ipath_kregs->kr_scratch); } From bos at pathscale.com Thu Jun 29 14:41:08 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:08 -0700 Subject: [openib-general] [PATCH 17 of 39] IB/ipath - use more appropriate gfp flags In-Reply-To: Message-ID: <9d943b828776136a2bb7.1151617268@eng-12.pathscale.com> This helps us to survive better when memory is fragmented. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r fd5e733f02ac -r 9d943b828776 drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 @@ -705,6 +705,15 @@ static int ipath_create_user_egr(struct unsigned e, egrcnt, alloced, egrperchunk, chunk, egrsize, egroff; size_t size; int ret; + gfp_t gfp_flags; + + /* + * GFP_USER, but without GFP_FS, so buffer cache can be + * coalesced (we hope); otherwise, even at order 4, + * heavy filesystem activity makes these fail, and we can + * use compound pages. + */ + gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP; egrcnt = dd->ipath_rcvegrcnt; /* TID number offset for this port */ @@ -721,10 +730,8 @@ static int ipath_create_user_egr(struct * memory pressure (creating large files and then copying them over * NFS while doing lots of MPI jobs), we hit some allocation * failures, even though we can sleep... (2.6.10) Still get - * failures at 64K. 32K is the lowest we can go without waiting - * more memory again. 
It seems likely that the coalescing in - * free_pages, etc. still has issues (as it has had previously - * during 2.6.x development). + * failures at 64K. 32K is the lowest we can go without wasting + * additional memory. */ size = 0x8000; alloced = ALIGN(egrsize * egrcnt, size); @@ -745,12 +752,6 @@ static int ipath_create_user_egr(struct goto bail_rcvegrbuf; } for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) { - /* - * GFP_USER, but without GFP_FS, so buffer cache can be - * coalesced (we hope); otherwise, even at order 4, - * heavy filesystem activity makes these fail - */ - gfp_t gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP; pd->port_rcvegrbuf[e] = dma_alloc_coherent( &dd->pcidev->dev, size, &pd->port_rcvegrbuf_phys[e], @@ -1167,9 +1168,10 @@ static int ipath_mmap(struct file *fp, s ureg = dd->ipath_uregbase + dd->ipath_palign * pd->port_port; - ipath_cdbg(MM, "ushare: pgaddr %llx vm_start=%lx, vmlen %lx\n", + ipath_cdbg(MM, "pgaddr %llx vm_start=%lx len %lx port %u:%u\n", (unsigned long long) pgaddr, vma->vm_start, - vma->vm_end - vma->vm_start); + vma->vm_end - vma->vm_start, dd->ipath_unit, + pd->port_port); if (pgaddr == ureg) ret = mmap_ureg(vma, dd, ureg); From bos at pathscale.com Thu Jun 29 14:41:16 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:16 -0700 Subject: [openib-general] [PATCH 25 of 39] IB/ipath - removed redundant statements In-Reply-To: Message-ID: <4c581c37bb95ad3abb6d.1151617276@eng-12.pathscale.com> The tail register read became redundant as the result of earlier receive interrupt bug fixes. Drop another unneeded register read. And another line that got duplicated. 
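The pattern being removed is re-reading a status register whose read is itself a bus transaction. A userspace sketch of the read-once idiom, with a counter standing in for the cost of each chip read; the names are made up for illustration:

```c
#include <stdint.h>

static unsigned reads;			/* counts simulated bus transactions */

static uint32_t read_intstatus(void)
{
	reads++;			/* each call is a chip register read */
	return 0x4;			/* pretend one interrupt bit is set */
}

static uint32_t handle_interrupt(void)
{
	/* Read the status once and reuse the value, rather than
	 * issuing the duplicated read the patch deletes. */
	uint32_t istat = read_intstatus();

	return istat;
}
```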
Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r e952aedb0e94 -r 4c581c37bb95 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -890,9 +890,6 @@ void ipath_kreceive(struct ipath_devdata goto done; reloop: - /* read only once at start for performance */ - hdrqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); - for (i = 0; l != hdrqtail; i++) { u32 qp; u8 *bthbytes; diff -r e952aedb0e94 -r 4c581c37bb95 drivers/infiniband/hw/ipath/ipath_ht400.c --- a/drivers/infiniband/hw/ipath/ipath_ht400.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ht400.c Thu Jun 29 14:33:26 2006 -0700 @@ -1573,7 +1573,6 @@ void ipath_init_ht400_funcs(struct ipath dd->ipath_f_reset = ipath_setup_ht_reset; dd->ipath_f_get_boardname = ipath_ht_boardname; dd->ipath_f_init_hwerrors = ipath_ht_init_hwerrors; - dd->ipath_f_init_hwerrors = ipath_ht_init_hwerrors; dd->ipath_f_early_init = ipath_ht_early_init; dd->ipath_f_handle_hwerrors = ipath_ht_handle_hwerrors; dd->ipath_f_quiet_serdes = ipath_ht_quiet_serdes; diff -r e952aedb0e94 -r 4c581c37bb95 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -824,7 +824,6 @@ irqreturn_t ipath_intr(int irq, void *da ipath_stats.sps_fastrcvint++; goto done; } - istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); } istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); From bos at pathscale.com Thu Jun 29 14:41:24 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:24 -0700 Subject: [openib-general] [PATCH 33 of 39] IB/ipath - read/write correct sizes through diag interface In-Reply-To: Message-ID: We must increment uaddr by size we are reading or writing, since it's passed as a 
char *, not a pointer to the appropriate size. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 8fbb5d71823a -r a7c1ad1e090b drivers/infiniband/hw/ipath/ipath_diag.c --- a/drivers/infiniband/hw/ipath/ipath_diag.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_diag.c Thu Jun 29 14:33:26 2006 -0700 @@ -115,7 +115,7 @@ static int ipath_read_umem64(struct ipat goto bail; } reg_addr++; - uaddr++; + uaddr += sizeof(u64); } ret = 0; bail: @@ -154,7 +154,7 @@ static int ipath_write_umem64(struct ipa writeq(data, reg_addr); reg_addr++; - uaddr++; + uaddr += sizeof(u64); } ret = 0; bail: @@ -192,7 +192,8 @@ static int ipath_read_umem32(struct ipat } reg_addr++; - uaddr++; + uaddr += sizeof(u32); + } ret = 0; bail: @@ -231,7 +232,7 @@ static int ipath_write_umem32(struct ipa writel(data, reg_addr); reg_addr++; - uaddr++; + uaddr += sizeof(u32); } ret = 0; bail: From bos at pathscale.com Thu Jun 29 14:41:17 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:17 -0700 Subject: [openib-general] [PATCH 26 of 39] IB/ipath - check for valid LID and multicast LIDs In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 4c581c37bb95 -r eef7f8021500 drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 @@ -280,7 +280,7 @@ static ssize_t store_lid(struct device * if (ret < 0) goto invalid; - if (lid == 0 || lid >= 0xc000) { + if (lid == 0 || lid >= IPS_MULTICAST_LID_BASE) { ret = -EINVAL; goto invalid; } @@ -314,7 +314,7 @@ static ssize_t store_mlid(struct device int ret; ret = ipath_parse_ushort(buf, &mlid); - if (ret < 0) + if (ret < 0 || mlid < IPS_MULTICAST_LID_BASE) goto invalid; unit = dd->ipath_unit; From bos at pathscale.com Thu Jun 29 14:41:12 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:12 
-0700 Subject: [openib-general] [PATCH 21 of 39] IB/ipath - fixed bug 9776 for real. The problem was that I was updating In-Reply-To: Message-ID: <1a4350d895c9a673c98e.1151617272@eng-12.pathscale.com> the head register multiple times in the rcvhdrq processing loop, and setting the counter on each update. Since that meant that the tail register was ahead of head for all but the last update, we would get extra interrupts. The fix was to not write the counter value except on the last update. I also changed to update rcvhdrhead and rcvegrindexhead at most every 16 packets, if there were lots of packets in the queue (and of course, on the last packet, regardless). I also made some small cleanups while debugging this. With these changes, xeon/monty typically sees two openib packets per interrupt on sdp and ipoib, opteron/monty is about 1.25 pkts/intr. I'm seeing about 3800 Mbit/s monty/xeon, and 5000-5100 opteron/monty with netperf sdp. Netpipe doesn't show as good as that, peaking at about 4400 on opteron/monty sdp. 
Plain ipoib xeon is about 2100+ netperf, opteron 2900+, at 128KB Signed-off-by: olson at eng-12.pathscale.com Signed-off-by: Bryan O'Sullivan diff -r 8bc865893a11 -r 1a4350d895c9 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -870,7 +870,7 @@ void ipath_kreceive(struct ipath_devdata const u32 maxcnt = dd->ipath_rcvhdrcnt * rsize; /* words */ u32 etail = -1, l, hdrqtail; struct ips_message_header *hdr; - u32 eflags, i, etype, tlen, pkttot = 0; + u32 eflags, i, etype, tlen, pkttot = 0, updegr=0; static u64 totcalls; /* stats, may eventually remove */ char emsg[128]; @@ -884,14 +884,14 @@ void ipath_kreceive(struct ipath_devdata if (test_and_set_bit(0, &dd->ipath_rcv_pending)) goto bail; - if (dd->ipath_port0head == - (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) + l = dd->ipath_port0head; + if (l == (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) goto done; /* read only once at start for performance */ hdrqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); - for (i = 0, l = dd->ipath_port0head; l != hdrqtail; i++) { + for (i = 0; l != hdrqtail; i++) { u32 qp; u8 *bthbytes; @@ -1002,15 +1002,26 @@ void ipath_kreceive(struct ipath_devdata l += rsize; if (l >= maxcnt) l = 0; + if (etype != RCVHQ_RCV_TYPE_EXPECTED) + updegr = 1; /* - * update for each packet, to help prevent overflows if we - * have lots of packets. + * update head regs on last packet, and every 16 packets. 
+ * Reduce bus traffic, while still trying to prevent + * rcvhdrq overflows, for when the queue is nearly full */ - (void)ipath_write_ureg(dd, ur_rcvhdrhead, - dd->ipath_rhdrhead_intr_off | l, 0); - if (etype != RCVHQ_RCV_TYPE_EXPECTED) - (void)ipath_write_ureg(dd, ur_rcvegrindexhead, - etail, 0); + if (l == hdrqtail || (i && !(i&0xf))) { + u64 lval; + if (l == hdrqtail) /* want interrupt only on last */ + lval = dd->ipath_rhdrhead_intr_off | l; + else + lval = l; + (void)ipath_write_ureg(dd, ur_rcvhdrhead, lval, 0); + if (updegr) { + (void)ipath_write_ureg(dd, ur_rcvegrindexhead, + etail, 0); + updegr = 0; + } + } } pkttot += i; diff -r 8bc865893a11 -r 1a4350d895c9 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -383,7 +383,7 @@ static unsigned handle_frequent_errors(s return supp_msgs; } -static void handle_errors(struct ipath_devdata *dd, ipath_err_t errs) +static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) { char msg[512]; u64 ignore_this_time = 0; @@ -480,7 +480,7 @@ static void handle_errors(struct ipath_d INFINIPATH_E_IBSTATUSCHANGED); } if (!errs) - return; + return 0; if (!noprint) /* @@ -604,9 +604,7 @@ static void handle_errors(struct ipath_d wake_up_interruptible(&ipath_sma_state_wait); } - if (chkerrpkts) - /* process possible error packets in hdrq */ - ipath_kreceive(dd); + return chkerrpkts; } /* this is separate to allow for better optimization of ipath_intr() */ @@ -765,10 +763,10 @@ irqreturn_t ipath_intr(int irq, void *da irqreturn_t ipath_intr(int irq, void *data, struct pt_regs *regs) { struct ipath_devdata *dd = data; - u32 istat; + u32 istat, chk0rcv = 0; ipath_err_t estat = 0; irqreturn_t ret; - u32 p0bits; + u32 p0bits, oldhead; static unsigned unexpected = 0; static const u32 port0rbits = (1U<ipath_port0head != - (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) { - u32 oldhead = 
dd->ipath_port0head; + oldhead = dd->ipath_port0head; + if (oldhead != (u32) le64_to_cpu(*dd->ipath_hdrqtailptr)) { if (dd->ipath_flags & IPATH_GPIO_INTR) { ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, (u64) (1 << 2)); @@ -830,6 +827,8 @@ irqreturn_t ipath_intr(int irq, void *da } istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); + p0bits = port0rbits; + if (unlikely(!istat)) { ipath_stats.sps_nullintr++; ret = IRQ_NONE; /* not our interrupt, or already handled */ @@ -867,10 +866,11 @@ irqreturn_t ipath_intr(int irq, void *da ipath_dev_err(dd, "Read of error status failed " "(all bits set); ignoring\n"); else - handle_errors(dd, estat); - } - - p0bits = port0rbits; + if (handle_errors(dd, estat)) + /* force calling ipath_kreceive() */ + chk0rcv = 1; + } + if (istat & INFINIPATH_I_GPIO) { /* * Packets are available in the port 0 rcv queue. @@ -892,8 +892,10 @@ irqreturn_t ipath_intr(int irq, void *da ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, (u64) (1 << 2)); p0bits |= INFINIPATH_I_GPIO; - } - } + chk0rcv = 1; + } + } + chk0rcv |= istat & p0bits; /* * clear the ones we will deal with on this round @@ -905,18 +907,16 @@ irqreturn_t ipath_intr(int irq, void *da ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); /* - * we check for both transition from empty to non-empty, and urgent - * packets (those with the interrupt bit set in the header), and - * if enabled, the GPIO bit 2 interrupt used for port0 on some - * HT-400 boards. - * Do this before checking for pio buffers available, since - * receives can overflow; piobuf waiters can afford a few - * extra cycles, since they were waiting anyway. - */ - if (istat & p0bits) { + * handle port0 receive before checking for pio buffers available, + * since receives can overflow; piobuf waiters can afford a few + * extra cycles, since they were waiting anyway, and user's waiting + * for receive are at the bottom. 
+ */ + if (chk0rcv) { ipath_kreceive(dd); istat &= ~port0rbits; } + if (istat & ((infinipath_i_rcvavail_mask << INFINIPATH_I_RCVAVAIL_SHIFT) | (infinipath_i_rcvurg_mask << From bos at pathscale.com Thu Jun 29 14:41:13 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:13 -0700 Subject: [openib-general] [PATCH 22 of 39] IB/ipath - fix lost interrupts on HT-400 In-Reply-To: Message-ID: <811021b6c112f8616d73.1151617273@eng-12.pathscale.com> Do an extra check to see if in-memory tail changed while processing packets, and if so, going back through the loop again (but only once per call to ipath_kreceive()). In practice, this seems to be enough to guarantee that if we crossed the clearing of an interrupt at start of ipath_intr with a scheduled tail register update, that we'll process the "extra" packet that lost the interrupt because we cleared it just as it was about to arrive. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 1a4350d895c9 -r 811021b6c112 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -870,7 +870,7 @@ void ipath_kreceive(struct ipath_devdata const u32 maxcnt = dd->ipath_rcvhdrcnt * rsize; /* words */ u32 etail = -1, l, hdrqtail; struct ips_message_header *hdr; - u32 eflags, i, etype, tlen, pkttot = 0, updegr=0; + u32 eflags, i, etype, tlen, pkttot = 0, updegr=0, reloop=0; static u64 totcalls; /* stats, may eventually remove */ char emsg[128]; @@ -885,9 +885,11 @@ void ipath_kreceive(struct ipath_devdata goto bail; l = dd->ipath_port0head; - if (l == (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) + hdrqtail = (u32) le64_to_cpu(*dd->ipath_hdrqtailptr); + if (l == hdrqtail) goto done; +reloop: /* read only once at start for performance */ hdrqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); @@ -1011,7 +1013,7 @@ void ipath_kreceive(struct ipath_devdata */ if (l 
== hdrqtail || (i && !(i&0xf))) { u64 lval; - if (l == hdrqtail) /* want interrupt only on last */ + if (l == hdrqtail) /* PE-800 interrupt only on last */ lval = dd->ipath_rhdrhead_intr_off | l; else lval = l; @@ -1021,6 +1023,23 @@ void ipath_kreceive(struct ipath_devdata etail, 0); updegr = 0; } + } + } + + if (!dd->ipath_rhdrhead_intr_off && !reloop) { + /* HT-400 workaround; we can have a race clearing chip + * interrupt with another interrupt about to be delivered, + * and can clear it before it is delivered on the GPIO + * workaround. By doing the extra check here for the + * in-memory tail register updating while we were doing + * earlier packets, we "almost" guarantee we have covered + * that case. + */ + u32 hqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); + if (hqtail != hdrqtail) { + hdrqtail = hqtail; + reloop = 1; /* loop 1 extra time at most */ + goto reloop; } } diff -r 1a4350d895c9 -r 811021b6c112 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -766,7 +766,7 @@ irqreturn_t ipath_intr(int irq, void *da u32 istat, chk0rcv = 0; ipath_err_t estat = 0; irqreturn_t ret; - u32 p0bits, oldhead; + u32 oldhead, curtail; static unsigned unexpected = 0; static const u32 port0rbits = (1U<ipath_port0head; - if (oldhead != (u32) le64_to_cpu(*dd->ipath_hdrqtailptr)) { + curtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); + if (oldhead != curtail) { if (dd->ipath_flags & IPATH_GPIO_INTR) { ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, (u64) (1 << 2)); - p0bits = port0rbits | INFINIPATH_I_GPIO; + istat = port0rbits | INFINIPATH_I_GPIO; } else - p0bits = port0rbits; - ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, p0bits); + istat = port0rbits; + ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); ipath_kreceive(dd); if (oldhead != dd->ipath_port0head) { ipath_stats.sps_fastrcvint++; @@ -827,7 
+828,6 @@ irqreturn_t ipath_intr(int irq, void *da } istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); - p0bits = port0rbits; if (unlikely(!istat)) { ipath_stats.sps_nullintr++; @@ -890,19 +890,19 @@ irqreturn_t ipath_intr(int irq, void *da else { /* Clear GPIO status bit 2 */ ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, - (u64) (1 << 2)); - p0bits |= INFINIPATH_I_GPIO; + (u64) (1 << 2)); chk0rcv = 1; } } - chk0rcv |= istat & p0bits; - - /* - * clear the ones we will deal with on this round - * We clear it early, mostly for receive interrupts, so we - * know the chip will have seen this by the time we process - * the queue, and will re-interrupt if necessary. The processor - * itself won't take the interrupt again until we return. + chk0rcv |= istat & port0rbits; + + /* + * Clear the interrupt bits we found set, unless they are receive + * related, in which case we already cleared them above, and don't + * want to clear them again, because we might lose an interrupt. + * Clear it early, so we "know" know the chip will have seen this by + * the time we process the queue, and will re-interrupt if necessary. + * The processor itself won't take the interrupt again until we return. */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); From bos at pathscale.com Thu Jun 29 14:41:22 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:22 -0700 Subject: [openib-general] [PATCH 31 of 39] IB/ipath - drop the "stats" sysfs attribute group In-Reply-To: Message-ID: <21378f21e091f6fc81fc.1151617282@eng-12.pathscale.com> This attribute group made it into the original driver, but should not have. 
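The cleanup also collapses the registration path: with two attribute groups, a failure of the second had to unwind the first; with one group, create and remove are each a single call. A hedged userspace sketch of that unwind pattern, where create_group()/remove_group() stand in for the sysfs calls:

```c
static int fail_second;			/* test knob: make the 2nd group fail */
static int registered[2];

static int create_group(int i)
{
	if (i == 1 && fail_second)
		return -1;
	registered[i] = 1;
	return 0;
}

static void remove_group(int i)
{
	registered[i] = 0;
}

static int driver_create_groups(void)
{
	int ret = create_group(0);
	if (ret)
		return ret;
	ret = create_group(1);
	if (ret)
		remove_group(0);	/* unwind on partial failure */
	return ret;
}
```

Dropping the stats group removes the second create_group() call and the unwind branch entirely.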
Signed-off-by: Bryan O'Sullivan diff -r 3ceb73f8bde0 -r 21378f21e091 drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 @@ -84,81 +84,6 @@ static ssize_t show_num_units(struct dev return scnprintf(buf, PAGE_SIZE, "%d\n", ipath_count_units(NULL, NULL, NULL)); } - -#define DRIVER_STAT(name, attr) \ - static ssize_t show_stat_##name(struct device_driver *dev, \ - char *buf) \ - { \ - return scnprintf( \ - buf, PAGE_SIZE, "%llu\n", \ - (unsigned long long) ipath_stats.sps_ ##attr); \ - } \ - static DRIVER_ATTR(name, S_IRUGO, show_stat_##name, NULL) - -DRIVER_STAT(intrs, ints); -DRIVER_STAT(err_intrs, errints); -DRIVER_STAT(errs, errs); -DRIVER_STAT(pkt_errs, pkterrs); -DRIVER_STAT(crc_errs, crcerrs); -DRIVER_STAT(hw_errs, hwerrs); -DRIVER_STAT(ib_link, iblink); -DRIVER_STAT(port0_pkts, port0pkts); -DRIVER_STAT(ether_spkts, ether_spkts); -DRIVER_STAT(ether_rpkts, ether_rpkts); -DRIVER_STAT(sma_spkts, sma_spkts); -DRIVER_STAT(sma_rpkts, sma_rpkts); -DRIVER_STAT(hdrq_full, hdrqfull); -DRIVER_STAT(etid_full, etidfull); -DRIVER_STAT(no_piobufs, nopiobufs); -DRIVER_STAT(ports, ports); -DRIVER_STAT(pkey0, pkeys[0]); -DRIVER_STAT(pkey1, pkeys[1]); -DRIVER_STAT(pkey2, pkeys[2]); -DRIVER_STAT(pkey3, pkeys[3]); - -DRIVER_STAT(nports, nports); -DRIVER_STAT(null_intr, nullintr); -DRIVER_STAT(max_pkts_call, maxpkts_call); -DRIVER_STAT(avg_pkts_call, avgpkts_call); -DRIVER_STAT(page_locks, pagelocks); -DRIVER_STAT(page_unlocks, pageunlocks); -DRIVER_STAT(krdrops, krdrops); - -static struct attribute *driver_stat_attributes[] = { - &driver_attr_intrs.attr, - &driver_attr_err_intrs.attr, - &driver_attr_errs.attr, - &driver_attr_pkt_errs.attr, - &driver_attr_crc_errs.attr, - &driver_attr_hw_errs.attr, - &driver_attr_ib_link.attr, - &driver_attr_port0_pkts.attr, - &driver_attr_ether_spkts.attr, - &driver_attr_ether_rpkts.attr, - 
&driver_attr_sma_spkts.attr, - &driver_attr_sma_rpkts.attr, - &driver_attr_hdrq_full.attr, - &driver_attr_etid_full.attr, - &driver_attr_no_piobufs.attr, - &driver_attr_ports.attr, - &driver_attr_pkey0.attr, - &driver_attr_pkey1.attr, - &driver_attr_pkey2.attr, - &driver_attr_pkey3.attr, - &driver_attr_nports.attr, - &driver_attr_null_intr.attr, - &driver_attr_max_pkts_call.attr, - &driver_attr_avg_pkts_call.attr, - &driver_attr_page_locks.attr, - &driver_attr_page_unlocks.attr, - &driver_attr_krdrops.attr, - NULL -}; - -static struct attribute_group driver_stat_attr_group = { - .name = "stats", - .attrs = driver_stat_attributes -}; static ssize_t show_status(struct device *dev, struct device_attribute *attr, @@ -716,20 +641,12 @@ int ipath_driver_create_group(struct dev int ret; ret = sysfs_create_group(&drv->kobj, &driver_attr_group); - if (ret) - goto bail; - - ret = sysfs_create_group(&drv->kobj, &driver_stat_attr_group); - if (ret) - sysfs_remove_group(&drv->kobj, &driver_attr_group); - -bail: + return ret; } void ipath_driver_remove_group(struct device_driver *drv) { - sysfs_remove_group(&drv->kobj, &driver_stat_attr_group); sysfs_remove_group(&drv->kobj, &driver_attr_group); } From bos at pathscale.com Thu Jun 29 14:41:23 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:23 -0700 Subject: [openib-general] [PATCH 32 of 39] IB/ipath - support more models of InfiniPath hardware In-Reply-To: Message-ID: <8fbb5d71823abafe963a.1151617283@eng-12.pathscale.com> We do a few more explicit checks for specific models, and now also support the old PathScale serial number style, or new QLogic style. This is backwards compatible with previous versions of software and hardware. That is, older software will see a plausible serial number and correct GUID when used with a new board, while newer software will correctly handle an older board. 
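The backward-compatible layout can be sketched as follows: the 13-character QLogic serial is rebuilt from a 4-byte prefix stored in a formerly unused flash field plus the original 12-byte serial field, so old software reading only if_serial still sees a plausible (truncated) number. Field sizes follow struct ipath_flash as shown in the diff below; the function and struct names here are illustrative.

```c
#include <string.h>

struct flash {
	char if_sprefix[4];		/* new: serial-number prefix */
	char if_serial[12];		/* original serial field */
};

static void get_serial(const struct flash *ifp, char out[16])
{
	char *snp = out;
	size_t len;

	memcpy(snp, ifp->if_sprefix, sizeof ifp->if_sprefix);
	snp[sizeof ifp->if_sprefix] = '\0';	/* prefix may be shorter */
	len = strlen(snp);
	snp += len;
	len = 16 - len;				/* space left in out[] */
	if (len > sizeof ifp->if_serial)
		len = sizeof ifp->if_serial;
	memcpy(snp, ifp->if_serial, len);
}
```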
Signed-off-by: Mike Albaugh Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 21378f21e091 -r 8fbb5d71823a drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 @@ -476,7 +476,7 @@ struct ipath_sma_pkt * Data layout in I2C flash (for GUID, etc.) * All fields are little-endian binary unless otherwise stated */ -#define IPATH_FLASH_VERSION 1 +#define IPATH_FLASH_VERSION 2 struct ipath_flash { /* flash layout version (IPATH_FLASH_VERSION) */ __u8 if_fversion; @@ -484,14 +484,14 @@ struct ipath_flash { __u8 if_csum; /* * valid length (in use, protected by if_csum), including - * if_fversion and if_sum themselves) + * if_fversion and if_csum themselves) */ __u8 if_length; /* the GUID, in network order */ __u8 if_guid[8]; /* number of GUIDs to use, starting from if_guid */ __u8 if_numguid; - /* the board serial number, in ASCII */ + /* the (last 10 characters of) board serial number, in ASCII */ char if_serial[12]; /* board mfg date (YYYYMMDD ASCII) */ char if_mfgdate[8]; @@ -503,8 +503,10 @@ struct ipath_flash { __u8 if_powerhour[2]; /* ASCII free-form comment field */ char if_comment[32]; - /* 78 bytes used, min flash size is 128 bytes */ - __u8 if_future[50]; + /* Backwards compatible prefix for longer QLogic Serial Numbers */ + char if_sprefix[4]; + /* 82 bytes used, min flash size is 128 bytes */ + __u8 if_future[46]; }; /* diff -r 21378f21e091 -r 8fbb5d71823a drivers/infiniband/hw/ipath/ipath_eeprom.c --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 @@ -602,8 +602,31 @@ void ipath_get_eeprom_info(struct ipath_ guid = *(__be64 *) ifp->if_guid; dd->ipath_guid = guid; dd->ipath_nguid = ifp->if_numguid; - memcpy(dd->ipath_serial, ifp->if_serial, - sizeof(ifp->if_serial)); + /* + * Things are 
slightly complicated by the desire to transparently + * support both the Pathscale 10-digit serial number and the QLogic + * 13-character version. + */ + if ((ifp->if_fversion > 1) && ifp->if_sprefix[0] + && ((u8 *)ifp->if_sprefix)[0] != 0xFF) { + /* This board has a Serial-prefix, which is stored + * elsewhere for backward-compatibility. + */ + char *snp = dd->ipath_serial; + int len; + memcpy(snp, ifp->if_sprefix, sizeof ifp->if_sprefix); + snp[sizeof ifp->if_sprefix] = '\0'; + len = strlen(snp); + snp += len; + len = (sizeof dd->ipath_serial) - len; + if (len > sizeof ifp->if_serial) { + len = sizeof ifp->if_serial; + } + memcpy(snp, ifp->if_serial, len); + } else + memcpy(dd->ipath_serial, ifp->if_serial, + sizeof ifp->if_serial); + ipath_cdbg(VERBOSE, "Initted GUID to %llx from eeprom\n", (unsigned long long) be64_to_cpu(dd->ipath_guid)); diff -r 21378f21e091 -r 8fbb5d71823a drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 @@ -491,8 +491,11 @@ struct ipath_devdata { u16 ipath_lid; /* list of pkeys programmed; 0 if not set */ u16 ipath_pkeys[4]; - /* ASCII serial number, from flash */ - u8 ipath_serial[12]; + /* + * ASCII serial number, from flash, large enough for original + * all digit strings, and longer QLogic serial number format + */ + u8 ipath_serial[16]; /* human readable board version */ u8 ipath_boardversion[80]; /* chip major rev, from ipath_revision */ diff -r 21378f21e091 -r 8fbb5d71823a drivers/infiniband/hw/ipath/ipath_pe800.c --- a/drivers/infiniband/hw/ipath/ipath_pe800.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_pe800.c Thu Jun 29 14:33:26 2006 -0700 @@ -533,7 +533,7 @@ static int ipath_pe_boardname(struct ipa if (n) snprintf(name, namelen, "%s", n); - if (dd->ipath_majrev != 4 || dd->ipath_minrev != 1) { + if (dd->ipath_majrev != 4 || !dd->ipath_minrev || 
dd->ipath_minrev>2) { ipath_dev_err(dd, "Unsupported PE-800 revision %u.%u!\n", dd->ipath_majrev, dd->ipath_minrev); ret = 1; From bos at pathscale.com Thu Jun 29 14:41:25 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:25 -0700 Subject: [openib-general] [PATCH 34 of 39] IB/ipath - fix a bug that results in addresses near 0 being written via DMA In-Reply-To: Message-ID: We can't tell for sure if any packets are in the infinipath receive buffer when we shut down a chip port. Normally this is taken care of by orderly shutdown, but when processes are terminated, or sending process has a bug, we can continue to receive packets. So rather than writing zero to the address registers for the closing port, we point it at a dummy memory. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r a7c1ad1e090b -r b6ebaf2dd2fd drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -1824,6 +1824,12 @@ static void cleanup_device(struct ipath_ dd->ipath_pioavailregs_phys); dd->ipath_pioavailregs_dma = NULL; } + if (dd->ipath_dummy_hdrq) { + dma_free_coherent(&dd->pcidev->dev, + dd->ipath_pd[0]->port_rcvhdrq_size, + dd->ipath_dummy_hdrq, dd->ipath_dummy_hdrq_phys); + dd->ipath_dummy_hdrq = NULL; + } if (dd->ipath_pageshadow) { struct page **tmpp = dd->ipath_pageshadow; diff -r a7c1ad1e090b -r b6ebaf2dd2fd drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 @@ -1486,41 +1486,50 @@ static int ipath_close(struct inode *in, } if (dd->ipath_kregbase) { - ipath_write_kreg_port( - dd, dd->ipath_kregs->kr_rcvhdrtailaddr, - port, 0ULL); - ipath_write_kreg_port( - dd, dd->ipath_kregs->kr_rcvhdraddr, - pd->port_port, 0); + int i; + /* atomically clear 
receive enable port. */ + clear_bit(INFINIPATH_R_PORTENABLE_SHIFT + port, + &dd->ipath_rcvctrl); + ipath_write_kreg( dd, dd->ipath_kregs->kr_rcvctrl, + dd->ipath_rcvctrl); + /* and read back from chip to be sure that nothing + * else is in flight when we do the rest */ + (void)ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); /* clean up the pkeys for this port user */ ipath_clean_part_key(pd, dd); - if (port < dd->ipath_cfgports) { - int i = dd->ipath_pbufsport * (port - 1); - ipath_disarm_piobufs(dd, i, dd->ipath_pbufsport); - - /* atomically clear receive enable port. */ - clear_bit(INFINIPATH_R_PORTENABLE_SHIFT + port, - &dd->ipath_rcvctrl); - ipath_write_kreg( - dd, - dd->ipath_kregs->kr_rcvctrl, - dd->ipath_rcvctrl); - - if (dd->ipath_pageshadow) - unlock_expected_tids(pd); - ipath_stats.sps_ports--; - ipath_cdbg(PROC, "%s[%u] closed port %u:%u\n", - pd->port_comm, pd->port_pid, - dd->ipath_unit, port); - } + + /* + * be paranoid, and never write 0's to these, just use an + * unused part of the port 0 tail page. Of course, + * rcvhdraddr points to a large chunk of memory, so this + * could still trash things, but at least it won't trash + * page 0, and by disabling the port, it should stop "soon", + * even if a packet or two is in already in flight after we + * disabled the port. 
+ */ + ipath_write_kreg_port(dd, + dd->ipath_kregs->kr_rcvhdrtailaddr, port, + dd->ipath_dummy_hdrq_phys); + ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdraddr, + pd->port_port, dd->ipath_dummy_hdrq_phys); + + i = dd->ipath_pbufsport * (port - 1); + ipath_disarm_piobufs(dd, i, dd->ipath_pbufsport); + + if (dd->ipath_pageshadow) + unlock_expected_tids(pd); + ipath_stats.sps_ports--; + ipath_cdbg(PROC, "%s[%u] closed port %u:%u\n", + pd->port_comm, pd->port_pid, + dd->ipath_unit, port); + + dd->ipath_f_clear_tids(dd, pd->port_port); } pd->port_cnt = 0; pd->port_pid = 0; - - dd->ipath_f_clear_tids(dd, pd->port_port); dd->ipath_pd[pd->port_port] = NULL; /* before releasing mutex */ mutex_unlock(&ipath_mutex); diff -r a7c1ad1e090b -r b6ebaf2dd2fd drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 @@ -647,6 +647,7 @@ int ipath_init_chip(struct ipath_devdata u32 val32, kpiobufs; u64 val; struct ipath_portdata *pd = NULL; /* keep gcc4 happy */ + gfp_t gfp_flags = GFP_USER | __GFP_COMP; ret = init_housekeeping(dd, &pd, reinit); if (ret) @@ -833,6 +834,22 @@ int ipath_init_chip(struct ipath_devdata "rcvhdrq and/or egr bufs\n"); else enable_chip(dd, pd, reinit); + + + if (!ret && !reinit) { + /* used when we close a port, for DMA already in flight at close */ + dd->ipath_dummy_hdrq = dma_alloc_coherent( + &dd->pcidev->dev, pd->port_rcvhdrq_size, + &dd->ipath_dummy_hdrq_phys, + gfp_flags); + if (!dd->ipath_dummy_hdrq ) { + dev_info(&dd->pcidev->dev, + "Couldn't allocate 0x%lx bytes for dummy hdrq\n", + pd->port_rcvhdrq_size); + /* fallback to just 0'ing */ + dd->ipath_dummy_hdrq_phys = 0UL; + } + } /* * cause retrigger of pending interrupts ignored during init, diff -r a7c1ad1e090b -r b6ebaf2dd2fd drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 
-0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 @@ -352,6 +352,8 @@ struct ipath_devdata { /* check for stale messages in rcv queue */ /* only allow one intr at a time. */ unsigned long ipath_rcv_pending; + void *ipath_dummy_hdrq; /* used after port close */ + dma_addr_t ipath_dummy_hdrq_phys; /* * Shadow copies of registers; size indicates read access size. From bos at pathscale.com Thu Jun 29 14:41:21 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:21 -0700 Subject: [openib-general] [PATCH 30 of 39] IB/ipath - purge sps_lid and sps_mlid arrays In-Reply-To: Message-ID: <3ceb73f8bde0e0335b54.1151617281@eng-12.pathscale.com> The two arrays only had space for 4 units. Also renamed ipath_set_sps_lid() to ipath_set_lid(); the sps prefix was a leftover. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 @@ -122,8 +122,7 @@ struct infinipath_stats { __u64 sps_ports; /* list of pkeys (other than default) accepted (0 means not set) */ __u16 sps_pkeys[4]; - /* lids for up to 4 infinipaths, indexed by infinipath # */ - __u16 sps_lid[4]; + __u16 sps_unused16[4]; /* available; maintaining compatible layout */ /* number of user ports per chip (not IB ports) */ __u32 sps_nports; /* not our interrupt, or already handled */ @@ -141,10 +140,8 @@ struct infinipath_stats { * packets if ipath not configured, sma/mad, etc.)
*/ __u64 sps_krdrops; - /* mlids for up to 4 infinipaths, indexed by infinipath # */ - __u16 sps_mlid[4]; /* pad for future growth */ - __u64 __sps_pad[45]; + __u64 __sps_pad[46]; }; /* diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 @@ -811,8 +811,6 @@ int ipath_init_chip(struct ipath_devdata /* clear any interrups up to this point (ints still not enabled) */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, -1LL); - ipath_stats.sps_lid[dd->ipath_unit] = dd->ipath_lid; - /* * Set up the port 0 (kernel) rcvhdr q and egr TIDs. If doing * re-init, the simplest way to handle this is to free diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 @@ -300,9 +300,8 @@ bail: EXPORT_SYMBOL_GPL(ipath_layer_set_mtu); -int ipath_set_sps_lid(struct ipath_devdata *dd, u32 arg, u8 lmc) -{ - ipath_stats.sps_lid[dd->ipath_unit] = arg; +int ipath_set_lid(struct ipath_devdata *dd, u32 arg, u8 lmc) +{ dd->ipath_lid = arg; dd->ipath_lmc = lmc; @@ -316,7 +315,7 @@ int ipath_set_sps_lid(struct ipath_devda return 0; } -EXPORT_SYMBOL_GPL(ipath_set_sps_lid); +EXPORT_SYMBOL_GPL(ipath_set_lid); int ipath_layer_set_guid(struct ipath_devdata *dd, __be64 guid) { @@ -632,9 +631,9 @@ int ipath_layer_open(struct ipath_devdat if (*dd->ipath_statusp & IPATH_STATUS_IB_READY) intval |= IPATH_LAYER_INT_IF_UP; - if (ipath_stats.sps_lid[dd->ipath_unit]) + if (dd->ipath_lid) intval |= IPATH_LAYER_INT_LID; - if (ipath_stats.sps_mlid[dd->ipath_unit]) + if (dd->ipath_mlid) intval |= IPATH_LAYER_INT_BCAST; /* * do this on open, in case low level is already up and diff -r 1bef8244297a -r 3ceb73f8bde0 
drivers/infiniband/hw/ipath/ipath_layer.h --- a/drivers/infiniband/hw/ipath/ipath_layer.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.h Thu Jun 29 14:33:26 2006 -0700 @@ -129,7 +129,7 @@ u32 ipath_layer_get_cr_errpkey(struct ip u32 ipath_layer_get_cr_errpkey(struct ipath_devdata *dd); int ipath_layer_set_linkstate(struct ipath_devdata *dd, u8 state); int ipath_layer_set_mtu(struct ipath_devdata *, u16); -int ipath_set_sps_lid(struct ipath_devdata *, u32, u8); +int ipath_set_lid(struct ipath_devdata *, u32, u8); int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr); int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_mad.c --- a/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 @@ -308,7 +308,7 @@ static int recv_subn_set_portinfo(struct /* Must be a valid unicast LID address. 
*/ if (lid == 0 || lid >= IPS_MULTICAST_LID_BASE) goto err; - ipath_set_sps_lid(dev->dd, lid, pip->mkeyprot_resv_lmc & 7); + ipath_set_lid(dev->dd, lid, pip->mkeyprot_resv_lmc & 7); event.event = IB_EVENT_LID_CHANGE; ib_dispatch_event(&event); } diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 @@ -115,11 +115,6 @@ DRIVER_STAT(pkey1, pkeys[1]); DRIVER_STAT(pkey1, pkeys[1]); DRIVER_STAT(pkey2, pkeys[2]); DRIVER_STAT(pkey3, pkeys[3]); -/* XXX fix the following when dynamic table of devices used */ -DRIVER_STAT(lid0, lid[0]); -DRIVER_STAT(lid1, lid[1]); -DRIVER_STAT(lid2, lid[2]); -DRIVER_STAT(lid3, lid[3]); DRIVER_STAT(nports, nports); DRIVER_STAT(null_intr, nullintr); @@ -128,11 +123,6 @@ DRIVER_STAT(page_locks, pagelocks); DRIVER_STAT(page_locks, pagelocks); DRIVER_STAT(page_unlocks, pageunlocks); DRIVER_STAT(krdrops, krdrops); -/* XXX fix the following when dynamic table of devices used */ -DRIVER_STAT(mlid0, mlid[0]); -DRIVER_STAT(mlid1, mlid[1]); -DRIVER_STAT(mlid2, mlid[2]); -DRIVER_STAT(mlid3, mlid[3]); static struct attribute *driver_stat_attributes[] = { &driver_attr_intrs.attr, @@ -155,10 +145,6 @@ static struct attribute *driver_stat_att &driver_attr_pkey1.attr, &driver_attr_pkey2.attr, &driver_attr_pkey3.attr, - &driver_attr_lid0.attr, - &driver_attr_lid1.attr, - &driver_attr_lid2.attr, - &driver_attr_lid3.attr, &driver_attr_nports.attr, &driver_attr_null_intr.attr, &driver_attr_max_pkts_call.attr, @@ -166,10 +152,6 @@ static struct attribute *driver_stat_att &driver_attr_page_locks.attr, &driver_attr_page_unlocks.attr, &driver_attr_krdrops.attr, - &driver_attr_mlid0.attr, - &driver_attr_mlid1.attr, - &driver_attr_mlid2.attr, - &driver_attr_mlid3.attr, NULL }; @@ -273,7 +255,7 @@ static ssize_t store_lid(struct device * size_t count) { struct ipath_devdata *dd = 
dev_get_drvdata(dev); - u16 lid; + u16 lid = 0; int ret; ret = ipath_parse_ushort(buf, &lid); @@ -285,11 +267,11 @@ static ssize_t store_lid(struct device * goto invalid; } - ipath_set_sps_lid(dd, lid, 0); + ipath_set_lid(dd, lid, 0); goto bail; invalid: - ipath_dev_err(dd, "attempt to set invalid LID\n"); + ipath_dev_err(dd, "attempt to set invalid LID 0x%x\n", lid); bail: return ret; } @@ -320,7 +302,6 @@ static ssize_t store_mlid(struct device unit = dd->ipath_unit; dd->ipath_mlid = mlid; - ipath_stats.sps_mlid[unit] = mlid; ipath_layer_intr(dd, IPATH_LAYER_INT_BCAST); goto bail; From bos at pathscale.com Thu Jun 29 14:41:11 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:11 -0700 Subject: [openib-general] [PATCH 20 of 39] IB/ipath - reduce overhead on receive interrupts In-Reply-To: Message-ID: <8bc865893a11e5c8772c.1151617271@eng-12.pathscale.com> Also count the number of interrupts where the fast path works (fastrcvint). On any interrupt where the port0 head and tail registers are not equal, just call the ipath_kreceive code without reading the interrupt status, saving the roughly 0.25 usec processor stall spent waiting for the read to return. If any other interrupt bits are set, or head==tail, take the normal path, but that has been reordered to handle receive ahead of pioavail. Also no longer call ipath_kreceive() from ipath_qcheck(), because that just seems to make things worse and isn't really buying us anything these days. Also no longer loop in ipath_kreceive(); better not to hold things off too long (I saw many cases where we would loop 4-8 times and handle thousands of packets (up to 3500) in a single call).
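The fast-path idea described above can be sketched in isolation. The following is a hypothetical host-side model, not the driver's real code: fake_chip, fast_rcv_intr and the counters are invented for illustration. The point it preserves is the ordering: if the port-0 head and the DMA'd tail disagree, ack the receive bits and process packets without ever reading the interrupt status register, and only pay for the intstat read when the head fails to advance.

```c
#include <stdint.h>

/* Illustrative stand-in for the device state; all names are made up. */
struct fake_chip {
	uint32_t port0_head;      /* software head index */
	uint32_t port0_tail;      /* tail the chip DMA'd into memory */
	uint32_t intclear_writes; /* cheap writes to the intclear register */
	uint32_t intstat_reads;   /* expensive PIO reads we want to avoid */
};

/* pretend to drain the port-0 receive queue */
static void fake_kreceive(struct fake_chip *c)
{
	c->port0_head = c->port0_tail;
}

/* returns 1 when the fast path handled the interrupt */
static int fast_rcv_intr(struct fake_chip *c)
{
	if (c->port0_head != c->port0_tail) {
		uint32_t oldhead = c->port0_head;

		/* blindly ack the port-0 receive bits first, so packets
		 * that arrive while we process re-trigger the interrupt */
		c->intclear_writes++;
		fake_kreceive(c);
		if (oldhead != c->port0_head)
			return 1; /* fast path: intstat never read */
	}
	/* slow path: head == tail, or nothing was processed */
	c->intstat_reads++;
	return 0;
}
```

In the real driver the tail comes from le64_to_cpu(*dd->ipath_hdrqtailptr) and the ack is a write to kr_intclear; the sketch keeps only the comparison, the ack-before-process ordering, and the fallback.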
Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 1e8837473193 -r 8bc865893a11 drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:25 2006 -0700 @@ -97,8 +97,8 @@ struct infinipath_stats { __u64 sps_hwerrs; /* number of times IB link changed state unexpectedly */ __u64 sps_iblink; - /* no longer used; left for compatibility */ - __u64 sps_unused3; + /* kernel receive interrupts that didn't read intstat */ + __u64 sps_fastrcvint; /* number of kernel (port0) packets received */ __u64 sps_port0pkts; /* number of "ethernet" packets sent by driver */ diff -r 1e8837473193 -r 8bc865893a11 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -888,12 +888,7 @@ void ipath_kreceive(struct ipath_devdata (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) goto done; -gotmore: - /* - * read only once at start. If in flood situation, this helps - * performance slightly. 
If more arrive while we are processing, - * we'll come back here and do them - */ + /* read only once at start for performance */ hdrqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); for (i = 0, l = dd->ipath_port0head; l != hdrqtail; i++) { @@ -1022,10 +1017,6 @@ gotmore: dd->ipath_port0head = l; - if (hdrqtail != (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) - /* more arrived while we handled first batch */ - goto gotmore; - if (pkttot > ipath_stats.sps_maxpkts_call) ipath_stats.sps_maxpkts_call = pkttot; ipath_stats.sps_port0pkts += pkttot; diff -r 1e8837473193 -r 8bc865893a11 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 @@ -539,10 +539,10 @@ static void handle_errors(struct ipath_d continue; if (hd == (tl + 1) || (!hd && tl == dd->ipath_hdrqlast)) { + if (i == 0) + chkerrpkts = 1; dd->ipath_lastrcvhdrqtails[i] = tl; pd->port_hdrqfull++; - if (i == 0) - chkerrpkts = 1; } } } @@ -724,7 +724,12 @@ set: dd->ipath_sendctrl); } -static void handle_rcv(struct ipath_devdata *dd, u32 istat) +/* + * Handle receive interrupts for user ports; this means a user + * process was waiting for a packet to arrive, and didn't want + * to poll + */ +static void handle_urcv(struct ipath_devdata *dd, u32 istat) { u64 portr; int i; @@ -734,22 +739,17 @@ static void handle_rcv(struct ipath_devd infinipath_i_rcvavail_mask) | ((istat >> INFINIPATH_I_RCVURG_SHIFT) & infinipath_i_rcvurg_mask); - for (i = 0; i < dd->ipath_cfgports; i++) { + for (i = 1; i < dd->ipath_cfgports; i++) { struct ipath_portdata *pd = dd->ipath_pd[i]; - if (portr & (1 << i) && pd && - pd->port_cnt) { - if (i == 0) - ipath_kreceive(dd); - else if (test_bit(IPATH_PORT_WAITING_RCV, - &pd->port_flag)) { - int rcbit; - clear_bit(IPATH_PORT_WAITING_RCV, - &pd->port_flag); - rcbit = i + INFINIPATH_R_INTRAVAIL_SHIFT; - clear_bit(1UL << rcbit, &dd->ipath_rcvctrl); - 
wake_up_interruptible(&pd->port_wait); - rcvdint = 1; - } + if (portr & (1 << i) && pd && pd->port_cnt && + test_bit(IPATH_PORT_WAITING_RCV, &pd->port_flag)) { + int rcbit; + clear_bit(IPATH_PORT_WAITING_RCV, + &pd->port_flag); + rcbit = i + INFINIPATH_R_INTRAVAIL_SHIFT; + clear_bit(1UL << rcbit, &dd->ipath_rcvctrl); + wake_up_interruptible(&pd->port_wait); + rcvdint = 1; } } if (rcvdint) { @@ -767,14 +767,17 @@ irqreturn_t ipath_intr(int irq, void *da struct ipath_devdata *dd = data; u32 istat; ipath_err_t estat = 0; + irqreturn_t ret; + u32 p0bits; static unsigned unexpected = 0; - irqreturn_t ret; - - if(!(dd->ipath_flags & IPATH_PRESENT)) { - /* this is mostly so we don't try to touch the chip while - * it is being reset */ - /* - * This return value is perhaps odd, but we do not want the + static const u32 port0rbits = (1U<ipath_flags & IPATH_PRESENT)) { + /* + * This return value is not great, but we do not want the * interrupt core code to remove our interrupt handler * because we don't appear to be handling an interrupt * during a chip reset. @@ -782,6 +785,50 @@ irqreturn_t ipath_intr(int irq, void *da return IRQ_HANDLED; } + /* + * this needs to be flags&initted, not statusp, so we keep + * taking interrupts even after link goes down, etc. + * Also, we *must* clear the interrupt at some point, or we won't + * take it again, which can be real bad for errors, etc... + */ + + if (!(dd->ipath_flags & IPATH_INITTED)) { + ipath_bad_intr(dd, &unexpected); + ret = IRQ_NONE; + goto bail; + } + + /* + * We try to avoid reading the interrupt status register, since + * that's a PIO read, and stalls the processor for up to about + * ~0.25 usec. The idea is that if we processed a port0 packet, + * we blindly clear the port 0 receive interrupt bits, and nothing + * else, then return. If other interrupts are pending, the chip + * will re-interrupt us as soon as we write the intclear register. 
+ * We then won't process any more kernel packets (if not the 2nd + * time, then the 3rd or 4th) and we'll then handle the other + * interrupts. We clear the interrupts first so that we don't + * lose intr for later packets that arrive while we are processing. + */ + if (dd->ipath_port0head != + (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) { + u32 oldhead = dd->ipath_port0head; + if (dd->ipath_flags & IPATH_GPIO_INTR) { + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, + (u64) (1 << 2)); + p0bits = port0rbits | INFINIPATH_I_GPIO; + } + else + p0bits = port0rbits; + ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, p0bits); + ipath_kreceive(dd); + if (oldhead != dd->ipath_port0head) { + ipath_stats.sps_fastrcvint++; + goto done; + } + istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); + } + istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); if (unlikely(!istat)) { ipath_stats.sps_nullintr++; @@ -795,31 +842,17 @@ irqreturn_t ipath_intr(int irq, void *da goto bail; } - ipath_stats.sps_ints++; - - /* - * this needs to be flags&initted, not statusp, so we keep - * taking interrupts even after link goes down, etc. - * Also, we *must* clear the interrupt at some point, or we won't - * take it again, which can be real bad for errors, etc... 
- */ - - if (!(dd->ipath_flags & IPATH_INITTED)) { - ipath_bad_intr(dd, &unexpected); - ret = IRQ_NONE; - goto bail; - } if (unexpected) unexpected = 0; - ipath_cdbg(VERBOSE, "intr stat=0x%x\n", istat); - - if (istat & ~infinipath_i_bitsextant) + if (unlikely(istat & ~infinipath_i_bitsextant)) ipath_dev_err(dd, "interrupt with unknown interrupts %x set\n", istat & (u32) ~ infinipath_i_bitsextant); - - if (istat & INFINIPATH_I_ERROR) { + else + ipath_cdbg(VERBOSE, "intr stat=0x%x\n", istat); + + if (unlikely(istat & INFINIPATH_I_ERROR)) { ipath_stats.sps_errints++; estat = ipath_read_kreg64(dd, dd->ipath_kregs->kr_errorstatus); @@ -837,7 +870,14 @@ irqreturn_t ipath_intr(int irq, void *da handle_errors(dd, estat); } + p0bits = port0rbits; if (istat & INFINIPATH_I_GPIO) { + /* + * Packets are available in the port 0 rcv queue. + * Eventually this needs to be generalized to check + * IPATH_GPIO_INTR, and the specific GPIO bit, if + * GPIO interrupts are used for anything else. + */ if (unlikely(!(dd->ipath_flags & IPATH_GPIO_INTR))) { u32 gpiostatus; gpiostatus = ipath_read_kreg32( @@ -851,14 +891,7 @@ irqreturn_t ipath_intr(int irq, void *da /* Clear GPIO status bit 2 */ ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, (u64) (1 << 2)); - - /* - * Packets are available in the port 0 rcv queue. - * Eventually this needs to be generalized to check - * IPATH_GPIO_INTR, and the specific GPIO bit, if - * GPIO interrupts are used for anything else. - */ - ipath_kreceive(dd); + p0bits |= INFINIPATH_I_GPIO; } } @@ -871,6 +904,25 @@ irqreturn_t ipath_intr(int irq, void *da */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); + /* + * we check for both transition from empty to non-empty, and urgent + * packets (those with the interrupt bit set in the header), and + * if enabled, the GPIO bit 2 interrupt used for port0 on some + * HT-400 boards. 
+ * Do this before checking for pio buffers available, since + * receives can overflow; piobuf waiters can afford a few + * extra cycles, since they were waiting anyway. + */ + if (istat & p0bits) { + ipath_kreceive(dd); + istat &= ~port0rbits; + } + if (istat & ((infinipath_i_rcvavail_mask << + INFINIPATH_I_RCVAVAIL_SHIFT) + | (infinipath_i_rcvurg_mask << + INFINIPATH_I_RCVURG_SHIFT))) + handle_urcv(dd, istat); + if (istat & INFINIPATH_I_SPIOBUFAVAIL) { clear_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl); ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, @@ -882,17 +934,7 @@ irqreturn_t ipath_intr(int irq, void *da handle_layer_pioavail(dd); } - /* - * we check for both transition from empty to non-empty, and urgent - * packets (those with the interrupt bit set in the header) - */ - - if (istat & ((infinipath_i_rcvavail_mask << - INFINIPATH_I_RCVAVAIL_SHIFT) - | (infinipath_i_rcvurg_mask << - INFINIPATH_I_RCVURG_SHIFT))) - handle_rcv(dd, istat); - +done: ret = IRQ_HANDLED; bail: diff -r 1e8837473193 -r 8bc865893a11 drivers/infiniband/hw/ipath/ipath_stats.c --- a/drivers/infiniband/hw/ipath/ipath_stats.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_stats.c Thu Jun 29 14:33:25 2006 -0700 @@ -186,7 +186,6 @@ static void ipath_qcheck(struct ipath_de dd->ipath_port0head, (unsigned long long) ipath_stats.sps_port0pkts); - ipath_kreceive(dd); } dd->ipath_lastport0rcv_cnt = ipath_stats.sps_port0pkts; } From bos at pathscale.com Thu Jun 29 14:41:10 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:10 -0700 Subject: [openib-general] [PATCH 19 of 39] IB/ipath - memory management cleanups In-Reply-To: Message-ID: <1e88374731937c2d4379.1151617270@eng-12.pathscale.com> Made the in-memory rcvhdrq tail update live in dma_alloc'ed memory, not random user or special kernel memory (needed for ppc; also "just the right thing to do").
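The dma_alloc'ed tail update reduces to a simple pattern: allocate coherent memory once, let the chip DMA the tail index into it, and have the hot path poll plain memory instead of doing a register read. Below is a hedged, host-only simulation; fake_port, chip_dma_tail and port_pending are invented names, and calloc() merely stands in for the dma_alloc_coherent() call the driver actually makes.

```c
#include <stdint.h>
#include <stdlib.h>

/* Host-side simulation of a DMA-written tail pointer; names invented. */
struct fake_port {
	volatile uint64_t *tail_kvaddr; /* stand-in for port_rcvhdrtail_kvaddr */
	uint32_t head;                  /* software head index */
};

static int port_init(struct fake_port *p)
{
	/* stand-in for dma_alloc_coherent(&dev, PAGE_SIZE, &phys, GFP_KERNEL) */
	p->tail_kvaddr = calloc(1, sizeof(uint64_t));
	p->head = 0;
	return p->tail_kvaddr ? 0 : -1;
}

/* "chip" side: the DMA write of the new tail index */
static void chip_dma_tail(struct fake_port *p, uint64_t tail)
{
	*p->tail_kvaddr = tail;
}

/* hot path: how many entries are pending, with no register read at all */
static uint32_t port_pending(const struct fake_port *p)
{
	return (uint32_t)*p->tail_kvaddr - p->head;
}
```

In the real driver the chip writes the value little-endian, which is why reads of port_rcvhdrtail_kvaddr go through le64_to_cpu(); the simulation skips that detail.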
Some cleanups to make unexpected link transitions less likely to produce complaints about packet errors, and also to not leave SMA packets stuck and unable to go out. A few other random debug and comment cleanups. Always init rcvhdrq head/tail registers to 0, to avoid race conditions (should have been that way some time ago). Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:25 2006 -0700 @@ -311,6 +311,9 @@ struct ipath_base_info { __u32 spi_rcv_egrchunksize; /* total size of mmap to cover full rcvegrbuffers */ __u32 spi_rcv_egrbuftotlen; + __u32 spi_filler_for_align; + /* address of readonly memory copy of the rcvhdrq tail register. */ + __u64 spi_rcvhdr_tailaddr; } __attribute__ ((aligned(8))); @@ -380,13 +383,7 @@ struct ipath_user_info { */ __u32 spu_rcvhdrsize; - /* - * cache line aligned (64 byte) user address to - * which the rcvhdrtail register will be written by infinipath - * whenever it changes, so that no chip registers are read in - * the performance path. - */ - __u64 spu_rcvhdraddr; + __u64 spu_unused; /* kept for compatible layout */ /* * address of struct base_info to write to diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -131,14 +131,6 @@ static struct pci_driver ipath_driver = .id_table = ipath_pci_tbl, }; -/* - * This is where port 0's rcvhdrtail register is written back; we also - * want nothing else sharing the cache line, so make it a cache line - * in size. Used for all units. 
- */ -volatile __le64 *ipath_port0_rcvhdrtail; -dma_addr_t ipath_port0_rcvhdrtail_dma; -static int port0_rcvhdrtail_refs; static inline void read_bars(struct ipath_devdata *dd, struct pci_dev *dev, u32 *bar0, u32 *bar1) @@ -268,47 +260,6 @@ int ipath_count_units(int *npresentp, in return nunits; } -static int init_port0_rcvhdrtail(struct pci_dev *pdev) -{ - int ret; - - mutex_lock(&ipath_mutex); - - if (!ipath_port0_rcvhdrtail) { - ipath_port0_rcvhdrtail = - dma_alloc_coherent(&pdev->dev, - IPATH_PORT0_RCVHDRTAIL_SIZE, - &ipath_port0_rcvhdrtail_dma, - GFP_KERNEL); - - if (!ipath_port0_rcvhdrtail) { - ret = -ENOMEM; - goto bail; - } - } - port0_rcvhdrtail_refs++; - ret = 0; - -bail: - mutex_unlock(&ipath_mutex); - - return ret; -} - -static void cleanup_port0_rcvhdrtail(struct pci_dev *pdev) -{ - mutex_lock(&ipath_mutex); - - if (!--port0_rcvhdrtail_refs) { - dma_free_coherent(&pdev->dev, IPATH_PORT0_RCVHDRTAIL_SIZE, - (void *) ipath_port0_rcvhdrtail, - ipath_port0_rcvhdrtail_dma); - ipath_port0_rcvhdrtail = NULL; - } - - mutex_unlock(&ipath_mutex); -} - /* * These next two routines are placeholders in case we don't have per-arch * code for controlling write combining. 
If explicit control of write @@ -333,20 +284,12 @@ static int __devinit ipath_init_one(stru u32 bar0 = 0, bar1 = 0; u8 rev; - ret = init_port0_rcvhdrtail(pdev); - if (ret < 0) { - printk(KERN_ERR IPATH_DRV_NAME - ": Could not allocate port0_rcvhdrtail: error %d\n", - -ret); - goto bail; - } - dd = ipath_alloc_devdata(pdev); if (IS_ERR(dd)) { ret = PTR_ERR(dd); printk(KERN_ERR IPATH_DRV_NAME ": Could not allocate devdata: error %d\n", -ret); - goto bail_rcvhdrtail; + goto bail; } ipath_cdbg(VERBOSE, "initializing unit #%u\n", dd->ipath_unit); @@ -574,9 +517,6 @@ bail_devdata: bail_devdata: ipath_free_devdata(pdev, dd); -bail_rcvhdrtail: - cleanup_port0_rcvhdrtail(pdev); - bail: return ret; } @@ -608,7 +548,6 @@ static void __devexit ipath_remove_one(s pci_disable_device(pdev); ipath_free_devdata(pdev, dd); - cleanup_port0_rcvhdrtail(pdev); } /* general driver use */ @@ -1383,26 +1322,20 @@ bail: * @dd: the infinipath device * @pd: the port data * - * this *must* be physically contiguous memory, and for now, - * that limits it to what kmalloc can do. + * this must be contiguous memory (from an i/o perspective), and must be + * DMA'able (which means for some systems, it will go through an IOMMU, + * or be forced into a low address range). */ int ipath_create_rcvhdrq(struct ipath_devdata *dd, struct ipath_portdata *pd) { - int ret = 0, amt; - - amt = ALIGN(dd->ipath_rcvhdrcnt * dd->ipath_rcvhdrentsize * - sizeof(u32), PAGE_SIZE); + int ret = 0; + if (!pd->port_rcvhdrq) { - /* - * not using REPEAT isn't viable; at 128KB, we can easily - * fail this. The problem with REPEAT is we can block here - * "forever". There isn't an inbetween, unfortunately. We - * could reduce the risk by never freeing the rcvhdrq except - * at unload, but even then, the first time a port is used, - * we could delay for some time... 
- */ + dma_addr_t phys_hdrqtail; gfp_t gfp_flags = GFP_USER | __GFP_COMP; + int amt = ALIGN(dd->ipath_rcvhdrcnt * dd->ipath_rcvhdrentsize * + sizeof(u32), PAGE_SIZE); pd->port_rcvhdrq = dma_alloc_coherent( &dd->pcidev->dev, amt, &pd->port_rcvhdrq_phys, @@ -1415,6 +1348,16 @@ int ipath_create_rcvhdrq(struct ipath_de ret = -ENOMEM; goto bail; } + pd->port_rcvhdrtail_kvaddr = dma_alloc_coherent( + &dd->pcidev->dev, PAGE_SIZE, &phys_hdrqtail, GFP_KERNEL); + if (!pd->port_rcvhdrtail_kvaddr) { + ipath_dev_err(dd, "attempt to allocate 1 page " + "for port %u rcvhdrqtailaddr failed\n", + pd->port_port); + ret = -ENOMEM; + goto bail; + } + pd->port_rcvhdrqtailaddr_phys = phys_hdrqtail; pd->port_rcvhdrq_size = amt; @@ -1424,20 +1367,28 @@ int ipath_create_rcvhdrq(struct ipath_de (unsigned long) pd->port_rcvhdrq_phys, (unsigned long) pd->port_rcvhdrq_size, pd->port_port); - } else { - /* - * clear for security, sanity, and/or debugging, each - * time we reuse - */ - memset(pd->port_rcvhdrq, 0, amt); - } + + ipath_cdbg(VERBOSE, "port %d hdrtailaddr, %llx physical\n", + pd->port_port, + (unsigned long long) phys_hdrqtail); + } + else + ipath_cdbg(VERBOSE, "reuse port %d rcvhdrq @%p %llx phys; " + "hdrtailaddr@%p %llx physical\n", + pd->port_port, pd->port_rcvhdrq, + pd->port_rcvhdrq_phys, pd->port_rcvhdrtail_kvaddr, + (unsigned long long)pd->port_rcvhdrqtailaddr_phys); + + /* clear for security and sanity on each use */ + memset(pd->port_rcvhdrq, 0, pd->port_rcvhdrq_size); + memset((void *)pd->port_rcvhdrtail_kvaddr, 0, PAGE_SIZE); /* * tell chip each time we init it, even if we are re-using previous - * memory (we zero it at process close) - */ - ipath_cdbg(VERBOSE, "writing port %d rcvhdraddr as %lx\n", - pd->port_port, (unsigned long) pd->port_rcvhdrq_phys); + * memory (we zero the register at process close) + */ + ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdrtailaddr, + pd->port_port, pd->port_rcvhdrqtailaddr_phys); ipath_write_kreg_port(dd, 
dd->ipath_kregs->kr_rcvhdraddr, pd->port_port, pd->port_rcvhdrq_phys); @@ -1525,15 +1476,27 @@ void ipath_set_ib_lstate(struct ipath_de [INFINIPATH_IBCC_LINKCMD_ARMED] = "ARMED", [INFINIPATH_IBCC_LINKCMD_ACTIVE] = "ACTIVE" }; + int linkcmd = (which >> INFINIPATH_IBCC_LINKCMD_SHIFT) & + INFINIPATH_IBCC_LINKCMD_MASK; + ipath_cdbg(SMA, "Trying to move unit %u to %s, current ltstate " "is %s\n", dd->ipath_unit, - what[(which >> INFINIPATH_IBCC_LINKCMD_SHIFT) & - INFINIPATH_IBCC_LINKCMD_MASK], + what[linkcmd], ipath_ibcstatus_str[ (ipath_read_kreg64 (dd, dd->ipath_kregs->kr_ibcstatus) >> INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) & INFINIPATH_IBCS_LINKTRAININGSTATE_MASK]); + /* flush all queued sends when going to DOWN or INIT, to be sure that + * they don't block SMA and other MAD packets */ + if (!linkcmd || linkcmd == INFINIPATH_IBCC_LINKCMD_INIT) { + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + INFINIPATH_S_ABORT); + ipath_disarm_piobufs(dd, dd->ipath_lastport_piobuf, + (unsigned)(dd->ipath_piobcnt2k + + dd->ipath_piobcnt4k) - + dd->ipath_lastport_piobuf); + } ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, dd->ipath_ibcctrl | which); @@ -1681,60 +1644,54 @@ void ipath_shutdown_device(struct ipath_ /** * ipath_free_pddata - free a port's allocated data * @dd: the infinipath device - * @port: the port - * @freehdrq: free the port data structure if true - * - * when closing, free up any allocated data for a port, if the - * reference count goes to zero - * Note: this also optionally frees the portdata itself! - * Any changes here have to be matched up with the reinit case - * of ipath_init_chip(), which calls this routine on reinit after reset. 
- */ -void ipath_free_pddata(struct ipath_devdata *dd, u32 port, int freehdrq) -{ - struct ipath_portdata *pd = dd->ipath_pd[port]; - + * @pd: the portdata structure + * + * free up any allocated data for a port + * This should not touch anything that would affect a simultaneous + * re-allocation of port data, because it is called after ipath_mutex + * is released (and can be called from reinit as well). + * It should never change any chip state, or global driver state. + * (The only exception to global state is freeing the port0 port0_skbs.) + */ +void ipath_free_pddata(struct ipath_devdata *dd, struct ipath_portdata *pd) +{ if (!pd) return; - if (freehdrq) - /* - * only clear and free portdata if we are going to also - * release the hdrq, otherwise we leak the hdrq on each - * open/close cycle - */ - dd->ipath_pd[port] = NULL; - if (freehdrq && pd->port_rcvhdrq) { + + if (pd->port_rcvhdrq) { ipath_cdbg(VERBOSE, "free closed port %d rcvhdrq @ %p " "(size=%lu)\n", pd->port_port, pd->port_rcvhdrq, (unsigned long) pd->port_rcvhdrq_size); dma_free_coherent(&dd->pcidev->dev, pd->port_rcvhdrq_size, pd->port_rcvhdrq, pd->port_rcvhdrq_phys); pd->port_rcvhdrq = NULL; - } - if (port && pd->port_rcvegrbuf) { - /* always free this */ - if (pd->port_rcvegrbuf) { - unsigned e; - - for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) { - void *base = pd->port_rcvegrbuf[e]; - size_t size = pd->port_rcvegrbuf_size; - - ipath_cdbg(VERBOSE, "egrbuf free(%p, %lu), " - "chunk %u/%u\n", base, - (unsigned long) size, - e, pd->port_rcvegrbuf_chunks); - dma_free_coherent( - &dd->pcidev->dev, size, base, - pd->port_rcvegrbuf_phys[e]); - } - vfree(pd->port_rcvegrbuf); - pd->port_rcvegrbuf = NULL; - vfree(pd->port_rcvegrbuf_phys); - pd->port_rcvegrbuf_phys = NULL; - } + if (pd->port_rcvhdrtail_kvaddr) { + dma_free_coherent(&dd->pcidev->dev, PAGE_SIZE, + (void *)pd->port_rcvhdrtail_kvaddr, + pd->port_rcvhdrqtailaddr_phys); + pd->port_rcvhdrtail_kvaddr = NULL; + } + } + if (pd->port_port && 
pd->port_rcvegrbuf) { + unsigned e; + + for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) { + void *base = pd->port_rcvegrbuf[e]; + size_t size = pd->port_rcvegrbuf_size; + + ipath_cdbg(VERBOSE, "egrbuf free(%p, %lu), " + "chunk %u/%u\n", base, + (unsigned long) size, + e, pd->port_rcvegrbuf_chunks); + dma_free_coherent(&dd->pcidev->dev, size, + base, pd->port_rcvegrbuf_phys[e]); + } + vfree(pd->port_rcvegrbuf); + pd->port_rcvegrbuf = NULL; + vfree(pd->port_rcvegrbuf_phys); + pd->port_rcvegrbuf_phys = NULL; pd->port_rcvegrbuf_chunks = 0; - } else if (port == 0 && dd->ipath_port0_skbs) { + } else if (pd->port_port == 0 && dd->ipath_port0_skbs) { unsigned e; struct sk_buff **skbs = dd->ipath_port0_skbs; @@ -1746,10 +1703,8 @@ void ipath_free_pddata(struct ipath_devd dev_kfree_skb(skbs[e]); vfree(skbs); } - if (freehdrq) { - kfree(pd->port_tid_pg_list); - kfree(pd); - } + kfree(pd->port_tid_pg_list); + kfree(pd); } static int __init infinipath_init(void) @@ -1874,10 +1829,14 @@ static void cleanup_device(struct ipath_ /* * free any resources still in use (usually just kernel ports) - * at unload - */ - for (port = 0; port < dd->ipath_cfgports; port++) - ipath_free_pddata(dd, port, 1); + * at unload; we do for portcnt, not cfgports, because cfgports + * could have changed while we were loaded. 
+ */ + for (port = 0; port < dd->ipath_portcnt; port++) { + struct ipath_portdata *pd = dd->ipath_pd[port]; + dd->ipath_pd[port] = NULL; + ipath_free_pddata(dd, pd); + } kfree(dd->ipath_pd); /* * debuggability, in case some cleanup path tries to use it diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 @@ -123,6 +123,7 @@ static int ipath_get_base_info(struct ip * on to yet another method of dealing with this */ kinfo->spi_rcvhdr_base = (u64) pd->port_rcvhdrq_phys; + kinfo->spi_rcvhdr_tailaddr = (u64)pd->port_rcvhdrqtailaddr_phys; kinfo->spi_rcv_egrbufs = (u64) pd->port_rcvegr_phys; kinfo->spi_pioavailaddr = (u64) dd->ipath_pioavailregs_phys; kinfo->spi_status = (u64) kinfo->spi_pioavailaddr + @@ -785,11 +786,12 @@ static int ipath_create_user_egr(struct bail_rcvegrbuf_phys: for (e = 0; e < pd->port_rcvegrbuf_chunks && - pd->port_rcvegrbuf[e]; e++) + pd->port_rcvegrbuf[e]; e++) { dma_free_coherent(&dd->pcidev->dev, size, pd->port_rcvegrbuf[e], pd->port_rcvegrbuf_phys[e]); + } vfree(pd->port_rcvegrbuf_phys); pd->port_rcvegrbuf_phys = NULL; bail_rcvegrbuf: @@ -804,10 +806,7 @@ static int ipath_do_user_init(struct ipa { int ret = 0; struct ipath_devdata *dd = pd->port_dd; - u64 physaddr, uaddr, off, atmp; - struct page *pagep; u32 head32; - u64 head; /* for now, if major version is different, bail */ if ((uinfo->spu_userversion >> 16) != IPATH_USER_SWMAJOR) { @@ -831,54 +830,6 @@ static int ipath_do_user_init(struct ipa } /* for now we do nothing with rcvhdrcnt: uinfo->spu_rcvhdrcnt */ - - /* set up for the rcvhdr Q tail register writeback to user memory */ - if (!uinfo->spu_rcvhdraddr || - !access_ok(VERIFY_WRITE, (u64 __user *) (unsigned long) - uinfo->spu_rcvhdraddr, sizeof(u64))) { - ipath_dbg("Port %d rcvhdrtail addr %llx not valid\n", - pd->port_port, - (unsigned long 
long) uinfo->spu_rcvhdraddr); - ret = -EINVAL; - goto done; - } - - off = offset_in_page(uinfo->spu_rcvhdraddr); - uaddr = PAGE_MASK & (unsigned long) uinfo->spu_rcvhdraddr; - ret = ipath_get_user_pages_nocopy(uaddr, &pagep); - if (ret) { - dev_info(&dd->pcidev->dev, "Failed to lookup and lock " - "address %llx for rcvhdrtail: errno %d\n", - (unsigned long long) uinfo->spu_rcvhdraddr, -ret); - goto done; - } - ipath_stats.sps_pagelocks++; - pd->port_rcvhdrtail_uaddr = uaddr; - pd->port_rcvhdrtail_pagep = pagep; - pd->port_rcvhdrtail_kvaddr = - page_address(pagep); - pd->port_rcvhdrtail_kvaddr += off; - physaddr = page_to_phys(pagep) + off; - ipath_cdbg(VERBOSE, "port %d user addr %llx hdrtailaddr, %llx " - "physical (off=%llx)\n", - pd->port_port, - (unsigned long long) uinfo->spu_rcvhdraddr, - (unsigned long long) physaddr, (unsigned long long) off); - ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdrtailaddr, - pd->port_port, physaddr); - atmp = ipath_read_kreg64_port(dd, - dd->ipath_kregs->kr_rcvhdrtailaddr, - pd->port_port); - if (physaddr != atmp) { - ipath_dev_err(dd, - "Catastrophic software error, " - "RcvHdrTailAddr%u written as %llx, " - "read back as %llx\n", pd->port_port, - (unsigned long long) physaddr, - (unsigned long long) atmp); - ret = -EINVAL; - goto done; - } /* for right now, kernel piobufs are at end, so port 1 is at 0 */ pd->port_piobufs = dd->ipath_piobufbase + @@ -898,26 +849,18 @@ static int ipath_do_user_init(struct ipa ret = ipath_create_user_egr(pd); if (ret) goto done; - /* enable receives now */ - /* atomically set enable bit for this port */ - set_bit(INFINIPATH_R_PORTENABLE_SHIFT + pd->port_port, - &dd->ipath_rcvctrl); /* - * set the head registers for this port to the current values + * set the eager head register for this port to the current values * of the tail pointers, since we don't know if they were * updated on last use of the port. 
*/ - head32 = ipath_read_ureg32(dd, ur_rcvhdrtail, pd->port_port); - head = (u64) head32; - ipath_write_ureg(dd, ur_rcvhdrhead, head, pd->port_port); head32 = ipath_read_ureg32(dd, ur_rcvegrindextail, pd->port_port); ipath_write_ureg(dd, ur_rcvegrindexhead, head32, pd->port_port); dd->ipath_lastegrheads[pd->port_port] = -1; dd->ipath_lastrcvhdrqtails[pd->port_port] = -1; - ipath_cdbg(VERBOSE, "Wrote port%d head %llx, egrhead %x from " - "tail regs\n", pd->port_port, - (unsigned long long) head, head32); + ipath_cdbg(VERBOSE, "Wrote port%d egrhead %x from tail regs\n", + pd->port_port, head32); pd->port_tidcursor = 0; /* start at beginning after open */ /* * now enable the port; the tail registers will be written to memory @@ -926,13 +869,62 @@ static int ipath_do_user_init(struct ipa * transition from 0 to 1, so clear it first, then set it as part of * enabling the port. This will (very briefly) affect any other * open ports, but it shouldn't be long enough to be an issue. + * We explicitly set the in-memory copy to 0 beforehand, so we don't + * have to wait to be sure the DMA update has happened. 
*/ + *pd->port_rcvhdrtail_kvaddr = 0ULL; + set_bit(INFINIPATH_R_PORTENABLE_SHIFT + pd->port_port, + &dd->ipath_rcvctrl); ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl, dd->ipath_rcvctrl & ~INFINIPATH_R_TAILUPD); ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl, dd->ipath_rcvctrl); - done: + return ret; +} + + +/* common code for the mappings on dma_alloc_coherent mem */ +static int ipath_mmap_mem(struct vm_area_struct *vma, + struct ipath_portdata *pd, unsigned len, + int write_ok, dma_addr_t addr, char *what) +{ + struct ipath_devdata *dd = pd->port_dd; + unsigned pfn = (unsigned long)addr >> PAGE_SHIFT; + int ret; + + if ((vma->vm_end - vma->vm_start) > len) { + dev_info(&dd->pcidev->dev, + "FAIL on %s: len %lx > %x\n", what, + vma->vm_end - vma->vm_start, len); + ret = -EFAULT; + goto bail; + } + + if (!write_ok) { + if (vma->vm_flags & VM_WRITE) { + dev_info(&dd->pcidev->dev, + "%s must be mapped readonly\n", what); + ret = -EPERM; + goto bail; + } + + /* don't allow them to later change with mprotect */ + vma->vm_flags &= ~VM_MAYWRITE; + } + + ret = remap_pfn_range(vma, vma->vm_start, pfn, + len, vma->vm_page_prot); + if (ret) + dev_info(&dd->pcidev->dev, + "%s port%u mmap of %lx, %x bytes r%c failed: %d\n", + what, pd->port_port, (unsigned long)addr, len, + write_ok?'w':'o', ret); + else + ipath_cdbg(VERBOSE, "%s port%u mmaped %lx, %x bytes r%c\n", + what, pd->port_port, (unsigned long)addr, len, + write_ok?'w':'o'); +bail: return ret; } @@ -942,8 +934,11 @@ static int mmap_ureg(struct vm_area_stru unsigned long phys; int ret; - /* it's the real hardware, so io_remap works */ - + /* + * This is real hardware, so use io_remap. This is the mechanism + * for the user process to update the head registers for their port + * in the chip. 
+ */ if ((vma->vm_end - vma->vm_start) > PAGE_SIZE) { dev_info(&dd->pcidev->dev, "FAIL mmap userreg: reqlen " "%lx > PAGE\n", vma->vm_end - vma->vm_start); @@ -969,10 +964,11 @@ static int mmap_piobufs(struct vm_area_s int ret; /* - * When we map the PIO buffers, we want to map them as writeonly, no - * read possible. + * When we map the PIO buffers in the chip, we want to map them as + * writeonly, no read possible. This prevents access to previous + * process data, and catches users who might try to read the i/o + * space due to a bug. */ - if ((vma->vm_end - vma->vm_start) > (dd->ipath_pbufsport * dd->ipath_palign)) { dev_info(&dd->pcidev->dev, "FAIL mmap piobufs: " @@ -983,11 +979,10 @@ static int mmap_piobufs(struct vm_area_s } phys = dd->ipath_physaddr + pd->port_piobufs; + /* - * Do *NOT* mark this as non-cached (PWT bit), or we don't get the + * Don't mark this as non-cached, or we don't get the * write combining behavior we want on the PIO buffers! - * vma->vm_page_prot = - * pgprot_noncached(vma->vm_page_prot); */ if (vma->vm_flags & VM_READ) { @@ -999,8 +994,7 @@ static int mmap_piobufs(struct vm_area_s } /* don't allow them to later change to readable with mprotect */ - - vma->vm_flags &= ~VM_MAYWRITE; + vma->vm_flags &= ~VM_MAYREAD; vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; ret = io_remap_pfn_range(vma, vma->vm_start, phys >> PAGE_SHIFT, @@ -1018,11 +1012,6 @@ static int mmap_rcvegrbufs(struct vm_are size_t total_size, i; dma_addr_t *phys; int ret; - - if (!pd->port_rcvegrbuf) { - ret = -EFAULT; - goto bail; - } size = pd->port_rcvegrbuf_size; total_size = pd->port_rcvegrbuf_chunks * size; @@ -1041,12 +1030,11 @@ static int mmap_rcvegrbufs(struct vm_are ret = -EPERM; goto bail; } + /* don't allow them to later change to writeable with mprotect */ + vma->vm_flags &= ~VM_MAYWRITE; start = vma->vm_start; phys = pd->port_rcvegrbuf_phys; - - /* don't allow them to later change to writeable with mprotect */ - vma->vm_flags &= ~VM_MAYWRITE; for (i = 0; i 
< pd->port_rcvegrbuf_chunks; i++, start += size) { ret = remap_pfn_range(vma, start, phys[i] >> PAGE_SHIFT, @@ -1056,78 +1044,6 @@ static int mmap_rcvegrbufs(struct vm_are } ret = 0; -bail: - return ret; -} - -static int mmap_rcvhdrq(struct vm_area_struct *vma, - struct ipath_portdata *pd) -{ - struct ipath_devdata *dd = pd->port_dd; - size_t total_size; - int ret; - - /* - * kmalloc'ed memory, physically contiguous; this is from - * spi_rcvhdr_base; we allow user to map read-write so they can - * write hdrq entries to allow protocol code to directly poll - * whether a hdrq entry has been written. - */ - total_size = ALIGN(dd->ipath_rcvhdrcnt * dd->ipath_rcvhdrentsize * - sizeof(u32), PAGE_SIZE); - if ((vma->vm_end - vma->vm_start) > total_size) { - dev_info(&dd->pcidev->dev, - "FAIL on rcvhdrq: reqlen %lx > actual %lx\n", - vma->vm_end - vma->vm_start, - (unsigned long) total_size); - ret = -EFAULT; - goto bail; - } - - ret = remap_pfn_range(vma, vma->vm_start, - pd->port_rcvhdrq_phys >> PAGE_SHIFT, - vma->vm_end - vma->vm_start, - vma->vm_page_prot); -bail: - return ret; -} - -static int mmap_pioavailregs(struct vm_area_struct *vma, - struct ipath_portdata *pd) -{ - struct ipath_devdata *dd = pd->port_dd; - int ret; - - /* - * when we map the PIO bufferavail registers, we want to map them as - * readonly, no write possible. 
- * - * kmalloc'ed memory, physically contiguous, one page only, readonly - */ - - if ((vma->vm_end - vma->vm_start) > PAGE_SIZE) { - dev_info(&dd->pcidev->dev, "FAIL on pioavailregs_dma: " - "reqlen %lx > actual %lx\n", - vma->vm_end - vma->vm_start, - (unsigned long) PAGE_SIZE); - ret = -EFAULT; - goto bail; - } - - if (vma->vm_flags & VM_WRITE) { - dev_info(&dd->pcidev->dev, - "Can't map pioavailregs as writable (flags=%lx)\n", - vma->vm_flags); - ret = -EPERM; - goto bail; - } - - /* don't allow them to later change with mprotect */ - vma->vm_flags &= ~VM_MAYWRITE; - - ret = remap_pfn_range(vma, vma->vm_start, - dd->ipath_pioavailregs_phys >> PAGE_SHIFT, - PAGE_SIZE, vma->vm_page_prot); bail: return ret; } @@ -1151,6 +1067,7 @@ static int ipath_mmap(struct file *fp, s pd = port_fp(fp); dd = pd->port_dd; + /* * This is the ipath_do_user_init() code, mapping the shared buffers * into the user process. The address referred to by vm_pgoff is the @@ -1160,29 +1077,59 @@ static int ipath_mmap(struct file *fp, s pgaddr = vma->vm_pgoff << PAGE_SHIFT; /* - * note that ureg does *NOT* have the kregvirt as part of it, to be - * sure that for 32 bit programs, we don't end up trying to map a > - * 44 address. Has to match ipath_get_base_info() code that sets - * __spi_uregbase + * Must fit in 40 bits for our hardware; some checked elsewhere, + * but we'll be paranoid. Check for 0 is mostly in case one of the + * allocations failed, but user called mmap anyway. We want to catch + * that before it can match. 
*/ - + if (!pgaddr || pgaddr >= (1ULL<<40)) { + ipath_dev_err(dd, "Bad phys addr %llx, start %lx, end %lx\n", + (unsigned long long)pgaddr, vma->vm_start, vma->vm_end); + return -EINVAL; + } + + /* just the offset of the port user registers, not physical addr */ ureg = dd->ipath_uregbase + dd->ipath_palign * pd->port_port; - ipath_cdbg(MM, "pgaddr %llx vm_start=%lx len %lx port %u:%u\n", + ipath_cdbg(MM, "ushare: pgaddr %llx vm_start=%lx, vmlen %lx\n", (unsigned long long) pgaddr, vma->vm_start, - vma->vm_end - vma->vm_start, dd->ipath_unit, - pd->port_port); - - if (pgaddr == ureg) + vma->vm_end - vma->vm_start); + + if (vma->vm_start & (PAGE_SIZE-1)) { + ipath_dev_err(dd, + "vm_start not aligned: %lx, end=%lx phys %lx\n", + vma->vm_start, vma->vm_end, (unsigned long)pgaddr); + ret = -EINVAL; + } + else if (pgaddr == ureg) ret = mmap_ureg(vma, dd, ureg); else if (pgaddr == pd->port_piobufs) ret = mmap_piobufs(vma, dd, pd); else if (pgaddr == (u64) pd->port_rcvegr_phys) ret = mmap_rcvegrbufs(vma, pd); - else if (pgaddr == (u64) pd->port_rcvhdrq_phys) - ret = mmap_rcvhdrq(vma, pd); + else if (pgaddr == (u64) pd->port_rcvhdrq_phys) { + /* + * The rcvhdrq itself; readonly except on HT-400 (so have + * to allow writable mapping), multiple pages, contiguous + * from an i/o perspective. 
+ */ + unsigned total_size = + ALIGN(dd->ipath_rcvhdrcnt * dd->ipath_rcvhdrentsize + * sizeof(u32), PAGE_SIZE); + ret = ipath_mmap_mem(vma, pd, total_size, 1, + pd->port_rcvhdrq_phys, + "rcvhdrq"); + } + else if (pgaddr == (u64)pd->port_rcvhdrqtailaddr_phys) + /* in-memory copy of rcvhdrq tail register */ + ret = ipath_mmap_mem(vma, pd, PAGE_SIZE, 0, + pd->port_rcvhdrqtailaddr_phys, + "rcvhdrq tail"); else if (pgaddr == dd->ipath_pioavailregs_phys) - ret = mmap_pioavailregs(vma, pd); + /* in-memory copy of pioavail registers */ + ret = ipath_mmap_mem(vma, pd, PAGE_SIZE, 0, + dd->ipath_pioavailregs_phys, + "pioavail registers"); else ret = -EINVAL; @@ -1539,14 +1486,6 @@ static int ipath_close(struct inode *in, } if (dd->ipath_kregbase) { - if (pd->port_rcvhdrtail_uaddr) { - pd->port_rcvhdrtail_uaddr = 0; - pd->port_rcvhdrtail_kvaddr = NULL; - ipath_release_user_pages_on_close( - &pd->port_rcvhdrtail_pagep, 1); - pd->port_rcvhdrtail_pagep = NULL; - ipath_stats.sps_pageunlocks++; - } ipath_write_kreg_port( dd, dd->ipath_kregs->kr_rcvhdrtailaddr, port, 0ULL); @@ -1583,9 +1522,9 @@ static int ipath_close(struct inode *in, dd->ipath_f_clear_tids(dd, pd->port_port); - ipath_free_pddata(dd, pd->port_port, 0); - + dd->ipath_pd[pd->port_port] = NULL; /* before releasing mutex */ mutex_unlock(&ipath_mutex); + ipath_free_pddata(dd, pd); /* after releasing the mutex */ return ret; } @@ -1905,3 +1844,4 @@ bail: bail: return; } + diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:25 2006 -0700 @@ -411,17 +411,8 @@ static int init_pioavailregs(struct ipat /* and its length */ dd->ipath_freezelen = L1_CACHE_BYTES - sizeof(dd->ipath_statusp[0]); - if (dd->ipath_unit * 64 > (IPATH_PORT0_RCVHDRTAIL_SIZE - 64)) { - ipath_dev_err(dd, "unit %u too large for port 0 " - "rcvhdrtail buffer size\n", 
dd->ipath_unit); - ret = -ENODEV; - } - else - ret = 0; - - /* so we can get current tail in ipath_kreceive(), per chip */ - dd->ipath_hdrqtailptr = &ipath_port0_rcvhdrtail[ - dd->ipath_unit * (64 / sizeof(*ipath_port0_rcvhdrtail))]; + ret = 0; + done: return ret; } @@ -654,7 +645,7 @@ int ipath_init_chip(struct ipath_devdata { int ret = 0, i; u32 val32, kpiobufs; - u64 val, atmp; + u64 val; struct ipath_portdata *pd = NULL; /* keep gcc4 happy */ ret = init_housekeeping(dd, &pd, reinit); @@ -777,24 +768,6 @@ int ipath_init_chip(struct ipath_devdata goto done; } - val = ipath_port0_rcvhdrtail_dma + dd->ipath_unit * 64; - - /* verify that the alignment requirement was met */ - ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdrtailaddr, - 0, val); - atmp = ipath_read_kreg64_port( - dd, dd->ipath_kregs->kr_rcvhdrtailaddr, 0); - if (val != atmp) { - ipath_dev_err(dd, "Catastrophic software error, " - "RcvHdrTailAddr0 written as %llx, " - "read back as %llx from %x\n", - (unsigned long long) val, - (unsigned long long) atmp, - dd->ipath_kregs->kr_rcvhdrtailaddr); - ret = -EINVAL; - goto done; - } - ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvbthqp, IPATH_KD_QP); /* @@ -845,12 +818,18 @@ int ipath_init_chip(struct ipath_devdata * re-init, the simplest way to handle this is to free * existing, and re-allocate. 
*/ - if (reinit) - ipath_free_pddata(dd, 0, 0); + if (reinit) { + struct ipath_portdata *pd = dd->ipath_pd[0]; + dd->ipath_pd[0] = NULL; + ipath_free_pddata(dd, pd); + } dd->ipath_f_tidtemplate(dd); ret = ipath_create_rcvhdrq(dd, pd); - if (!ret) + if (!ret) { + dd->ipath_hdrqtailptr = + (volatile __le64 *)pd->port_rcvhdrtail_kvaddr; ret = create_port0_egr(dd); + } if (ret) ipath_dev_err(dd, "failed to allocate port 0 (kernel) " "rcvhdrq and/or egr bufs\n"); diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 @@ -37,6 +37,7 @@ #include "ips_common.h" #include "ipath_layer.h" +/* These are all rcv-related errors which we want to count for stats */ #define E_SUM_PKTERRS \ (INFINIPATH_E_RHDRLEN | INFINIPATH_E_RBADTID | \ INFINIPATH_E_RBADVERSION | INFINIPATH_E_RHDR | \ @@ -45,12 +46,25 @@ INFINIPATH_E_RFORMATERR | INFINIPATH_E_RUNSUPVL | \ INFINIPATH_E_RUNEXPCHAR | INFINIPATH_E_REBP) +/* These are all send-related errors which we want to count for stats */ #define E_SUM_ERRS \ (INFINIPATH_E_SPIOARMLAUNCH | INFINIPATH_E_SUNEXPERRPKTNUM | \ INFINIPATH_E_SDROPPEDDATAPKT | INFINIPATH_E_SDROPPEDSMPPKT | \ INFINIPATH_E_SMAXPKTLEN | INFINIPATH_E_SUNSUPVL | \ INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SPKTLEN | \ INFINIPATH_E_INVALIDADDR) + +/* + * These are errors that can occur when the link changes state while + * a packet is being sent or received. This doesn't cover things + * like EBP or VCRC that can be the result of the sender having the + * link change state, so we receive a "known bad" packet. 
+ */ +#define E_SUM_LINK_PKTERRS \ + (INFINIPATH_E_SDROPPEDDATAPKT | INFINIPATH_E_SDROPPEDSMPPKT | \ + INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SPKTLEN | \ + INFINIPATH_E_RSHORTPKTLEN | INFINIPATH_E_RMINPKTLEN | \ + INFINIPATH_E_RUNEXPCHAR) static u64 handle_e_sum_errs(struct ipath_devdata *dd, ipath_err_t errs) { @@ -101,9 +115,7 @@ static u64 handle_e_sum_errs(struct ipat if (ipath_debug & __IPATH_PKTDBG) printk("\n"); } - if ((errs & (INFINIPATH_E_SDROPPEDDATAPKT | - INFINIPATH_E_SDROPPEDSMPPKT | - INFINIPATH_E_SMINPKTLEN)) && + if ((errs & E_SUM_LINK_PKTERRS) && !(dd->ipath_flags & IPATH_LINKACTIVE)) { /* * This can happen when SMA is trying to bring the link @@ -112,11 +124,9 @@ static u64 handle_e_sum_errs(struct ipat * valid. We don't want to confuse people, so we just * don't print them, except at debug */ - ipath_dbg("Ignoring pktsend errors %llx, because not " - "yet active\n", (unsigned long long) errs); - ignore_this_time = INFINIPATH_E_SDROPPEDDATAPKT | - INFINIPATH_E_SDROPPEDSMPPKT | - INFINIPATH_E_SMINPKTLEN; + ipath_dbg("Ignoring packet errors %llx, because link not " + "ACTIVE\n", (unsigned long long) errs); + ignore_this_time = errs & E_SUM_LINK_PKTERRS; } return ignore_this_time; @@ -157,7 +167,29 @@ static void handle_e_ibstatuschanged(str */ val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_ibcstatus); lstate = val & IPATH_IBSTATE_MASK; - if (lstate == IPATH_IBSTATE_INIT || lstate == IPATH_IBSTATE_ARM || + + /* + * this is confusing enough when it happens that I want to always put it + * on the console and in the logs. If it was a requested state change, + * we'll have already cleared the flags, so we won't print this warning + */ + if ((lstate != IPATH_IBSTATE_ARM && lstate != IPATH_IBSTATE_ACTIVE) + && (dd->ipath_flags & (IPATH_LINKARMED | IPATH_LINKACTIVE))) { + dev_info(&dd->pcidev->dev, "Link state changed from %s to %s\n", + (dd->ipath_flags & IPATH_LINKARMED) ? 
"ARM" : "ACTIVE", + ib_linkstate(lstate)); + /* + * Flush all queued sends when link went to DOWN or INIT, + * to be sure that they don't block SMA and other MAD packets + */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + INFINIPATH_S_ABORT); + ipath_disarm_piobufs(dd, dd->ipath_lastport_piobuf, + (unsigned)(dd->ipath_piobcnt2k + + dd->ipath_piobcnt4k) - + dd->ipath_lastport_piobuf); + } + else if (lstate == IPATH_IBSTATE_INIT || lstate == IPATH_IBSTATE_ARM || lstate == IPATH_IBSTATE_ACTIVE) { /* * only print at SMA if there is a change, debug if not @@ -380,6 +412,19 @@ static void handle_errors(struct ipath_d if (errs & E_SUM_ERRS) ignore_this_time = handle_e_sum_errs(dd, errs); + else if ((errs & E_SUM_LINK_PKTERRS) && + !(dd->ipath_flags & IPATH_LINKACTIVE)) { + /* + * This can happen when SMA is trying to bring the link + * up, but the IB link changes state at the "wrong" time. + * The IB logic then complains that the packet isn't + * valid. We don't want to confuse people, so we just + * don't print them, except at debug + */ + ipath_dbg("Ignoring packet errors %llx, because link not " + "ACTIVE\n", (unsigned long long) errs); + ignore_this_time = errs & E_SUM_LINK_PKTERRS; + } if (supp_msgs == 250000) { /* diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:25 2006 -0700 @@ -62,9 +62,7 @@ struct ipath_portdata { /* rcvhdrq base, needs mmap before useful */ void *port_rcvhdrq; /* kernel virtual address where hdrqtail is updated */ - u64 *port_rcvhdrtail_kvaddr; - /* page * used for uaddr */ - struct page *port_rcvhdrtail_pagep; + volatile __le64 *port_rcvhdrtail_kvaddr; /* * temp buffer for expected send setup, allocated at open, instead * of each setup call @@ -79,11 +77,7 @@ struct ipath_portdata { dma_addr_t port_rcvegr_phys; /* mmap of hdrq, must fit in 44 bits */ dma_addr_t 
port_rcvhdrq_phys; - /* - * the actual user address that we ipath_mlock'ed, so we can - * ipath_munlock it at close - */ - unsigned long port_rcvhdrtail_uaddr; + dma_addr_t port_rcvhdrqtailaddr_phys; /* * number of opens on this instance (0 or 1; ignoring forks, dup, * etc. for now) @@ -515,11 +509,6 @@ struct ipath_devdata { u8 ipath_lmc; }; -extern volatile __le64 *ipath_port0_rcvhdrtail; -extern dma_addr_t ipath_port0_rcvhdrtail_dma; - -#define IPATH_PORT0_RCVHDRTAIL_SIZE PAGE_SIZE - extern struct list_head ipath_dev_list; extern spinlock_t ipath_devs_lock; extern struct ipath_devdata *ipath_lookup(int unit); @@ -579,7 +568,7 @@ void ipath_disarm_piobufs(struct ipath_d unsigned cnt); int ipath_create_rcvhdrq(struct ipath_devdata *, struct ipath_portdata *); -void ipath_free_pddata(struct ipath_devdata *, u32, int); +void ipath_free_pddata(struct ipath_devdata *, struct ipath_portdata *); int ipath_parse_ushort(const char *str, unsigned short *valp); From bos at pathscale.com Thu Jun 29 14:41:18 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:18 -0700 Subject: [openib-general] [PATCH 27 of 39] IB/ipath - fixes to performance get counters for IB compliance In-Reply-To: Message-ID: <7d22a8963bdaca778b13.1151617278@eng-12.pathscale.com> This patch fixes some problems uncovered during IB compliance testing to return the right values for error counters returned by the Performance Get Counters packet. Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -460,6 +460,8 @@ static int __devinit ipath_init_one(stru * by ipath_setup_htconfig. 
*/ dd->ipath_flags = 0; + dd->ipath_lli_counter = 0; + dd->ipath_lli_errors = 0; if (dd->ipath_f_bus(dd, pdev)) ipath_dev_err(dd, "Failed to setup config space; " @@ -942,6 +944,18 @@ reloop: "tlen=%x opcode=%x egridx=%x: %s\n", eflags, l, etype, tlen, bthbytes[0], ips_get_index((__le32 *) rc), emsg); + /* Count local link integrity errors. */ + if (eflags & (INFINIPATH_RHF_H_ICRCERR | + INFINIPATH_RHF_H_VCRCERR)) { + u8 n = (dd->ipath_ibcctrl >> + INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK; + + if (++dd->ipath_lli_counter > n) { + dd->ipath_lli_counter = 0; + dd->ipath_lli_errors++; + } + } } else if (etype == RCVHQ_RCV_TYPE_NON_KD) { int ret = __ipath_verbs_rcv(dd, rc + 1, ebuf, tlen); @@ -949,6 +963,9 @@ reloop: ipath_cdbg(VERBOSE, "received IB packet, " "not SMA (QP=%x)\n", qp); + if (dd->ipath_lli_counter) + dd->ipath_lli_counter--; + } else if (etype == RCVHQ_RCV_TYPE_EAGER) { if (qp == IPATH_KD_QP && bthbytes[0] == ipath_layer_rcv_opcode && diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -262,6 +262,7 @@ static void handle_e_ibstatuschanged(str | IPATH_LINKACTIVE | IPATH_LINKARMED); *dd->ipath_statusp &= ~IPATH_STATUS_IB_READY; + dd->ipath_lli_counter = 0; if (!noprint) { if (((dd->ipath_lastibcstat >> INFINIPATH_IBCS_LINKSTATE_SHIFT) & diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 @@ -507,6 +507,11 @@ struct ipath_devdata { u8 ipath_pci_cacheline; /* LID mask control */ u8 ipath_lmc; + + /* local link integrity counter */ + u32 ipath_lli_counter; + /* local link integrity errors */ + u32 ipath_lli_errors; }; extern struct list_head 
ipath_dev_list; diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 @@ -1032,19 +1032,22 @@ int ipath_layer_get_counters(struct ipat ipath_snap_cntr(dd, dd->ipath_cregs->cr_ibsymbolerrcnt); cntrs->link_error_recovery_counter = ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkerrrecovcnt); + /* + * The link downed counter counts when the other side downs the + * connection. We add in the number of times we downed the link + * due to local link integrity errors to compensate. + */ cntrs->link_downed_counter = ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkdowncnt); cntrs->port_rcv_errors = ipath_snap_cntr(dd, dd->ipath_cregs->cr_rxdroppktcnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvovflcnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_portovflcnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_errrcvflowctrlcnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_err_rlencnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_invalidrlencnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_erricrccnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_errvcrccnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_errlpcrccnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_errlinkcnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_badformatcnt); cntrs->port_rcv_remphys_errors = ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvebpcnt); @@ -1058,6 +1061,8 @@ int ipath_layer_get_counters(struct ipat ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktsendcnt); cntrs->port_rcv_packets = ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktrcvcnt); + cntrs->local_link_integrity_errors = dd->ipath_lli_errors; + cntrs->excessive_buffer_overrun_errors = 0; /* XXX */ ret = 0; diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_layer.h --- a/drivers/infiniband/hw/ipath/ipath_layer.h Thu Jun 29 14:33:26 2006 -0700 +++ 
b/drivers/infiniband/hw/ipath/ipath_layer.h Thu Jun 29 14:33:26 2006 -0700 @@ -55,6 +55,8 @@ struct ipath_layer_counters { u64 port_rcv_data; u64 port_xmit_packets; u64 port_rcv_packets; + u32 local_link_integrity_errors; + u32 excessive_buffer_overrun_errors; }; /* diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_mad.c --- a/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 @@ -613,6 +613,9 @@ struct ib_pma_portcounters { #define IB_PMA_SEL_PORT_RCV_ERRORS __constant_htons(0x0008) #define IB_PMA_SEL_PORT_RCV_REMPHYS_ERRORS __constant_htons(0x0010) #define IB_PMA_SEL_PORT_XMIT_DISCARDS __constant_htons(0x0040) +#define IB_PMA_SEL_LOCAL_LINK_INTEGRITY_ERRORS __constant_htons(0x0200) +#define IB_PMA_SEL_EXCESSIVE_BUFFER_OVERRUNS __constant_htons(0x0400) +#define IB_PMA_SEL_PORT_VL15_DROPPED __constant_htons(0x0800) #define IB_PMA_SEL_PORT_XMIT_DATA __constant_htons(0x1000) #define IB_PMA_SEL_PORT_RCV_DATA __constant_htons(0x2000) #define IB_PMA_SEL_PORT_XMIT_PACKETS __constant_htons(0x4000) @@ -859,6 +862,10 @@ static int recv_pma_get_portcounters(str cntrs.port_rcv_data -= dev->z_port_rcv_data; cntrs.port_xmit_packets -= dev->z_port_xmit_packets; cntrs.port_rcv_packets -= dev->z_port_rcv_packets; + cntrs.local_link_integrity_errors -= + dev->z_local_link_integrity_errors; + cntrs.excessive_buffer_overrun_errors -= + dev->z_excessive_buffer_overrun_errors; memset(pmp->data, 0, sizeof(pmp->data)); @@ -896,6 +903,16 @@ static int recv_pma_get_portcounters(str else p->port_xmit_discards = cpu_to_be16((u16)cntrs.port_xmit_discards); + if (cntrs.local_link_integrity_errors > 0xFUL) + cntrs.local_link_integrity_errors = 0xFUL; + if (cntrs.excessive_buffer_overrun_errors > 0xFUL) + cntrs.excessive_buffer_overrun_errors = 0xFUL; + p->lli_ebor_errors = (cntrs.local_link_integrity_errors << 4) | + cntrs.excessive_buffer_overrun_errors; + if 
(dev->n_vl15_dropped > 0xFFFFUL) + p->vl15_dropped = __constant_cpu_to_be16(0xFFFF); + else + p->vl15_dropped = cpu_to_be16((u16)dev->n_vl15_dropped); if (cntrs.port_xmit_data > 0xFFFFFFFFUL) p->port_xmit_data = __constant_cpu_to_be32(0xFFFFFFFF); else @@ -989,6 +1006,17 @@ static int recv_pma_set_portcounters(str if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DISCARDS) dev->z_port_xmit_discards = cntrs.port_xmit_discards; + + if (p->counter_select & IB_PMA_SEL_LOCAL_LINK_INTEGRITY_ERRORS) + dev->z_local_link_integrity_errors = + cntrs.local_link_integrity_errors; + + if (p->counter_select & IB_PMA_SEL_EXCESSIVE_BUFFER_OVERRUNS) + dev->z_excessive_buffer_overrun_errors = + cntrs.excessive_buffer_overrun_errors; + + if (p->counter_select & IB_PMA_SEL_PORT_VL15_DROPPED) + dev->n_vl15_dropped = 0; if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DATA) dev->z_port_xmit_data = cntrs.port_xmit_data; @@ -1275,32 +1303,8 @@ int ipath_process_mad(struct ib_device * struct ib_wc *in_wc, struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad) { - struct ipath_ibdev *dev = to_idev(ibdev); int ret; - /* - * Snapshot current HW counters to "clear" them. - * This should be done when the driver is loaded except that for - * some reason we get a zillion errors when brining up the link. 
- */ - if (dev->rcv_errors == 0) { - struct ipath_layer_counters cntrs; - - ipath_layer_get_counters(to_idev(ibdev)->dd, &cntrs); - dev->rcv_errors++; - dev->z_symbol_error_counter = cntrs.symbol_error_counter; - dev->z_link_error_recovery_counter = - cntrs.link_error_recovery_counter; - dev->z_link_downed_counter = cntrs.link_downed_counter; - dev->z_port_rcv_errors = cntrs.port_rcv_errors + 1; - dev->z_port_rcv_remphys_errors = - cntrs.port_rcv_remphys_errors; - dev->z_port_xmit_discards = cntrs.port_xmit_discards; - dev->z_port_xmit_data = cntrs.port_xmit_data; - dev->z_port_rcv_data = cntrs.port_rcv_data; - dev->z_port_xmit_packets = cntrs.port_xmit_packets; - dev->z_port_rcv_packets = cntrs.port_rcv_packets; - } switch (in_mad->mad_hdr.mgmt_class) { case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: case IB_MGMT_CLASS_SUBN_LID_ROUTED: diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 @@ -560,7 +560,16 @@ void ipath_ud_rcv(struct ipath_ibdev *de spin_lock_irqsave(&rq->lock, flags); if (rq->tail == rq->head) { spin_unlock_irqrestore(&rq->lock, flags); - dev->n_pkt_drops++; + /* + * Count VL15 packets dropped due to no receive buffer. + * Otherwise, count them as buffer overruns since usually, + * the HW will be able to receive packets even if there are + * no QPs with posted receive buffers. + */ + if (qp->ibqp.qp_num == 0) + dev->n_vl15_dropped++; + else + dev->rcv_errors++; goto bail; } /* Silently drop packets which are too big. 
*/ diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 @@ -981,6 +981,7 @@ static int ipath_verbs_register_sysfs(st */ static void *ipath_register_ib_device(int unit, struct ipath_devdata *dd) { + struct ipath_layer_counters cntrs; struct ipath_ibdev *idev; struct ib_device *dev; int ret; @@ -1030,6 +1031,25 @@ static void *ipath_register_ib_device(in idev->pma_counter_select[3] = IB_PMA_PORT_RCV_PKTS; idev->pma_counter_select[5] = IB_PMA_PORT_XMIT_WAIT; idev->link_width_enabled = 3; /* 1x or 4x */ + + /* Snapshot current HW counters to "clear" them. */ + ipath_layer_get_counters(dd, &cntrs); + idev->z_symbol_error_counter = cntrs.symbol_error_counter; + idev->z_link_error_recovery_counter = + cntrs.link_error_recovery_counter; + idev->z_link_downed_counter = cntrs.link_downed_counter; + idev->z_port_rcv_errors = cntrs.port_rcv_errors; + idev->z_port_rcv_remphys_errors = + cntrs.port_rcv_remphys_errors; + idev->z_port_xmit_discards = cntrs.port_xmit_discards; + idev->z_port_xmit_data = cntrs.port_xmit_data; + idev->z_port_rcv_data = cntrs.port_rcv_data; + idev->z_port_xmit_packets = cntrs.port_xmit_packets; + idev->z_port_rcv_packets = cntrs.port_rcv_packets; + idev->z_local_link_integrity_errors = + cntrs.local_link_integrity_errors; + idev->z_excessive_buffer_overrun_errors = + cntrs.excessive_buffer_overrun_errors; /* * The system image GUID is supposed to be the same for all diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 @@ -460,6 +460,8 @@ struct ipath_ibdev { u64 z_port_xmit_packets; /* starting count for PMA */ u64 z_port_rcv_packets; /* starting count for PMA */ u32 z_pkey_violations; /* 
starting count for PMA */ + u32 z_local_link_integrity_errors; /* starting count for PMA */ + u32 z_excessive_buffer_overrun_errors; /* starting count for PMA */ u32 n_rc_resends; u32 n_rc_acks; u32 n_rc_qacks; @@ -469,6 +471,7 @@ struct ipath_ibdev { u32 n_other_naks; u32 n_timeouts; u32 n_pkt_drops; + u32 n_vl15_dropped; u32 n_wqe_errs; u32 n_rdma_dup_busy; u32 n_piowait; From bos at pathscale.com Thu Jun 29 14:41:07 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:07 -0700 Subject: [openib-general] [PATCH 16 of 39] IB/ipath - enable freeze mode when shutting down device In-Reply-To: Message-ID: Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 125471ee6c68 -r fd5e733f02ac drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -1656,7 +1656,7 @@ void ipath_shutdown_device(struct ipath_ /* disable IBC */ dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; ipath_write_kreg(dd, dd->ipath_kregs->kr_control, - dd->ipath_control); + dd->ipath_control | INFINIPATH_C_FREEZEMODE); /* * clear SerdesEnable and turn the leds off; do this here because From bos at pathscale.com Thu Jun 29 14:41:20 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:20 -0700 Subject: [openib-general] [PATCH 29 of 39] IB/ipath - RC receive interrupt performance changes In-Reply-To: Message-ID: <1bef8244297aef83d9a6.1151617280@eng-12.pathscale.com> This patch separates QP state used for sending and receiving RC packets so the processing in the receive interrupt handler can be done mostly without locks being held. ACK packets are now sent without requiring synchronization with the send tasklet. 
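The state split the description refers to can be sketched in miniature. This is a hypothetical, greatly simplified model, not the driver's API: the names `qp_sketch`, `stage_ack`, and `handoff_ack` are illustrative. The idea matching the patch is that the receive handler records the ACK it wants to send in its own r_* fields (which only the receive path touches, so no lock is needed), and only a short handoff copies them into the s_* fields the send tasklet consumes.

```c
#include <assert.h>

/* Opcodes borrowed loosely from the RC opcode space for illustration. */
enum { OP_ACKNOWLEDGE = 0x11, OP_SEND_ONLY = 0x04 };

struct qp_sketch {
	/* s_* fields: consumed by the send tasklet; in the driver these
	 * are guarded by qp->s_lock. */
	unsigned s_ack_state;
	unsigned s_ack_psn;
	/* r_* fields: written only by the receive interrupt handler,
	 * so they need no lock of their own. */
	unsigned r_ack_state;
	unsigned r_ack_psn;
};

/* Receive path: note a pending ACK without taking the send lock. */
static void stage_ack(struct qp_sketch *qp, unsigned opcode, unsigned psn)
{
	qp->r_ack_state = opcode;
	qp->r_ack_psn = psn;
}

/* Handoff: in the driver this short section runs under
 * spin_lock_irq(&qp->s_lock); here the locking is elided.
 * Returns 1 if the pending ACK was moved to the send side. */
static int handoff_ack(struct qp_sketch *qp)
{
	int moved = 0;

	/* Only hand off if the send side has no ACK of its own queued. */
	if (qp->s_ack_state == OP_ACKNOWLEDGE) {
		qp->s_ack_state = qp->r_ack_state;
		qp->s_ack_psn = qp->r_ack_psn;
		qp->r_ack_state = OP_ACKNOWLEDGE;
		moved = 1;
	}
	return moved;
}
```

The payoff is that the common receive path never contends on s_lock; the lock is only taken for the brief r_*-to-s_* copy, which is essentially what the patch's send_rc_ack() fallback does when it has to defer the ACK to the send tasklet.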
Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_keys.c --- a/drivers/infiniband/hw/ipath/ipath_keys.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_keys.c Thu Jun 29 14:33:26 2006 -0700 @@ -121,6 +121,7 @@ int ipath_lkey_ok(struct ipath_lkey_tabl struct ib_sge *sge, int acc) { struct ipath_mregion *mr; + unsigned n, m; size_t off; int ret; @@ -152,20 +153,22 @@ int ipath_lkey_ok(struct ipath_lkey_tabl } off += mr->offset; + m = 0; + n = 0; + while (off >= mr->map[m]->segs[n].length) { + off -= mr->map[m]->segs[n].length; + n++; + if (n >= IPATH_SEGSZ) { + m++; + n = 0; + } + } isge->mr = mr; - isge->m = 0; - isge->n = 0; - while (off >= mr->map[isge->m]->segs[isge->n].length) { - off -= mr->map[isge->m]->segs[isge->n].length; - isge->n++; - if (isge->n >= IPATH_SEGSZ) { - isge->m++; - isge->n = 0; - } - } - isge->vaddr = mr->map[isge->m]->segs[isge->n].vaddr + off; - isge->length = mr->map[isge->m]->segs[isge->n].length - off; + isge->vaddr = mr->map[m]->segs[n].vaddr + off; + isge->length = mr->map[m]->segs[n].length - off; isge->sge_length = sge->length; + isge->m = m; + isge->n = n; ret = 1; @@ -190,6 +193,7 @@ int ipath_rkey_ok(struct ipath_ibdev *de struct ipath_lkey_table *rkt = &dev->lk_table; struct ipath_sge *sge = &ss->sge; struct ipath_mregion *mr; + unsigned n, m; size_t off; int ret; @@ -207,20 +211,22 @@ int ipath_rkey_ok(struct ipath_ibdev *de } off += mr->offset; + m = 0; + n = 0; + while (off >= mr->map[m]->segs[n].length) { + off -= mr->map[m]->segs[n].length; + n++; + if (n >= IPATH_SEGSZ) { + m++; + n = 0; + } + } sge->mr = mr; - sge->m = 0; - sge->n = 0; - while (off >= mr->map[sge->m]->segs[sge->n].length) { - off -= mr->map[sge->m]->segs[sge->n].length; - sge->n++; - if (sge->n >= IPATH_SEGSZ) { - sge->m++; - sge->n = 0; - } - } - sge->vaddr = mr->map[sge->m]->segs[sge->n].vaddr + off; - sge->length = 
mr->map[sge->m]->segs[sge->n].length - off; + sge->vaddr = mr->map[m]->segs[n].vaddr + off; + sge->length = mr->map[m]->segs[n].length - off; sge->sge_length = len; + sge->m = m; + sge->n = n; ss->sg_list = NULL; ss->num_sge = 1; diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 @@ -333,10 +333,11 @@ static void ipath_reset_qp(struct ipath_ qp->remote_qpn = 0; qp->qkey = 0; qp->qp_access_flags = 0; + clear_bit(IPATH_S_BUSY, &qp->s_flags); qp->s_hdrwords = 0; qp->s_psn = 0; qp->r_psn = 0; - atomic_set(&qp->msn, 0); + qp->r_msn = 0; if (qp->ibqp.qp_type == IB_QPT_RC) { qp->s_state = IB_OPCODE_RC_SEND_LAST; qp->r_state = IB_OPCODE_RC_SEND_LAST; @@ -345,7 +346,8 @@ static void ipath_reset_qp(struct ipath_ qp->r_state = IB_OPCODE_UC_SEND_LAST; } qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - qp->s_nak_state = 0; + qp->r_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + qp->r_nak_state = 0; qp->s_rnr_timeout = 0; qp->s_head = 0; qp->s_tail = 0; @@ -363,10 +365,10 @@ static void ipath_reset_qp(struct ipath_ * @qp: the QP to put into an error state * * Flushes both send and receive work queues. - * QP r_rq.lock and s_lock should be held. - */ - -static void ipath_error_qp(struct ipath_qp *qp) + * QP s_lock should be held and interrupts disabled. 
+ */ + +void ipath_error_qp(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ib_wc wc; @@ -409,12 +411,14 @@ static void ipath_error_qp(struct ipath_ qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; wc.opcode = IB_WC_RECV; + spin_lock(&qp->r_rq.lock); while (qp->r_rq.tail != qp->r_rq.head) { wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; if (++qp->r_rq.tail >= qp->r_rq.size) qp->r_rq.tail = 0; ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); } + spin_unlock(&qp->r_rq.lock); } /** @@ -434,8 +438,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, unsigned long flags; int ret; - spin_lock_irqsave(&qp->r_rq.lock, flags); - spin_lock(&qp->s_lock); + spin_lock_irqsave(&qp->s_lock, flags); cur_state = attr_mask & IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state; @@ -506,31 +509,19 @@ int ipath_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_MIN_RNR_TIMER) - qp->s_min_rnr_timer = attr->min_rnr_timer; + qp->r_min_rnr_timer = attr->min_rnr_timer; if (attr_mask & IB_QP_QKEY) qp->qkey = attr->qkey; qp->state = new_state; - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); - - /* - * If QP1 changed to the RTS state, try to move to the link to INIT - * even if it was ACTIVE so the SM will reinitialize the SMA's - * state. 
- */ - if (qp->ibqp.qp_num == 1 && new_state == IB_QPS_RTS) { - struct ipath_ibdev *dev = to_idev(ibqp->device); - - ipath_layer_set_linkstate(dev->dd, IPATH_IB_LINKDOWN); - } + spin_unlock_irqrestore(&qp->s_lock, flags); + ret = 0; goto bail; inval: - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); + spin_unlock_irqrestore(&qp->s_lock, flags); ret = -EINVAL; bail: @@ -564,7 +555,7 @@ int ipath_query_qp(struct ib_qp *ibqp, s attr->sq_draining = 0; attr->max_rd_atomic = 1; attr->max_dest_rd_atomic = 1; - attr->min_rnr_timer = qp->s_min_rnr_timer; + attr->min_rnr_timer = qp->r_min_rnr_timer; attr->port_num = 1; attr->timeout = 0; attr->retry_cnt = qp->s_retry_cnt; @@ -591,16 +582,12 @@ int ipath_query_qp(struct ib_qp *ibqp, s * @qp: the queue pair to compute the AETH for * * Returns the AETH. - * - * The QP s_lock should be held. */ __be32 ipath_compute_aeth(struct ipath_qp *qp) { - u32 aeth = atomic_read(&qp->msn) & IPS_MSN_MASK; - - if (qp->s_nak_state) { - aeth |= qp->s_nak_state << IPS_AETH_CREDIT_SHIFT; - } else if (qp->ibqp.srq) { + u32 aeth = qp->r_msn & IPS_MSN_MASK; + + if (qp->ibqp.srq) { /* * Shared receive queues don't generate credits. * Set the credit field to the invalid value. diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:26 2006 -0700 @@ -42,7 +42,7 @@ * @qp: the QP who's SGE we're restarting * @wqe: the work queue to initialize the QP's SGE from * - * The QP s_lock should be held. + * The QP s_lock should be held and interrupts disabled. 
*/ static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe) { @@ -77,7 +77,6 @@ u32 ipath_make_rc_ack(struct ipath_qp *q struct ipath_other_headers *ohdr, u32 pmtu) { - struct ipath_sge_state *ss; u32 hwords; u32 len; u32 bth0; @@ -91,7 +90,7 @@ u32 ipath_make_rc_ack(struct ipath_qp *q */ switch (qp->s_ack_state) { case OP(RDMA_READ_REQUEST): - ss = &qp->s_rdma_sge; + qp->s_cur_sge = &qp->s_rdma_sge; len = qp->s_rdma_len; if (len > pmtu) { len = pmtu; @@ -108,7 +107,7 @@ u32 ipath_make_rc_ack(struct ipath_qp *q qp->s_ack_state = OP(RDMA_READ_RESPONSE_MIDDLE); /* FALLTHROUGH */ case OP(RDMA_READ_RESPONSE_MIDDLE): - ss = &qp->s_rdma_sge; + qp->s_cur_sge = &qp->s_rdma_sge; len = qp->s_rdma_len; if (len > pmtu) len = pmtu; @@ -127,41 +126,50 @@ u32 ipath_make_rc_ack(struct ipath_qp *q * We have to prevent new requests from changing * the r_sge state while a ipath_verbs_send() * is in progress. - * Changing r_state allows the receiver - * to continue processing new packets. - * We do it here now instead of above so - * that we are sure the packet was sent before - * changing the state. - */ - qp->r_state = OP(RDMA_READ_RESPONSE_LAST); + */ qp->s_ack_state = OP(ACKNOWLEDGE); - return 0; + bth0 = 0; + goto bail; case OP(COMPARE_SWAP): case OP(FETCH_ADD): - ss = NULL; + qp->s_cur_sge = NULL; len = 0; - qp->r_state = OP(SEND_LAST); - qp->s_ack_state = OP(ACKNOWLEDGE); - bth0 = IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; + /* + * Set the s_ack_state so the receive interrupt handler + * won't try to send an ACK (out of order) until this one + * is actually sent. + */ + qp->s_ack_state = OP(RDMA_READ_RESPONSE_LAST); + bth0 = OP(ATOMIC_ACKNOWLEDGE) << 24; ohdr->u.at.aeth = ipath_compute_aeth(qp); - ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); + ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->r_atomic_data); hwords += sizeof(ohdr->u.at) / 4; break; default: /* Send a regular ACK. 
*/ - ss = NULL; + qp->s_cur_sge = NULL; len = 0; - qp->s_ack_state = OP(ACKNOWLEDGE); - bth0 = qp->s_ack_state << 24; - ohdr->u.aeth = ipath_compute_aeth(qp); + /* + * Set the s_ack_state so the receive interrupt handler + * won't try to send an ACK (out of order) until this one + * is actually sent. + */ + qp->s_ack_state = OP(RDMA_READ_RESPONSE_LAST); + bth0 = OP(ACKNOWLEDGE) << 24; + if (qp->s_nak_state) + ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPS_MSN_MASK) | + (qp->s_nak_state << + IPS_AETH_CREDIT_SHIFT)); + else + ohdr->u.aeth = ipath_compute_aeth(qp); hwords++; } qp->s_hdrwords = hwords; - qp->s_cur_sge = ss; qp->s_cur_size = len; +bail: return bth0; } @@ -174,7 +182,7 @@ u32 ipath_make_rc_ack(struct ipath_qp *q * @bth2p: pointer to the BTH PSN word * * Return 1 if constructed; otherwise, return 0. - * Note the QP s_lock must be held. + * Note the QP s_lock must be held and interrupts disabled. */ int ipath_make_rc_req(struct ipath_qp *qp, struct ipath_other_headers *ohdr, @@ -356,6 +364,11 @@ int ipath_make_rc_req(struct ipath_qp *q bth2 |= qp->s_psn++ & IPS_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn = qp->s_psn; + /* + * Put the QP on the pending list so lost ACKs will cause + * a retry. More than one request can be pending so the + * QP may already be on the dev->pending list. + */ spin_lock(&dev->pending_lock); if (list_empty(&qp->timerwait)) list_add_tail(&qp->timerwait, @@ -365,8 +378,8 @@ int ipath_make_rc_req(struct ipath_qp *q case OP(RDMA_READ_RESPONSE_FIRST): /* - * This case can only happen if a send is restarted. See - * ipath_restart_rc(). + * This case can only happen if a send is restarted. + * See ipath_restart_rc(). 
*/ ipath_init_restart(qp, wqe); /* FALLTHROUGH */ @@ -526,11 +539,17 @@ static void send_rc_ack(struct ipath_qp ohdr = &hdr.u.l.oth; lrh0 = IPS_LRH_GRH; } + /* read pkey_index w/o lock (its atomic) */ bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); - ohdr->u.aeth = ipath_compute_aeth(qp); - if (qp->s_ack_state >= OP(COMPARE_SWAP)) { - bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; - ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); + if (qp->r_nak_state) + ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPS_MSN_MASK) | + (qp->r_nak_state << + IPS_AETH_CREDIT_SHIFT)); + else + ohdr->u.aeth = ipath_compute_aeth(qp); + if (qp->r_ack_state >= OP(COMPARE_SWAP)) { + bth0 |= OP(ATOMIC_ACKNOWLEDGE) << 24; + ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->r_atomic_data); hwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; } else bth0 |= OP(ACKNOWLEDGE) << 24; @@ -541,15 +560,36 @@ static void send_rc_ack(struct ipath_qp hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); ohdr->bth[0] = cpu_to_be32(bth0); ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); - ohdr->bth[2] = cpu_to_be32(qp->s_ack_psn & IPS_PSN_MASK); + ohdr->bth[2] = cpu_to_be32(qp->r_ack_psn & IPS_PSN_MASK); /* * If we can send the ACK, clear the ACK state. */ if (ipath_verbs_send(dev->dd, hwords, (u32 *) &hdr, 0, NULL) == 0) { - qp->s_ack_state = OP(ACKNOWLEDGE); + qp->r_ack_state = OP(ACKNOWLEDGE); + dev->n_unicast_xmit++; + } else { + /* + * We are out of PIO buffers at the moment. + * Pass responsibility for sending the ACK to the + * send tasklet so that when a PIO buffer becomes + * available, the ACK is sent ahead of other outgoing + * packets. + */ dev->n_rc_qacks++; - dev->n_unicast_xmit++; + spin_lock_irq(&qp->s_lock); + /* Don't coalesce if a RDMA read or atomic is pending. 
*/ + if (qp->s_ack_state == OP(ACKNOWLEDGE) || + qp->s_ack_state < OP(RDMA_READ_REQUEST)) { + qp->s_ack_state = qp->r_ack_state; + qp->s_nak_state = qp->r_nak_state; + qp->s_ack_psn = qp->r_ack_psn; + qp->r_ack_state = OP(ACKNOWLEDGE); + } + spin_unlock_irq(&qp->s_lock); + + /* Call ipath_do_rc_send() in another thread. */ + tasklet_hi_schedule(&qp->s_task); } } @@ -641,7 +681,7 @@ done: * @psn: packet sequence number for the request * @wc: the work completion request * - * The QP s_lock should be held. + * The QP s_lock should be held and interrupts disabled. */ void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc) { @@ -705,7 +745,7 @@ bail: * * This is called from ipath_rc_rcv_resp() to process an incoming RC ACK * for the given QP. - * Called at interrupt level with the QP s_lock held. + * Called at interrupt level with the QP s_lock held and interrupts disabled. * Returns 1 if OK, 0 if current operation should be aborted (NAK). */ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode) @@ -1126,18 +1166,16 @@ static inline int ipath_rc_rcv_error(str * Don't queue the NAK if a RDMA read, atomic, or * NAK is pending though. */ - spin_lock(&qp->s_lock); - if ((qp->s_ack_state >= OP(RDMA_READ_REQUEST) && - qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) || - qp->s_nak_state != 0) { - spin_unlock(&qp->s_lock); + if (qp->s_ack_state != OP(ACKNOWLEDGE) || + qp->r_nak_state != 0) goto done; - } - qp->s_ack_state = OP(SEND_ONLY); - qp->s_nak_state = IB_NAK_PSN_ERROR; - /* Use the expected PSN. */ - qp->s_ack_psn = qp->r_psn; - goto resched; + if (qp->r_ack_state < OP(COMPARE_SWAP)) { + qp->r_ack_state = OP(SEND_ONLY); + qp->r_nak_state = IB_NAK_PSN_ERROR; + /* Use the expected PSN. */ + qp->r_ack_psn = qp->r_psn; + } + goto send_ack; } /* @@ -1151,33 +1189,29 @@ static inline int ipath_rc_rcv_error(str * send the earliest so that RDMA reads can be restarted at * the requester's expected PSN. 
*/ - spin_lock(&qp->s_lock); - if (qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE && - ipath_cmp24(psn, qp->s_ack_psn) >= 0) { - if (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) - qp->s_ack_psn = psn; - spin_unlock(&qp->s_lock); - goto done; - } - switch (opcode) { - case OP(RDMA_READ_REQUEST): - /* - * We have to be careful to not change s_rdma_sge - * while ipath_do_rc_send() is using it and not - * holding the s_lock. - */ - if (qp->s_ack_state != OP(ACKNOWLEDGE) && - qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { - spin_unlock(&qp->s_lock); - dev->n_rdma_dup_busy++; - goto done; - } + if (opcode == OP(RDMA_READ_REQUEST)) { /* RETH comes after BTH */ if (!header_in_data) reth = &ohdr->u.rc.reth; else { reth = (struct ib_reth *)data; data += sizeof(*reth); + } + /* + * If we receive a duplicate RDMA request, it means the + * requester saw a sequence error and needs to restart + * from an earlier point. We can abort the current + * RDMA read send in that case. + */ + spin_lock_irq(&qp->s_lock); + if (qp->s_ack_state != OP(ACKNOWLEDGE) && + (qp->s_hdrwords || ipath_cmp24(psn, qp->s_ack_psn) >= 0)) { + /* + * We are already sending earlier requested data. + * Don't abort it to send later out of sequence data. 
+ */ + spin_unlock_irq(&qp->s_lock); + goto done; } qp->s_rdma_len = be32_to_cpu(reth->length); if (qp->s_rdma_len != 0) { @@ -1192,8 +1226,10 @@ static inline int ipath_rc_rcv_error(str ok = ipath_rkey_ok(dev, &qp->s_rdma_sge, qp->s_rdma_len, vaddr, rkey, IB_ACCESS_REMOTE_READ); - if (unlikely(!ok)) + if (unlikely(!ok)) { + spin_unlock_irq(&qp->s_lock); goto done; + } } else { qp->s_rdma_sge.sg_list = NULL; qp->s_rdma_sge.num_sge = 0; @@ -1202,25 +1238,44 @@ static inline int ipath_rc_rcv_error(str qp->s_rdma_sge.sge.length = 0; qp->s_rdma_sge.sge.sge_length = 0; } - break; - + qp->s_ack_state = opcode; + qp->s_ack_psn = psn; + spin_unlock_irq(&qp->s_lock); + tasklet_hi_schedule(&qp->s_task); + goto send_ack; + } + + /* + * A pending RDMA read will ACK anything before it so + * ignore earlier duplicate requests. + */ + if (qp->s_ack_state != OP(ACKNOWLEDGE)) + goto done; + + /* + * If an ACK is pending, don't replace the pending ACK + * with an earlier one since the later one will ACK the earlier. + * Also, if we already have a pending atomic, send it. + */ + if (qp->r_ack_state != OP(ACKNOWLEDGE) && + (ipath_cmp24(psn, qp->r_ack_psn) <= 0 || + qp->r_ack_state >= OP(COMPARE_SWAP))) + goto send_ack; + switch (opcode) { case OP(COMPARE_SWAP): case OP(FETCH_ADD): /* * Check for the PSN of the last atomic operation * performed and resend the result if found. 
*/ - if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) { - spin_unlock(&qp->s_lock); + if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) goto done; - } - qp->s_ack_atomic = qp->r_atomic_data; break; } - qp->s_ack_state = opcode; - qp->s_nak_state = 0; - qp->s_ack_psn = psn; -resched: + qp->r_ack_state = opcode; + qp->r_nak_state = 0; + qp->r_ack_psn = psn; +send_ack: return 0; done: @@ -1248,7 +1303,6 @@ void ipath_rc_rcv(struct ipath_ibdev *de u32 hdrsize; u32 psn; u32 pad; - unsigned long flags; struct ib_wc wc; u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); int diff; @@ -1289,10 +1343,8 @@ void ipath_rc_rcv(struct ipath_ibdev *de opcode <= OP(ATOMIC_ACKNOWLEDGE)) { ipath_rc_rcv_resp(dev, ohdr, data, tlen, qp, opcode, psn, hdrsize, pmtu, header_in_data); - goto bail; - } - - spin_lock_irqsave(&qp->r_rq.lock, flags); + goto done; + } /* Compute 24 bits worth of difference. */ diff = ipath_cmp24(psn, qp->r_psn); @@ -1300,7 +1352,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de if (ipath_rc_rcv_error(dev, ohdr, data, qp, opcode, psn, diff, header_in_data)) goto done; - goto resched; + goto send_ack; } /* Check for opcode sequence errors. */ @@ -1312,22 +1364,19 @@ void ipath_rc_rcv(struct ipath_ibdev *de opcode == OP(SEND_LAST_WITH_IMMEDIATE)) break; nack_inv: - /* - * A NAK will ACK earlier sends and RDMA writes. Don't queue the - * NAK if a RDMA read, atomic, or NAK is pending though. - */ - spin_lock(&qp->s_lock); - if (qp->s_ack_state >= OP(RDMA_READ_REQUEST) && - qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { - spin_unlock(&qp->s_lock); - goto done; - } - /* XXX Flush WQEs */ - qp->state = IB_QPS_ERR; - qp->s_ack_state = OP(SEND_ONLY); - qp->s_nak_state = IB_NAK_INVALID_REQUEST; - qp->s_ack_psn = qp->r_psn; - goto resched; + /* + * A NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read, atomic, or NAK + * is pending though. 
+ */ + if (qp->r_ack_state >= OP(COMPARE_SWAP)) + goto send_ack; + /* XXX Flush WQEs */ + qp->state = IB_QPS_ERR; + qp->r_ack_state = OP(SEND_ONLY); + qp->r_nak_state = IB_NAK_INVALID_REQUEST; + qp->r_ack_psn = qp->r_psn; + goto send_ack; case OP(RDMA_WRITE_FIRST): case OP(RDMA_WRITE_MIDDLE): @@ -1336,20 +1385,6 @@ void ipath_rc_rcv(struct ipath_ibdev *de opcode == OP(RDMA_WRITE_LAST_WITH_IMMEDIATE)) break; goto nack_inv; - - case OP(RDMA_READ_REQUEST): - case OP(COMPARE_SWAP): - case OP(FETCH_ADD): - /* - * Drop all new requests until a response has been sent. A - * new request then ACKs the RDMA response we sent. Relaxed - * ordering would allow new requests to be processed but we - * would need to keep a queue of rwqe's for all that are in - * progress. Note that we can't RNR NAK this request since - * the RDMA READ or atomic response is already queued to be - * sent (unless we implement a response send queue). - */ - goto done; default: if (opcode == OP(SEND_MIDDLE) || @@ -1359,6 +1394,11 @@ void ipath_rc_rcv(struct ipath_ibdev *de opcode == OP(RDMA_WRITE_LAST) || opcode == OP(RDMA_WRITE_LAST_WITH_IMMEDIATE)) goto nack_inv; + /* + * Note that it is up to the requester to not send a new + * RDMA read or atomic operation before receiving an ACK + * for the previous operation. + */ break; } @@ -1375,17 +1415,12 @@ void ipath_rc_rcv(struct ipath_ibdev *de * Don't queue the NAK if a RDMA read or atomic * is pending though. 
*/ - spin_lock(&qp->s_lock); - if (qp->s_ack_state >= - OP(RDMA_READ_REQUEST) && - qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { - spin_unlock(&qp->s_lock); - goto done; - } - qp->s_ack_state = OP(SEND_ONLY); - qp->s_nak_state = IB_RNR_NAK | qp->s_min_rnr_timer; - qp->s_ack_psn = qp->r_psn; - goto resched; + if (qp->r_ack_state >= OP(COMPARE_SWAP)) + goto send_ack; + qp->r_ack_state = OP(SEND_ONLY); + qp->r_nak_state = IB_RNR_NAK | qp->r_min_rnr_timer; + qp->r_ack_psn = qp->r_psn; + goto send_ack; } qp->r_rcv_len = 0; /* FALLTHROUGH */ @@ -1442,7 +1477,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de if (unlikely(wc.byte_len > qp->r_len)) goto nack_inv; ipath_copy_sge(&qp->r_sge, data, tlen); - atomic_inc(&qp->msn); + qp->r_msn++; if (opcode == OP(RDMA_WRITE_LAST) || opcode == OP(RDMA_WRITE_ONLY)) break; @@ -1486,29 +1521,8 @@ void ipath_rc_rcv(struct ipath_ibdev *de ok = ipath_rkey_ok(dev, &qp->r_sge, qp->r_len, vaddr, rkey, IB_ACCESS_REMOTE_WRITE); - if (unlikely(!ok)) { - nack_acc: - /* - * A NAK will ACK earlier sends and RDMA - * writes. Don't queue the NAK if a RDMA - * read, atomic, or NAK is pending though. 
- */ - spin_lock(&qp->s_lock); - if (qp->s_ack_state >= - OP(RDMA_READ_REQUEST) && - qp->s_ack_state != - IB_OPCODE_ACKNOWLEDGE) { - spin_unlock(&qp->s_lock); - goto done; - } - /* XXX Flush WQEs */ - qp->state = IB_QPS_ERR; - qp->s_ack_state = OP(RDMA_WRITE_ONLY); - qp->s_nak_state = - IB_NAK_REMOTE_ACCESS_ERROR; - qp->s_ack_psn = qp->r_psn; - goto resched; - } + if (unlikely(!ok)) + goto nack_acc; } else { qp->r_sge.sg_list = NULL; qp->r_sge.sge.mr = NULL; @@ -1535,12 +1549,10 @@ void ipath_rc_rcv(struct ipath_ibdev *de reth = (struct ib_reth *)data; data += sizeof(*reth); } - spin_lock(&qp->s_lock); - if (qp->s_ack_state != OP(ACKNOWLEDGE) && - qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { - spin_unlock(&qp->s_lock); - goto done; - } + if (unlikely(!(qp->qp_access_flags & + IB_ACCESS_REMOTE_READ))) + goto nack_acc; + spin_lock_irq(&qp->s_lock); qp->s_rdma_len = be32_to_cpu(reth->length); if (qp->s_rdma_len != 0) { u32 rkey = be32_to_cpu(reth->rkey); @@ -1552,7 +1564,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de qp->s_rdma_len, vaddr, rkey, IB_ACCESS_REMOTE_READ); if (unlikely(!ok)) { - spin_unlock(&qp->s_lock); + spin_unlock_irq(&qp->s_lock); goto nack_acc; } /* @@ -1569,21 +1581,25 @@ void ipath_rc_rcv(struct ipath_ibdev *de qp->s_rdma_sge.sge.length = 0; qp->s_rdma_sge.sge.sge_length = 0; } - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_READ))) - goto nack_acc; /* * We need to increment the MSN here instead of when we * finish sending the result since a duplicate request would * increment it more than once. */ - atomic_inc(&qp->msn); + qp->r_msn++; + qp->s_ack_state = opcode; - qp->s_nak_state = 0; qp->s_ack_psn = psn; + spin_unlock_irq(&qp->s_lock); + qp->r_psn++; qp->r_state = opcode; - goto rdmadone; + qp->r_nak_state = 0; + + /* Call ipath_do_rc_send() in another thread. 
*/ + tasklet_hi_schedule(&qp->s_task); + + goto done; case OP(COMPARE_SWAP): case OP(FETCH_ADD): { @@ -1612,7 +1628,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de goto nack_acc; /* Perform atomic OP and save result. */ sdata = be64_to_cpu(ateth->swap_data); - spin_lock(&dev->pending_lock); + spin_lock_irq(&dev->pending_lock); qp->r_atomic_data = *(u64 *) qp->r_sge.sge.vaddr; if (opcode == OP(FETCH_ADD)) *(u64 *) qp->r_sge.sge.vaddr = @@ -1620,8 +1636,8 @@ void ipath_rc_rcv(struct ipath_ibdev *de else if (qp->r_atomic_data == be64_to_cpu(ateth->compare_data)) *(u64 *) qp->r_sge.sge.vaddr = sdata; - spin_unlock(&dev->pending_lock); - atomic_inc(&qp->msn); + spin_unlock_irq(&dev->pending_lock); + qp->r_msn++; qp->r_atomic_psn = psn & IPS_PSN_MASK; psn |= 1 << 31; break; @@ -1633,44 +1649,39 @@ void ipath_rc_rcv(struct ipath_ibdev *de } qp->r_psn++; qp->r_state = opcode; + qp->r_nak_state = 0; /* Send an ACK if requested or required. */ if (psn & (1 << 31)) { /* * Coalesce ACKs unless there is a RDMA READ or * ATOMIC pending. */ - spin_lock(&qp->s_lock); - if (qp->s_ack_state == OP(ACKNOWLEDGE) || - qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) { - qp->s_ack_state = opcode; - qp->s_nak_state = 0; - qp->s_ack_psn = psn; - qp->s_ack_atomic = qp->r_atomic_data; - goto resched; - } - spin_unlock(&qp->s_lock); - } + if (qp->r_ack_state < OP(COMPARE_SWAP)) { + qp->r_ack_state = opcode; + qp->r_ack_psn = psn; + } + goto send_ack; + } + goto done; + +nack_acc: + /* + * A NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read, atomic, or NAK + * is pending though. + */ + if (qp->r_ack_state < OP(COMPARE_SWAP)) { + /* XXX Flush WQEs */ + qp->state = IB_QPS_ERR; + qp->r_ack_state = OP(RDMA_WRITE_ONLY); + qp->r_nak_state = IB_NAK_REMOTE_ACCESS_ERROR; + qp->r_ack_psn = qp->r_psn; + } +send_ack: + /* Send ACK right away unless the send tasklet has a pending ACK. 
*/ + if (qp->s_ack_state == OP(ACKNOWLEDGE)) + send_rc_ack(qp); + done: - spin_unlock_irqrestore(&qp->r_rq.lock, flags); - goto bail; - -resched: - /* - * Try to send ACK right away but not if ipath_do_rc_send() is - * active. - */ - if (qp->s_hdrwords == 0 && - (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST || - qp->s_ack_state >= IB_OPCODE_COMPARE_SWAP)) - send_rc_ack(qp); - -rdmadone: - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); - - /* Call ipath_do_rc_send() in another thread. */ - tasklet_hi_schedule(&qp->s_task); - -bail: return; } diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:26 2006 -0700 @@ -113,20 +113,23 @@ void ipath_insert_rnr_queue(struct ipath * * Return 0 if no RWQE is available, otherwise return 1. * - * Called at interrupt level with the QP r_rq.lock held. + * Can be called from interrupt level. 
*/ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) { + unsigned long flags; struct ipath_rq *rq; struct ipath_srq *srq; struct ipath_rwqe *wqe; - int ret; + int ret = 1; if (!qp->ibqp.srq) { rq = &qp->r_rq; + spin_lock_irqsave(&rq->lock, flags); + if (unlikely(rq->tail == rq->head)) { ret = 0; - goto bail; + goto done; } wqe = get_rwqe_ptr(rq, rq->tail); qp->r_wr_id = wqe->wr_id; @@ -138,17 +141,16 @@ int ipath_get_rwqe(struct ipath_qp *qp, } if (++rq->tail >= rq->size) rq->tail = 0; - ret = 1; - goto bail; + goto done; } srq = to_isrq(qp->ibqp.srq); rq = &srq->rq; - spin_lock(&rq->lock); + spin_lock_irqsave(&rq->lock, flags); + if (unlikely(rq->tail == rq->head)) { - spin_unlock(&rq->lock); ret = 0; - goto bail; + goto done; } wqe = get_rwqe_ptr(rq, rq->tail); qp->r_wr_id = wqe->wr_id; @@ -170,18 +172,18 @@ int ipath_get_rwqe(struct ipath_qp *qp, n = rq->head - rq->tail; if (n < srq->limit) { srq->limit = 0; - spin_unlock(&rq->lock); + spin_unlock_irqrestore(&rq->lock, flags); ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); - } else - spin_unlock(&rq->lock); - } else - spin_unlock(&rq->lock); - ret = 1; - + goto bail; + } + } + +done: + spin_unlock_irqrestore(&rq->lock, flags); bail: return ret; } @@ -248,10 +250,8 @@ again: wc.imm_data = wqe->wr.imm_data; /* FALLTHROUGH */ case IB_WR_SEND: - spin_lock_irqsave(&qp->r_rq.lock, flags); if (!ipath_get_rwqe(qp, 0)) { rnr_nak: - spin_unlock_irqrestore(&qp->r_rq.lock, flags); /* Handle RNR NAK */ if (qp->ibqp.qp_type == IB_QPT_UC) goto send_comp; @@ -263,20 +263,17 @@ again: sqp->s_rnr_retry--; dev->n_rnr_naks++; sqp->s_rnr_timeout = - ib_ipath_rnr_table[sqp->s_min_rnr_timer]; + ib_ipath_rnr_table[sqp->r_min_rnr_timer]; ipath_insert_rnr_queue(sqp); goto done; } - spin_unlock_irqrestore(&qp->r_rq.lock, flags); break; case IB_WR_RDMA_WRITE_WITH_IMM: wc.wc_flags = IB_WC_WITH_IMM; wc.imm_data = wqe->wr.imm_data; 
- spin_lock_irqsave(&qp->r_rq.lock, flags); if (!ipath_get_rwqe(qp, 1)) goto rnr_nak; - spin_unlock_irqrestore(&qp->r_rq.lock, flags); /* FALLTHROUGH */ case IB_WR_RDMA_WRITE: if (wqe->length == 0) diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_uc.c --- a/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:26 2006 -0700 @@ -241,7 +241,6 @@ void ipath_uc_rcv(struct ipath_ibdev *de u32 hdrsize; u32 psn; u32 pad; - unsigned long flags; struct ib_wc wc; u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); struct ib_reth *reth; @@ -279,8 +278,6 @@ void ipath_uc_rcv(struct ipath_ibdev *de wc.imm_data = 0; wc.wc_flags = 0; - spin_lock_irqsave(&qp->r_rq.lock, flags); - /* Compare the PSN verses the expected PSN. */ if (unlikely(ipath_cmp24(psn, qp->r_psn) != 0)) { /* @@ -537,15 +534,11 @@ void ipath_uc_rcv(struct ipath_ibdev *de default: /* Drop packet for unknown opcodes. */ - spin_unlock_irqrestore(&qp->r_rq.lock, flags); dev->n_pkt_drops++; - goto bail; + goto done; } qp->r_psn++; qp->r_state = opcode; done: - spin_unlock_irqrestore(&qp->r_rq.lock, flags); - -bail: return; } diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 @@ -307,32 +307,34 @@ struct ipath_qp { u32 s_next_psn; /* PSN for next request */ u32 s_last_psn; /* last response PSN processed */ u32 s_psn; /* current packet sequence number */ + u32 s_ack_psn; /* PSN for RDMA_READ */ u32 s_rnr_timeout; /* number of milliseconds for RNR timeout */ - u32 s_ack_psn; /* PSN for next ACK or RDMA_READ */ - u64 s_ack_atomic; /* data for atomic ACK */ + u32 r_ack_psn; /* PSN for next ACK or atomic ACK */ u64 r_wr_id; /* ID for current receive WQE */ u64 r_atomic_data; /* data for last atomic op */ u32 r_atomic_psn; /* PSN of last atomic op */ 
 	u32 r_len;		/* total length of r_sge */
 	u32 r_rcv_len;		/* receive data len processed */
 	u32 r_psn;		/* expected rcv packet sequence number */
+	u32 r_msn;		/* message sequence number */
 	u8 state;		/* QP state */
 	u8 s_state;		/* opcode of last packet sent */
 	u8 s_ack_state;		/* opcode of packet to ACK */
 	u8 s_nak_state;		/* non-zero if NAK is pending */
 	u8 r_state;		/* opcode of last packet received */
+	u8 r_ack_state;		/* opcode of packet to ACK */
+	u8 r_nak_state;		/* non-zero if NAK is pending */
+	u8 r_min_rnr_timer;	/* retry timeout value for RNR NAKs */
 	u8 r_reuse_sge;		/* for UC receive errors */
 	u8 r_sge_inx;		/* current index into sg_list */
+	u8 qp_access_flags;
 	u8 s_max_sge;		/* size of s_wq->sg_list */
-	u8 qp_access_flags;
 	u8 s_retry_cnt;		/* number of times to retry */
 	u8 s_rnr_retry_cnt;
-	u8 s_min_rnr_timer;
 	u8 s_retry;		/* requester retry counter */
 	u8 s_rnr_retry;		/* requester RNR retry counter */
 	u8 s_pkey_index;	/* PKEY index to use */
 	enum ib_mtu path_mtu;
-	atomic_t msn;		/* message sequence number */
 	u32 remote_qpn;
 	u32 qkey;		/* QKEY for this QP (for UD or RD) */
 	u32 s_size;		/* send work queue size */

From bos at pathscale.com  Thu Jun 29 14:41:27 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:27 -0700
Subject: [openib-general] [PATCH 36 of 39] IB/ipath - Ignore receive queue size if SRQ is specified
In-Reply-To: 
Message-ID: <31c382d8210a80c37278.1151617287@eng-12.pathscale.com>

According to the IB spec, the receive work queue size should be ignored
if the QP is created to use a shared receive queue.
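The sizing rule this patch implements can be sketched in a few lines of userspace C. This is a simplified, hypothetical model (not the driver's real API): when the consumer attaches a shared receive queue, the QP allocates no receive queue of its own and the requested size is ignored; otherwise one extra slot is reserved so a full ring buffer can be distinguished from an empty one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of the QP creation attributes that
 * matter for receive-queue sizing. */
struct qp_caps {
	int max_recv_wr;	/* requested receive work requests */
	int max_recv_sge;	/* requested scatter/gather entries */
	int has_srq;		/* non-zero if an SRQ was supplied */
};

/* Number of receive WQE slots the QP itself must allocate.  With an
 * SRQ, the SRQ owns the receive queue, so the answer is zero; without
 * one, allocate one extra slot so head == tail can mean "empty" while
 * a completely full ring remains representable. */
int qp_recv_queue_size(const struct qp_caps *caps)
{
	if (caps->has_srq)
		return 0;
	return caps->max_recv_wr + 1;
}
```

The `+ 1` mirrors the driver's `init_attr->cap.max_recv_wr + 1` in `ipath_create_qp()`; the SRQ branch mirrors the `if (init_attr->srq)` case the patch adds.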
Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 9b423c45af8b -r 31c382d8210a drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 @@ -685,16 +685,22 @@ struct ib_qp *ipath_create_qp(struct ib_ ret = ERR_PTR(-ENOMEM); goto bail; } - qp->r_rq.size = init_attr->cap.max_recv_wr + 1; - sz = sizeof(struct ipath_sge) * - init_attr->cap.max_recv_sge + - sizeof(struct ipath_rwqe); - qp->r_rq.wq = vmalloc(qp->r_rq.size * sz); - if (!qp->r_rq.wq) { - kfree(qp); - vfree(swq); - ret = ERR_PTR(-ENOMEM); - goto bail; + if (init_attr->srq) { + qp->r_rq.size = 0; + qp->r_rq.max_sge = 0; + qp->r_rq.wq = NULL; + } else { + qp->r_rq.size = init_attr->cap.max_recv_wr + 1; + qp->r_rq.max_sge = init_attr->cap.max_recv_sge; + sz = (sizeof(struct ipath_sge) * qp->r_rq.max_sge) + + sizeof(struct ipath_rwqe); + qp->r_rq.wq = vmalloc(qp->r_rq.size * sz); + if (!qp->r_rq.wq) { + kfree(qp); + vfree(swq); + ret = ERR_PTR(-ENOMEM); + goto bail; + } } /* @@ -713,7 +719,6 @@ struct ib_qp *ipath_create_qp(struct ib_ qp->s_wq = swq; qp->s_size = init_attr->cap.max_send_wr + 1; qp->s_max_sge = init_attr->cap.max_send_sge; - qp->r_rq.max_sge = init_attr->cap.max_recv_sge; qp->s_flags = init_attr->sq_sig_type == IB_SIGNAL_REQ_WR ? 
 		1 << IPATH_S_SIGNAL_REQ_WR : 0;
 	dev = to_idev(ibpd->device);

From bos at pathscale.com  Thu Jun 29 14:41:26 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:26 -0700
Subject: [openib-general] [PATCH 35 of 39] IB/ipath - remove some #if 0 code related to lockable memory
In-Reply-To: 
Message-ID: <9b423c45af8b2eb98562.1151617286@eng-12.pathscale.com>

Signed-off-by: Dave Olson
Signed-off-by: Bryan O'Sullivan

diff -r b6ebaf2dd2fd -r 9b423c45af8b drivers/infiniband/hw/ipath/ipath_user_pages.c
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c	Thu Jun 29 14:33:26 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c	Thu Jun 29 14:33:26 2006 -0700
@@ -57,17 +57,6 @@ static int __get_user_pages(unsigned lon
 	unsigned long lock_limit;
 	size_t got;
 	int ret;
-
-#if 0
-	/*
-	 * XXX - causes MPI programs to fail, haven't had time to check
-	 * yet
-	 */
-	if (!capable(CAP_IPC_LOCK)) {
-		ret = -EPERM;
-		goto bail;
-	}
-#endif
 
 	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >>
 		PAGE_SHIFT;

From bos at pathscale.com  Thu Jun 29 14:41:29 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:29 -0700
Subject: [openib-general] [PATCH 38 of 39] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems
In-Reply-To: 
Message-ID: 

The ordering of writethrough store buffers needs to be forced, and we
need an #ifdef to get writethrough behavior for the InfiniPath buffers,
because there is currently no generic way to specify that (similar to
code in char/drm/drm_vm.c and block/z2ram.c).
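Besides the #ifdef, this patch's new per-architecture file relies on the linker's weak-symbol override: generic code supplies a weak default, and the PowerPC-only object, when the Makefile links it in, replaces it with a strong definition. A minimal userspace sketch of that mechanism (assuming GCC/clang on an ELF target; this is not the kernel build itself):

```c
/* Weak default for ipath_unordered_wc(): on most architectures write
 * combining preserves the ordering the driver needs, so the fallback
 * reports "ordered" (0).  A strong definition in an arch-specific
 * object file -- like the one ipath_wc_ppc64.c provides, returning 1 --
 * silently wins over this one at link time. */
int __attribute__((weak)) ipath_unordered_wc(void)
{
	return 0;
}
```

Linked on its own, the weak default is used; add an object containing a plain (strong) `ipath_unordered_wc` and the linker picks that instead, with no #ifdef at the call sites.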
Signed-off-by: John Gregor Signed-off-by: Bryan O'Sullivan diff -r 2a721e1f490b -r c22b6c244d5d drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Makefile Thu Jun 29 14:33:26 2006 -0700 @@ -20,6 +20,7 @@ ipath_core-y := \ ipath_user_pages.o ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o +ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o ib_ipath-y := \ ipath_cq.o \ diff -r 2a721e1f490b -r c22b6c244d5d drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -440,7 +440,13 @@ static int __devinit ipath_init_one(stru } dd->ipath_pcirev = rev; +#if defined(__powerpc__) + /* There isn't a generic way to specify writethrough mappings */ + dd->ipath_kregbase = __ioremap(addr, len, + (_PAGE_NO_CACHE|_PAGE_WRITETHRU)); +#else dd->ipath_kregbase = ioremap_nocache(addr, len); +#endif if (!dd->ipath_kregbase) { ipath_dbg("Unable to map io addr %llx to kvirt, failing\n", diff -r 2a721e1f490b -r c22b6c244d5d drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 @@ -985,6 +985,13 @@ static int mmap_piobufs(struct vm_area_s * write combining behavior we want on the PIO buffers! 
*/ +#if defined(__powerpc__) + /* There isn't a generic way to specify writethrough mappings */ + pgprot_val(vma->vm_page_prot) |= _PAGE_NO_CACHE; + pgprot_val(vma->vm_page_prot) |= _PAGE_WRITETHRU; + pgprot_val(vma->vm_page_prot) &= ~_PAGE_GUARDED; +#endif + if (vma->vm_flags & VM_READ) { dev_info(&dd->pcidev->dev, "Can't map piobufs as readable (flags=%lx)\n", diff -r 2a721e1f490b -r c22b6c244d5d drivers/infiniband/hw/ipath/ipath_wc_ppc64.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_wc_ppc64.c Thu Jun 29 14:33:26 2006 -0700 @@ -0,0 +1,52 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+
+/*
+ * This file is conditionally built on PowerPC only.  Otherwise weak symbol
+ * versions of the functions exported from here are used.
+ */
+
+#include "ipath_kernel.h"
+
+/**
+ * ipath_unordered_wc - indicate whether write combining is ordered
+ *
+ * PowerPC systems (at least those in the 970 processor family)
+ * write partially filled store buffers in address order, but will write
+ * completely filled store buffers in "random" order, and therefore must
+ * have serialization for correctness with current InfiniPath chips.
+ */
+int ipath_unordered_wc(void)
+{
+	return 1;
+}

From bos at pathscale.com  Thu Jun 29 14:41:30 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:30 -0700
Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss
In-Reply-To: 
Message-ID: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com>

When a large incoming RDMA is being received, we have to copy data
inside the interrupt handler before we can ACK each packet.  The source
is DMAed to by the hardware, which means the CPU won't have it cached,
and we read it only this once; using normal load instructions pollutes
the dcache with useless data, reducing performance to the point where we
can lose a significant number of packets.  By using a (memcpy-compatible)
copy routine that loads with streaming instructions, we avoid filling
the dcache with data we will never read again.  Avoiding the cache
refill penalty lets us keep up better with the sender, resulting in many
fewer dropped packets.  We use normal stores to the destination, because
the copied-to data will be used soon after the interrupt handler
completes.
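The shape of the copy routine this description calls for can be sketched portably in C. This is an illustrative userspace analog, not the driver's hand-written x86-64 assembly: it copies in 64-byte blocks with a low-temporal-locality software prefetch on the source (standing in for `prefetchnta`), mops up a 0..63 byte tail, and returns the destination pointer so it stays memcpy-compatible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of a streaming-flavored, memcpy-compatible copy.  The third
 * argument of __builtin_prefetch is the temporal-locality hint; 0 asks
 * the CPU to avoid keeping the fetched source line in cache, which is
 * the point of the real ipath_memcpy_nc: the source is read exactly
 * once, so caching it only evicts useful data. */
void *memcpy_stream_sketch(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	while (n >= 64) {
		/* Prefetch well ahead of the current read position.
		 * A prefetch is only a hint, so reading "past the end"
		 * of the source buffer here is harmless. */
		__builtin_prefetch(s + 128, 0, 0);
		memcpy(d, s, 64);	/* compilers expand this inline */
		d += 64;
		s += 64;
		n -= 64;
	}
	if (n)
		memcpy(d, s, n);	/* 0..63 byte tail */
	return dest;
}
```

The real routine also ends its streaming loop with `sfence` and uses a jump table for tiny sizes; those are omitted here since plain C stores need no explicit fence.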
Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r c22b6c244d5d -r 1b00209ef20a drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Makefile Thu Jun 29 14:33:26 2006 -0700 @@ -35,3 +35,5 @@ ib_ipath-y := \ ipath_ud.o \ ipath_verbs.o \ ipath_verbs_mcast.o + +ib_ipath-$(CONFIG_X86_64) += ipath_memcpy_x86_64.o diff -r c22b6c244d5d -r 1b00209ef20a drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 @@ -159,7 +159,7 @@ void ipath_copy_sge(struct ipath_sge_sta BUG_ON(len == 0); if (len > length) len = length; - memcpy(sge->vaddr, data, len); + ipath_memcpy_nc(sge->vaddr, data, len); sge->vaddr += len; sge->length -= len; sge->sge_length -= len; diff -r c22b6c244d5d -r 1b00209ef20a drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 @@ -728,4 +728,14 @@ extern unsigned int ib_ipath_max_srq_wrs extern const u32 ib_ipath_rnr_table[]; +/* + * Copy data. Try not to pollute the dcache with the source data, + * because we won't be reading it again. + */ +#if defined(CONFIG_X86_64) +void *ipath_memcpy_nc(void *dest, const void *src, size_t n); +#else +#define ipath_memcpy_nc(dest, src, n) memcpy(dest, src, n) +#endif + #endif /* IPATH_VERBS_H */ diff -r c22b6c244d5d -r 1b00209ef20a drivers/infiniband/hw/ipath/ipath_memcpy_x86_64.S --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_memcpy_x86_64.S Thu Jun 29 14:33:26 2006 -0700 @@ -0,0 +1,157 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +/* + * ipath_memcpy_nc - memcpy-compatible copy routine, using streaming loads + * @dest: destination address + * @src: source address + * @count: number of bytes to copy + * + * Use streaming loads and normal stores for a special-case copy where + * we know we won't be reading the source again, but will be reading the + * destination again soon. 
+ */ + .text + .p2align 4,,15 + /* rdi destination, rsi source, rdx count */ + .globl ipath_memcpy_nc + .type ipath_memcpy_nc, @function +ipath_memcpy_nc: + movq %rdi, %rax +.L5: + cmpq $15, %rdx + ja .L34 +.L3: + cmpl $8, %edx /* rdx is 0..15 */ + jbe .L9 +.L6: + testb $8, %dxl /* rdx is 3,5,6,7,9..15 */ + je .L13 + movq (%rsi), %rcx + addq $8, %rsi + movq %rcx, (%rdi) + addq $8, %rdi +.L13: + testb $4, %dxl + je .L15 + movl (%rsi), %ecx + addq $4, %rsi + movl %ecx, (%rdi) + addq $4, %rdi +.L15: + testb $2, %dxl + je .L17 + movzwl (%rsi), %ecx + addq $2, %rsi + movw %cx, (%rdi) + addq $2, %rdi +.L17: + testb $1, %dxl + je .L33 +.L1: + movzbl (%rsi), %ecx + movb %cl, (%rdi) +.L33: + ret +.L34: + cmpq $63, %rdx /* rdx is > 15 */ + ja .L64 + movl $16, %ecx /* rdx is 16..63 */ +.L25: + movq 8(%rsi), %r8 + movq (%rsi), %r9 + addq %rcx, %rsi + movq %r8, 8(%rdi) + movq %r9, (%rdi) + addq %rcx, %rdi + subq %rcx, %rdx + cmpl %edx, %ecx /* is rdx >= 16? */ + jbe .L25 + jmp .L3 /* rdx is 0..15 */ + .p2align 4,,7 +.L64: + movl $64, %ecx +.L42: + prefetchnta 128(%rsi) + movq (%rsi), %r8 + movq 8(%rsi), %r9 + movq 16(%rsi), %r10 + movq 24(%rsi), %r11 + subq %rcx, %rdx + movq %r8, (%rdi) + movq 32(%rsi), %r8 + movq %r9, 8(%rdi) + movq 40(%rsi), %r9 + movq %r10, 16(%rdi) + movq 48(%rsi), %r10 + movq %r11, 24(%rdi) + movq 56(%rsi), %r11 + addq %rcx, %rsi + movq %r8, 32(%rdi) + movq %r9, 40(%rdi) + movq %r10, 48(%rdi) + movq %r11, 56(%rdi) + addq %rcx, %rdi + cmpq %rdx, %rcx /* is rdx >= 64? 
*/ + jbe .L42 + sfence + orl %edx, %edx + je .L33 + jmp .L5 +.L9: + jmp *.L12(,%rdx,8) /* rdx is 0..8 */ + .section .rodata + .align 8 + .align 4 +.L12: + .quad .L33 + .quad .L1 + .quad .L2 + .quad .L6 + .quad .L4 + .quad .L6 + .quad .L6 + .quad .L6 + .quad .L8 + .text +.L2: + movzwl (%rsi), %ecx + movw %cx, (%rdi) + ret +.L4: + movl (%rsi), %ecx + movl %ecx, (%rdi) + ret +.L8: + movq (%rsi), %rcx + movq %rcx, (%rdi) + ret From bos at pathscale.com Thu Jun 29 14:41:28 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:28 -0700 Subject: [openib-general] [PATCH 37 of 39] IB/ipath - namespace cleanup: replace ips with ipath In-Reply-To: Message-ID: <2a721e1f490b74df3737.1151617288@eng-12.pathscale.com> Remove ips namespace from infinipath drivers. This renames ips_common.h to ipath_common.h. Definitions, data structures, etc. that were not used by kernel modules have moved to user-only headers. All names including ips have been renamed to ipath. Some names have had an ipath prefix added. Signed-off-by: Christian Bell Signed-off-by: Bryan O'Sullivan diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 @@ -39,7 +39,8 @@ * to communicate between kernel and user code. */ -/* This is the IEEE-assigned OUI for QLogic, Inc. InfiniPath */ + +/* This is the IEEE-assigned OUI for QLogic Inc. InfiniPath */ #define IPATH_SRC_OUI_1 0x00 #define IPATH_SRC_OUI_2 0x11 #define IPATH_SRC_OUI_3 0x75 @@ -343,9 +344,9 @@ struct ipath_base_info { /* * Similarly, this is the kernel version going back to the user. It's * slightly different, in that we want to tell if the driver was built as - * part of a QLogic release, or from the driver from OpenIB, kernel.org, - * or a standard distribution, for support reasons. 
The high bit is 0 for - * non-QLogic, and 1 for QLogic-built/supplied. + * part of a QLogic release, or from the driver from openfabrics.org, + * kernel.org, or a standard distribution, for support reasons. + * The high bit is 0 for non-QLogic and 1 for QLogic-built/supplied. * * It's returned by the driver to the user code during initialization in the * spi_sw_version field of ipath_base_info, so the user code can in turn @@ -600,14 +601,118 @@ struct infinipath_counters { #define INFINIPATH_KPF_INTR 0x1 /* SendPIO per-buffer control */ -#define INFINIPATH_SP_LENGTHP1_MASK 0x3FF -#define INFINIPATH_SP_LENGTHP1_SHIFT 0 -#define INFINIPATH_SP_INTR 0x80000000 -#define INFINIPATH_SP_TEST 0x40000000 -#define INFINIPATH_SP_TESTEBP 0x20000000 +#define INFINIPATH_SP_TEST 0x40 +#define INFINIPATH_SP_TESTEBP 0x20 /* SendPIOAvail bits */ #define INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT 1 #define INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT 0 +/* infinipath header format */ +struct ipath_header { + /* + * Version - 4 bits, Port - 4 bits, TID - 10 bits and Offset - + * 14 bits before ECO change ~28 Dec 03. After that, Vers 4, + * Port 3, TID 11, offset 14. + */ + __le32 ver_port_tid_offset; + __le16 chksum; + __le16 pkt_flags; +}; + +/* infinipath user message header format. + * This structure contains the first 4 fields common to all protocols + * that employ infinipath. 
+ */ +struct ipath_message_header { + __be16 lrh[4]; + __be32 bth[3]; + /* fields below this point are in host byte order */ + struct ipath_header iph; + __u8 sub_opcode; +}; + +/* infinipath ethernet header format */ +struct ether_header { + __be16 lrh[4]; + __be32 bth[3]; + struct ipath_header iph; + __u8 sub_opcode; + __u8 cmd; + __be16 lid; + __u16 mac[3]; + __u8 frag_num; + __u8 seq_num; + __le32 len; + /* MUST be of word size due to PIO write requirements */ + __le32 csum; + __le16 csum_offset; + __le16 flags; + __u16 first_2_bytes; + __u8 unused[2]; /* currently unused */ +}; + + +/* IB - LRH header consts */ +#define IPATH_LRH_GRH 0x0003 /* 1. word of IB LRH - next header: GRH */ +#define IPATH_LRH_BTH 0x0002 /* 1. word of IB LRH - next header: BTH */ + +/* misc. */ +#define SIZE_OF_CRC 1 + +#define IPATH_DEFAULT_P_KEY 0xFFFF +#define IPATH_PERMISSIVE_LID 0xFFFF +#define IPATH_AETH_CREDIT_SHIFT 24 +#define IPATH_AETH_CREDIT_MASK 0x1F +#define IPATH_AETH_CREDIT_INVAL 0x1F +#define IPATH_PSN_MASK 0xFFFFFF +#define IPATH_MSN_MASK 0xFFFFFF +#define IPATH_QPN_MASK 0xFFFFFF +#define IPATH_MULTICAST_LID_BASE 0xC000 +#define IPATH_MULTICAST_QPN 0xFFFFFF + +/* Receive Header Queue: receive type (from infinipath) */ +#define RCVHQ_RCV_TYPE_EXPECTED 0 +#define RCVHQ_RCV_TYPE_EAGER 1 +#define RCVHQ_RCV_TYPE_NON_KD 2 +#define RCVHQ_RCV_TYPE_ERROR 3 + + +/* sub OpCodes - ith4x */ +#define IPATH_ITH4X_OPCODE_ENCAP 0x81 +#define IPATH_ITH4X_OPCODE_LID_ARP 0x82 + +#define IPATH_HEADER_QUEUE_WORDS 9 + +/* functions for extracting fields from rcvhdrq entries for the driver. 
+ */ +static inline __u32 ipath_hdrget_err_flags(const __le32 * rbuf) +{ + return __le32_to_cpu(rbuf[1]); +} + +static inline __u32 ipath_hdrget_rcv_type(const __le32 * rbuf) +{ + return (__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_RCVTYPE_SHIFT) + & INFINIPATH_RHF_RCVTYPE_MASK; +} + +static inline __u32 ipath_hdrget_length_in_bytes(const __le32 * rbuf) +{ + return ((__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_LENGTH_SHIFT) + & INFINIPATH_RHF_LENGTH_MASK) << 2; +} + +static inline __u32 ipath_hdrget_index(const __le32 * rbuf) +{ + return (__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_EGRINDEX_SHIFT) + & INFINIPATH_RHF_EGRINDEX_MASK; +} + +static inline __u32 ipath_hdrget_ipath_ver(__le32 hdrword) +{ + return (__le32_to_cpu(hdrword) >> INFINIPATH_I_VERS_SHIFT) + & INFINIPATH_I_VERS_MASK; +} + #endif /* _IPATH_COMMON_H */ diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_diag.c --- a/drivers/infiniband/hw/ipath/ipath_diag.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_diag.c Thu Jun 29 14:33:26 2006 -0700 @@ -44,10 +44,9 @@ #include #include +#include "ipath_kernel.h" +#include "ipath_layer.h" #include "ipath_common.h" -#include "ipath_kernel.h" -#include "ips_common.h" -#include "ipath_layer.h" int ipath_diag_inuse; static int diag_set_link; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -39,8 +39,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" static void ipath_update_pio_bufs(struct ipath_devdata *); @@ -823,7 +823,8 @@ static void ipath_rcv_layer(struct ipath u8 pad, *bthbytes; struct sk_buff *skb, *nskb; - if (dd->ipath_port0_skbs && hdr->sub_opcode == OPCODE_ENCAP) { + if (dd->ipath_port0_skbs && + hdr->sub_opcode == IPATH_ITH4X_OPCODE_ENCAP) { /* * Allocate a new 
sk_buff to replace the one we give * to the network stack. @@ -854,7 +855,7 @@ static void ipath_rcv_layer(struct ipath /* another ether packet received */ ipath_stats.sps_ether_rpkts++; } - else if (hdr->sub_opcode == OPCODE_LID_ARP) + else if (hdr->sub_opcode == IPATH_ITH4X_OPCODE_LID_ARP) __ipath_layer_rcv_lid(dd, hdr); } @@ -871,7 +872,7 @@ void ipath_kreceive(struct ipath_devdata const u32 rsize = dd->ipath_rcvhdrentsize; /* words */ const u32 maxcnt = dd->ipath_rcvhdrcnt * rsize; /* words */ u32 etail = -1, l, hdrqtail; - struct ips_message_header *hdr; + struct ipath_message_header *hdr; u32 eflags, i, etype, tlen, pkttot = 0, updegr=0, reloop=0; static u64 totcalls; /* stats, may eventually remove */ char emsg[128]; @@ -897,7 +898,7 @@ reloop: u8 *bthbytes; rc = (u64 *) (dd->ipath_pd[0]->port_rcvhdrq + (l << 2)); - hdr = (struct ips_message_header *)&rc[1]; + hdr = (struct ipath_message_header *)&rc[1]; /* * could make a network order version of IPATH_KD_QP, and * do the obvious shift before masking to speed this up. @@ -905,10 +906,10 @@ reloop: qp = ntohl(hdr->bth[1]) & 0xffffff; bthbytes = (u8 *) hdr->bth; - eflags = ips_get_hdr_err_flags((__le32 *) rc); - etype = ips_get_rcv_type((__le32 *) rc); + eflags = ipath_hdrget_err_flags((__le32 *) rc); + etype = ipath_hdrget_rcv_type((__le32 *) rc); /* total length */ - tlen = ips_get_length_in_bytes((__le32 *) rc); + tlen = ipath_hdrget_length_in_bytes((__le32 *) rc); ebuf = NULL; if (etype != RCVHQ_RCV_TYPE_EXPECTED) { /* @@ -918,7 +919,7 @@ reloop: * set ebuf (so we try to copy data) unless the * length requires it. 
*/ - etail = ips_get_index((__le32 *) rc); + etail = ipath_hdrget_index((__le32 *) rc); if (tlen > sizeof(*hdr) || etype == RCVHQ_RCV_TYPE_NON_KD) ebuf = ipath_get_egrbuf(dd, etail, 0); @@ -930,7 +931,7 @@ reloop: */ if (etype != RCVHQ_RCV_TYPE_NON_KD && etype != - RCVHQ_RCV_TYPE_ERROR && ips_get_ipath_ver( + RCVHQ_RCV_TYPE_ERROR && ipath_hdrget_ipath_ver( hdr->iph.ver_port_tid_offset) != IPS_PROTO_VERSION) { ipath_cdbg(PKT, "Bad InfiniPath protocol version " @@ -943,7 +944,7 @@ reloop: ipath_cdbg(PKT, "RHFerrs %x hdrqtail=%x typ=%u " "tlen=%x opcode=%x egridx=%x: %s\n", eflags, l, etype, tlen, bthbytes[0], - ips_get_index((__le32 *) rc), emsg); + ipath_hdrget_index((__le32 *) rc), emsg); /* Count local link integrity errors. */ if (eflags & (INFINIPATH_RHF_H_ICRCERR | INFINIPATH_RHF_H_VCRCERR)) { diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 @@ -39,8 +39,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" static int ipath_open(struct inode *, struct file *); static int ipath_close(struct inode *, struct file *); @@ -458,7 +458,7 @@ static int ipath_set_part_key(struct ipa u16 lkey = key & 0x7FFF; int ret; - if (lkey == (IPS_DEFAULT_P_KEY & 0x7FFF)) { + if (lkey == (IPATH_DEFAULT_P_KEY & 0x7FFF)) { /* nothing to do; this key always valid */ ret = 0; goto bail; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 @@ -36,7 +36,7 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" +#include "ipath_common.h" /* * min buffers we want to have per port, after driver @@ -277,7 +277,7 @@ static int 
init_chip_first(struct ipath_ pd->port_port = 0; pd->port_cnt = 1; /* The port 0 pkey table is used by the layer interface. */ - pd->port_pkeys[0] = IPS_DEFAULT_P_KEY; + pd->port_pkeys[0] = IPATH_DEFAULT_P_KEY; dd->ipath_rcvtidcnt = ipath_read_kreg32(dd, dd->ipath_kregs->kr_rcvtidcnt); dd->ipath_rcvtidbase = diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -34,8 +34,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" /* These are all rcv-related errors which we want to count for stats */ #define E_SUM_PKTERRS \ diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 @@ -41,8 +41,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" /* Acquire before ipath_devs_lock. 
*/ static DEFINE_MUTEX(ipath_layer_mutex); @@ -622,7 +622,7 @@ int ipath_layer_open(struct ipath_devdat goto bail; } - ret = ipath_setrcvhdrsize(dd, NUM_OF_EXTRA_WORDS_IN_HEADER_QUEUE); + ret = ipath_setrcvhdrsize(dd, IPATH_HEADER_QUEUE_WORDS); if (ret < 0) goto bail; @@ -1106,10 +1106,10 @@ int ipath_layer_send_hdr(struct ipath_de } vlsllnh = *((__be16 *) hdr); - if (vlsllnh != htons(IPS_LRH_BTH)) { + if (vlsllnh != htons(IPATH_LRH_BTH)) { ipath_dbg("Warning: lrh[0] wrong (%x, not %x); " "not sending\n", be16_to_cpu(vlsllnh), - IPS_LRH_BTH); + IPATH_LRH_BTH); ret = -EINVAL; } if (ret) diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_mad.c --- a/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 @@ -35,7 +35,7 @@ #include "ipath_kernel.h" #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" #define IB_SMP_UNSUP_VERSION __constant_htons(0x0004) #define IB_SMP_UNSUP_METHOD __constant_htons(0x0008) @@ -306,7 +306,7 @@ static int recv_subn_set_portinfo(struct lid = be16_to_cpu(pip->lid); if (lid != ipath_layer_get_lid(dev->dd)) { /* Must be a valid unicast LID address. */ - if (lid == 0 || lid >= IPS_MULTICAST_LID_BASE) + if (lid == 0 || lid >= IPATH_MULTICAST_LID_BASE) goto err; ipath_set_lid(dev->dd, lid, pip->mkeyprot_resv_lmc & 7); event.event = IB_EVENT_LID_CHANGE; @@ -316,7 +316,7 @@ static int recv_subn_set_portinfo(struct smlid = be16_to_cpu(pip->sm_lid); if (smlid != dev->sm_lid) { /* Must be a valid unicast LID address. 
*/ - if (smlid == 0 || smlid >= IPS_MULTICAST_LID_BASE) + if (smlid == 0 || smlid >= IPATH_MULTICAST_LID_BASE) goto err; dev->sm_lid = smlid; event.event = IB_EVENT_SM_CHANGE; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 @@ -35,7 +35,7 @@ #include #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" #define BITS_PER_PAGE (PAGE_SIZE*BITS_PER_BYTE) #define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1) @@ -450,7 +450,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, if (attr_mask & IB_QP_AV) if (attr->ah_attr.dlid == 0 || - attr->ah_attr.dlid >= IPS_MULTICAST_LID_BASE) + attr->ah_attr.dlid >= IPATH_MULTICAST_LID_BASE) goto inval; if (attr_mask & IB_QP_PKEY_INDEX) @@ -585,14 +585,14 @@ int ipath_query_qp(struct ib_qp *ibqp, s */ __be32 ipath_compute_aeth(struct ipath_qp *qp) { - u32 aeth = qp->r_msn & IPS_MSN_MASK; + u32 aeth = qp->r_msn & IPATH_MSN_MASK; if (qp->ibqp.srq) { /* * Shared receive queues don't generate credits. * Set the credit field to the invalid value. */ - aeth |= IPS_AETH_CREDIT_INVAL << IPS_AETH_CREDIT_SHIFT; + aeth |= IPATH_AETH_CREDIT_INVAL << IPATH_AETH_CREDIT_SHIFT; } else { u32 min, max, x; u32 credits; @@ -622,7 +622,7 @@ __be32 ipath_compute_aeth(struct ipath_q else min = x; } - aeth |= x << IPS_AETH_CREDIT_SHIFT; + aeth |= x << IPATH_AETH_CREDIT_SHIFT; } return cpu_to_be32(aeth); } @@ -888,18 +888,18 @@ void ipath_sqerror_qp(struct ipath_qp *q */ void ipath_get_credit(struct ipath_qp *qp, u32 aeth) { - u32 credit = (aeth >> IPS_AETH_CREDIT_SHIFT) & IPS_AETH_CREDIT_MASK; + u32 credit = (aeth >> IPATH_AETH_CREDIT_SHIFT) & IPATH_AETH_CREDIT_MASK; /* * If the credit is invalid, we can send * as many packets as we like. Otherwise, we have to * honor the credit field. 
*/ - if (credit == IPS_AETH_CREDIT_INVAL) + if (credit == IPATH_AETH_CREDIT_INVAL) qp->s_lsn = (u32) -1; else if (qp->s_lsn != (u32) -1) { /* Compute new LSN (i.e., MSN + credit) */ - credit = (aeth + credit_table[credit]) & IPS_MSN_MASK; + credit = (aeth + credit_table[credit]) & IPATH_MSN_MASK; if (ipath_cmp24(credit, qp->s_lsn) > 0) qp->s_lsn = credit; } diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:26 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /* cut down ridiculously long IB macro names */ #define OP(x) IB_OPCODE_RC_##x @@ -49,7 +49,7 @@ static void ipath_init_restart(struct ip struct ipath_ibdev *dev; u32 len; - len = ((qp->s_psn - wqe->psn) & IPS_PSN_MASK) * + len = ((qp->s_psn - wqe->psn) & IPATH_PSN_MASK) * ib_mtu_enum_to_int(qp->path_mtu); qp->s_sge.sge = wqe->sg_list[0]; qp->s_sge.sg_list = wqe->sg_list + 1; @@ -159,9 +159,9 @@ u32 ipath_make_rc_ack(struct ipath_qp *q qp->s_ack_state = OP(RDMA_READ_RESPONSE_LAST); bth0 = OP(ACKNOWLEDGE) << 24; if (qp->s_nak_state) - ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPS_MSN_MASK) | + ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPATH_MSN_MASK) | (qp->s_nak_state << - IPS_AETH_CREDIT_SHIFT)); + IPATH_AETH_CREDIT_SHIFT)); else ohdr->u.aeth = ipath_compute_aeth(qp); hwords++; @@ -361,7 +361,7 @@ int ipath_make_rc_req(struct ipath_qp *q if (qp->s_tail >= qp->s_size) qp->s_tail = 0; } - bth2 |= qp->s_psn++ & IPS_PSN_MASK; + bth2 |= qp->s_psn++ & IPATH_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn = qp->s_psn; /* @@ -387,7 +387,7 @@ int ipath_make_rc_req(struct ipath_qp *q qp->s_state = OP(SEND_MIDDLE); /* FALLTHROUGH */ case OP(SEND_MIDDLE): - bth2 = qp->s_psn++ & IPS_PSN_MASK; + bth2 = qp->s_psn++ & IPATH_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn 
= qp->s_psn; ss = &qp->s_sge; @@ -429,7 +429,7 @@ int ipath_make_rc_req(struct ipath_qp *q qp->s_state = OP(RDMA_WRITE_MIDDLE); /* FALLTHROUGH */ case OP(RDMA_WRITE_MIDDLE): - bth2 = qp->s_psn++ & IPS_PSN_MASK; + bth2 = qp->s_psn++ & IPATH_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn = qp->s_psn; ss = &qp->s_sge; @@ -466,7 +466,7 @@ int ipath_make_rc_req(struct ipath_qp *q * See ipath_restart_rc(). */ ipath_init_restart(qp, wqe); - len = ((qp->s_psn - wqe->psn) & IPS_PSN_MASK) * pmtu; + len = ((qp->s_psn - wqe->psn) & IPATH_PSN_MASK) * pmtu; ohdr->u.rc.reth.vaddr = cpu_to_be64(wqe->wr.wr.rdma.remote_addr + len); ohdr->u.rc.reth.rkey = @@ -474,7 +474,7 @@ int ipath_make_rc_req(struct ipath_qp *q ohdr->u.rc.reth.length = cpu_to_be32(qp->s_len); qp->s_state = OP(RDMA_READ_REQUEST); hwords += sizeof(ohdr->u.rc.reth) / 4; - bth2 = qp->s_psn++ & IPS_PSN_MASK; + bth2 = qp->s_psn++ & IPATH_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn = qp->s_psn; ss = NULL; @@ -529,7 +529,7 @@ static void send_rc_ack(struct ipath_qp /* Construct the header. */ ohdr = &hdr.u.oth; - lrh0 = IPS_LRH_BTH; + lrh0 = IPATH_LRH_BTH; /* header size in 32-bit words LRH+BTH+AETH = (8+12+4)/4. 
*/ hwords = 6; if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { @@ -537,14 +537,14 @@ static void send_rc_ack(struct ipath_qp &qp->remote_ah_attr.grh, hwords, 0); ohdr = &hdr.u.l.oth; - lrh0 = IPS_LRH_GRH; + lrh0 = IPATH_LRH_GRH; } /* read pkey_index w/o lock (its atomic) */ bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); if (qp->r_nak_state) - ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPS_MSN_MASK) | + ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPATH_MSN_MASK) | (qp->r_nak_state << - IPS_AETH_CREDIT_SHIFT)); + IPATH_AETH_CREDIT_SHIFT)); else ohdr->u.aeth = ipath_compute_aeth(qp); if (qp->r_ack_state >= OP(COMPARE_SWAP)) { @@ -560,7 +560,7 @@ static void send_rc_ack(struct ipath_qp hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); ohdr->bth[0] = cpu_to_be32(bth0); ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); - ohdr->bth[2] = cpu_to_be32(qp->r_ack_psn & IPS_PSN_MASK); + ohdr->bth[2] = cpu_to_be32(qp->r_ack_psn & IPATH_PSN_MASK); /* * If we can send the ACK, clear the ACK state. @@ -890,8 +890,8 @@ static int do_rc_ack(struct ipath_qp *qp reset_psn(qp, psn); qp->s_rnr_timeout = - ib_ipath_rnr_table[(aeth >> IPS_AETH_CREDIT_SHIFT) & - IPS_AETH_CREDIT_MASK]; + ib_ipath_rnr_table[(aeth >> IPATH_AETH_CREDIT_SHIFT) & + IPATH_AETH_CREDIT_MASK]; ipath_insert_rnr_queue(qp); goto bail; @@ -899,8 +899,8 @@ static int do_rc_ack(struct ipath_qp *qp /* The last valid PSN seen is the previous request's. */ if (qp->s_last != qp->s_tail) qp->s_last_psn = wqe->psn - 1; - switch ((aeth >> IPS_AETH_CREDIT_SHIFT) & - IPS_AETH_CREDIT_MASK) { + switch ((aeth >> IPATH_AETH_CREDIT_SHIFT) & + IPATH_AETH_CREDIT_MASK) { case 0: /* PSN sequence error */ dev->n_seq_naks++; /* @@ -1268,7 +1268,7 @@ static inline int ipath_rc_rcv_error(str * Check for the PSN of the last atomic operation * performed and resend the result if found. 
*/ - if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) + if ((psn & IPATH_PSN_MASK) != qp->r_atomic_psn) goto done; break; } @@ -1638,7 +1638,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de *(u64 *) qp->r_sge.sge.vaddr = sdata; spin_unlock_irq(&dev->pending_lock); qp->r_msn++; - qp->r_atomic_psn = psn & IPS_PSN_MASK; + qp->r_atomic_psn = psn & IPATH_PSN_MASK; psn |= 1 << 31; break; } diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:26 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /* * Convert the AETH RNR timeout code into the number of milliseconds. @@ -632,7 +632,7 @@ again: /* Sending responses has higher priority over sending requests. */ if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE && (bth0 = ipath_make_rc_ack(qp, ohdr, pmtu)) != 0) - bth2 = qp->s_ack_psn++ & IPS_PSN_MASK; + bth2 = qp->s_ack_psn++ & IPATH_PSN_MASK; else if (!((qp->ibqp.qp_type == IB_QPT_RC) ? ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2) : ipath_make_uc_req(qp, ohdr, pmtu, &bth0, &bth2))) { @@ -651,12 +651,12 @@ again: /* Construct the header. 
*/ extra_bytes = (4 - qp->s_cur_size) & 3; nwords = (qp->s_cur_size + extra_bytes) >> 2; - lrh0 = IPS_LRH_BTH; + lrh0 = IPATH_LRH_BTH; if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { qp->s_hdrwords += ipath_make_grh(dev, &qp->s_hdr.u.l.grh, &qp->remote_ah_attr.grh, qp->s_hdrwords, nwords); - lrh0 = IPS_LRH_GRH; + lrh0 = IPATH_LRH_GRH; } lrh0 |= qp->remote_ah_attr.sl << 4; qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 @@ -35,8 +35,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" /** * ipath_parse_ushort - parse an unsigned short value in an arbitrary base @@ -187,7 +187,7 @@ static ssize_t store_lid(struct device * if (ret < 0) goto invalid; - if (lid == 0 || lid >= IPS_MULTICAST_LID_BASE) { + if (lid == 0 || lid >= IPATH_MULTICAST_LID_BASE) { ret = -EINVAL; goto invalid; } @@ -221,7 +221,7 @@ static ssize_t store_mlid(struct device int ret; ret = ipath_parse_ushort(buf, &mlid); - if (ret < 0 || mlid < IPS_MULTICAST_LID_BASE) + if (ret < 0 || mlid < IPATH_MULTICAST_LID_BASE) goto invalid; unit = dd->ipath_unit; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_uc.c --- a/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:26 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /* cut down ridiculously long IB macro names */ #define OP(x) IB_OPCODE_UC_##x @@ -213,7 +213,7 @@ int ipath_make_uc_req(struct ipath_qp *q qp->s_cur_sge = &qp->s_sge; qp->s_cur_size = len; *bth0p = bth0 | (qp->s_state << 24); - *bth2p = qp->s_next_psn++ & IPS_PSN_MASK; + *bth2p = qp->s_next_psn++ & IPATH_PSN_MASK; return 1; done: diff -r 
31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 @@ -34,7 +34,7 @@ #include #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /** * ipath_ud_loopback - handle send on loopback QPs @@ -289,8 +289,8 @@ int ipath_post_ud_send(struct ipath_qp * ret = -EINVAL; goto bail; } - if (ah_attr->dlid >= IPS_MULTICAST_LID_BASE) { - if (ah_attr->dlid != IPS_PERMISSIVE_LID) + if (ah_attr->dlid >= IPATH_MULTICAST_LID_BASE) { + if (ah_attr->dlid != IPATH_PERMISSIVE_LID) dev->n_multicast_xmit++; else dev->n_unicast_xmit++; @@ -310,7 +310,7 @@ int ipath_post_ud_send(struct ipath_qp * if (ah_attr->ah_flags & IB_AH_GRH) { /* Header size in 32-bit words. */ hwords = 17; - lrh0 = IPS_LRH_GRH; + lrh0 = IPATH_LRH_GRH; ohdr = &qp->s_hdr.u.l.oth; qp->s_hdr.u.l.grh.version_tclass_flow = cpu_to_be32((6 << 28) | @@ -336,7 +336,7 @@ int ipath_post_ud_send(struct ipath_qp * } else { /* Header size in 32-bit words. */ hwords = 7; - lrh0 = IPS_LRH_BTH; + lrh0 = IPATH_LRH_BTH; ohdr = &qp->s_hdr.u.oth; } if (wr->opcode == IB_WR_SEND_WITH_IMM) { @@ -367,18 +367,18 @@ int ipath_post_ud_send(struct ipath_qp * if (wr->send_flags & IB_SEND_SOLICITED) bth0 |= 1 << 23; bth0 |= extra_bytes << 20; - bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPS_DEFAULT_P_KEY : + bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPATH_DEFAULT_P_KEY : ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); ohdr->bth[0] = cpu_to_be32(bth0); /* * Use the multicast QP if the destination LID is a multicast LID. */ - ohdr->bth[1] = ah_attr->dlid >= IPS_MULTICAST_LID_BASE && - ah_attr->dlid != IPS_PERMISSIVE_LID ? - __constant_cpu_to_be32(IPS_MULTICAST_QPN) : + ohdr->bth[1] = ah_attr->dlid >= IPATH_MULTICAST_LID_BASE && + ah_attr->dlid != IPATH_PERMISSIVE_LID ? 
+ __constant_cpu_to_be32(IPATH_MULTICAST_QPN) : cpu_to_be32(wr->wr.ud.remote_qpn); /* XXX Could lose a PSN count but not worth locking */ - ohdr->bth[2] = cpu_to_be32(qp->s_next_psn++ & IPS_PSN_MASK); + ohdr->bth[2] = cpu_to_be32(qp->s_next_psn++ & IPATH_PSN_MASK); /* * Qkeys with the high order bit set mean use the * qkey from the QP context instead of the WR (see 10.2.5). @@ -469,7 +469,7 @@ void ipath_ud_rcv(struct ipath_ibdev *de src_qp = be32_to_cpu(ohdr->u.ud.deth[1]); } } - src_qp &= IPS_QPN_MASK; + src_qp &= IPATH_QPN_MASK; /* * Check that the permissive LID is only used on QP0 @@ -627,7 +627,7 @@ void ipath_ud_rcv(struct ipath_ibdev *de /* * Save the LMC lower bits if the destination LID is a unicast LID. */ - wc.dlid_path_bits = dlid >= IPS_MULTICAST_LID_BASE ? 0 : + wc.dlid_path_bits = dlid >= IPATH_MULTICAST_LID_BASE ? 0 : dlid & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); /* Signal completion event if the solicited bit is set. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 @@ -37,7 +37,7 @@ #include "ipath_kernel.h" #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /* Not static, because we don't want the compiler removing it */ const char ipath_verbs_version[] = "ipath_verbs " IPATH_IDSTR; @@ -429,7 +429,7 @@ static void ipath_ib_rcv(void *arg, void /* Check for a valid destination LID (see ch. 7.11.1). 
*/ lid = be16_to_cpu(hdr->lrh[1]); - if (lid < IPS_MULTICAST_LID_BASE) { + if (lid < IPATH_MULTICAST_LID_BASE) { lid &= ~((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); if (unlikely(lid != ipath_layer_get_lid(dev->dd))) { dev->rcv_errors++; @@ -439,9 +439,9 @@ static void ipath_ib_rcv(void *arg, void /* Check for GRH */ lnh = be16_to_cpu(hdr->lrh[0]) & 3; - if (lnh == IPS_LRH_BTH) + if (lnh == IPATH_LRH_BTH) ohdr = &hdr->u.oth; - else if (lnh == IPS_LRH_GRH) + else if (lnh == IPATH_LRH_GRH) ohdr = &hdr->u.l.oth; else { dev->rcv_errors++; @@ -453,8 +453,8 @@ static void ipath_ib_rcv(void *arg, void dev->opstats[opcode].n_packets++; /* Get the destination QP number. */ - qp_num = be32_to_cpu(ohdr->bth[1]) & IPS_QPN_MASK; - if (qp_num == IPS_MULTICAST_QPN) { + qp_num = be32_to_cpu(ohdr->bth[1]) & IPATH_QPN_MASK; + if (qp_num == IPATH_MULTICAST_QPN) { struct ipath_mcast *mcast; struct ipath_mcast_qp *p; @@ -465,7 +465,7 @@ static void ipath_ib_rcv(void *arg, void } dev->n_multicast_rcv++; list_for_each_entry_rcu(p, &mcast->qp_list, list) - ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, + ipath_qp_rcv(dev, hdr, lnh == IPATH_LRH_GRH, data, tlen, p->qp); /* * Notify ipath_multicast_detach() if it is waiting for us @@ -477,7 +477,7 @@ static void ipath_ib_rcv(void *arg, void qp = ipath_lookup_qpn(&dev->qp_table, qp_num); if (qp) { dev->n_unicast_rcv++; - ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, + ipath_qp_rcv(dev, hdr, lnh == IPATH_LRH_GRH, data, tlen, qp); /* * Notify ipath_destroy_qp() if it is waiting @@ -860,8 +860,8 @@ static struct ib_ah *ipath_create_ah(str } /* A multicast address requires a GRH (see ch. 8.4.1). 
*/ - if (ah_attr->dlid >= IPS_MULTICAST_LID_BASE && - ah_attr->dlid != IPS_PERMISSIVE_LID && + if (ah_attr->dlid >= IPATH_MULTICAST_LID_BASE && + ah_attr->dlid != IPATH_PERMISSIVE_LID && !(ah_attr->ah_flags & IB_AH_GRH)) { ret = ERR_PTR(-EINVAL); goto bail; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ips_common.h --- a/drivers/infiniband/hw/ipath/ips_common.h Thu Jun 29 14:33:26 2006 -0700 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,264 +0,0 @@ -#ifndef IPS_COMMON_H -#define IPS_COMMON_H -/* - * Copyright (c) 2006 QLogic, Inc. All rights reserved. - * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- */ - -#include "ipath_common.h" - -struct ipath_header { - /* - * Version - 4 bits, Port - 4 bits, TID - 10 bits and Offset - - * 14 bits before ECO change ~28 Dec 03. After that, Vers 4, - * Port 3, TID 11, offset 14. - */ - __le32 ver_port_tid_offset; - __le16 chksum; - __le16 pkt_flags; -}; - -struct ips_message_header { - __be16 lrh[4]; - __be32 bth[3]; - /* fields below this point are in host byte order */ - struct ipath_header iph; - __u8 sub_opcode; - __u8 flags; - __u16 src_rank; - /* 24 bits. The upper 8 bit is available for other use */ - union { - struct { - unsigned ack_seq_num:24; - unsigned port:4; - unsigned unused:4; - }; - __u32 ack_seq_num_org; - }; - __u8 expected_tid_session_id; - __u8 tinylen; /* to aid MPI */ - union { - __u16 tag; /* to aid MPI */ - __u16 mqhdr; /* for PSM MQ */ - }; - union { - __u32 mpi[4]; /* to aid MPI */ - __u32 data[4]; - __u64 mq[2]; /* for PSM MQ */ - struct { - __u16 mtu; - __u8 major_ver; - __u8 minor_ver; - __u32 not_used; //free - __u32 run_id; - __u32 client_ver; - }; - }; -}; - -struct ether_header { - __be16 lrh[4]; - __be32 bth[3]; - struct ipath_header iph; - __u8 sub_opcode; - __u8 cmd; - __be16 lid; - __u16 mac[3]; - __u8 frag_num; - __u8 seq_num; - __le32 len; - /* MUST be of word size due to PIO write requirements */ - __le32 csum; - __le16 csum_offset; - __le16 flags; - __u16 first_2_bytes; - __u8 unused[2]; /* currently unused */ -}; - -/* - * The PIO buffer used for sending infinipath messages must only be written - * in 32-bit words, all the data must be written, and no writes can occur - * after the last word is written (which transfers "ownership" of the buffer - * to the chip and triggers the message to be sent). - * Since the Linux sk_buff structure can be recursive, non-aligned, and - * any number of bytes in each segment, we use the following structure - * to keep information about the overall state of the copy operation. 
- * This is used to save the information needed to store the checksum - * in the right place before sending the last word to the hardware and - * to buffer the last 0-3 bytes of non-word sized segments. - */ -struct copy_data_s { - struct ether_header *hdr; - /* addr of PIO buf to write csum to */ - __u32 __iomem *csum_pio; - __u32 __iomem *to; /* addr of PIO buf to write data to */ - __u32 device; /* which device to allocate PIO bufs from */ - __s32 error; /* set if there is an error. */ - __s32 extra; /* amount of data saved in u.buf below */ - __u32 len; /* total length to send in bytes */ - __u32 flen; /* frament length in words */ - __u32 csum; /* partial IP checksum */ - __u32 pos; /* position for partial checksum */ - __u32 offset; /* offset to where data currently starts */ - __s32 checksum_calc; /* set to 1 when csum has been calculated */ - struct sk_buff *skb; - union { - __u32 w; - __u8 buf[4]; - } u; -}; - -/* IB - LRH header consts */ -#define IPS_LRH_GRH 0x0003 /* 1. word of IB LRH - next header: GRH */ -#define IPS_LRH_BTH 0x0002 /* 1. 
word of IB LRH - next header: BTH */ - -#define IPS_OFFSET 0 - -/* - * defines the cut-off point between the header queue and eager/expected - * TID queue - */ -#define NUM_OF_EXTRA_WORDS_IN_HEADER_QUEUE \ - ((sizeof(struct ips_message_header) - \ - offsetof(struct ips_message_header, iph)) >> 2) - -/* OpCodes */ -#define OPCODE_IPS 0xC0 -#define OPCODE_ITH4X 0xC1 - -/* OpCode 30 is use by stand-alone test programs */ -#define OPCODE_RAW_DATA 0xDE -/* last OpCode (31) is reserved for test */ -#define OPCODE_TEST 0xDF - -/* sub OpCodes - ips */ -#define OPCODE_SEQ_DATA 0x01 -#define OPCODE_SEQ_CTRL 0x02 - -#define OPCODE_SEQ_MQ_DATA 0x03 -#define OPCODE_SEQ_MQ_CTRL 0x04 - -#define OPCODE_ACK 0x10 -#define OPCODE_NAK 0x11 - -#define OPCODE_ERR_CHK 0x20 -#define OPCODE_ERR_CHK_PLS 0x21 - -#define OPCODE_STARTUP 0x30 -#define OPCODE_STARTUP_ACK 0x31 -#define OPCODE_STARTUP_NAK 0x32 - -#define OPCODE_STARTUP_EXT 0x34 -#define OPCODE_STARTUP_ACK_EXT 0x35 -#define OPCODE_STARTUP_NAK_EXT 0x36 - -#define OPCODE_TIDS_RELEASE 0x40 -#define OPCODE_TIDS_RELEASE_CONFIRM 0x41 - -#define OPCODE_CLOSE 0x50 -#define OPCODE_CLOSE_ACK 0x51 -/* - * like OPCODE_CLOSE, but no complaint if other side has already closed. - * Used when doing abort(), MPI_Abort(), etc. - */ -#define OPCODE_ABORT 0x52 - -/* sub OpCodes - ith4x */ -#define OPCODE_ENCAP 0x81 -#define OPCODE_LID_ARP 0x82 - -/* Receive Header Queue: receive type (from infinipath) */ -#define RCVHQ_RCV_TYPE_EXPECTED 0 -#define RCVHQ_RCV_TYPE_EAGER 1 -#define RCVHQ_RCV_TYPE_NON_KD 2 -#define RCVHQ_RCV_TYPE_ERROR 3 - -/* misc. 
*/ -#define SIZE_OF_CRC 1 - -#define EAGER_TID_ID INFINIPATH_I_TID_MASK - -#define IPS_DEFAULT_P_KEY 0xFFFF - -#define IPS_PERMISSIVE_LID 0xFFFF -#define IPS_MULTICAST_LID_BASE 0xC000 - -#define IPS_AETH_CREDIT_SHIFT 24 -#define IPS_AETH_CREDIT_MASK 0x1F -#define IPS_AETH_CREDIT_INVAL 0x1F - -#define IPS_PSN_MASK 0xFFFFFF -#define IPS_MSN_MASK 0xFFFFFF -#define IPS_QPN_MASK 0xFFFFFF -#define IPS_MULTICAST_QPN 0xFFFFFF - -/* functions for extracting fields from rcvhdrq entries */ -static inline __u32 ips_get_hdr_err_flags(const __le32 * rbuf) -{ - return __le32_to_cpu(rbuf[1]); -} - -static inline __u32 ips_get_index(const __le32 * rbuf) -{ - return (__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_EGRINDEX_SHIFT) - & INFINIPATH_RHF_EGRINDEX_MASK; -} - -static inline __u32 ips_get_rcv_type(const __le32 * rbuf) -{ - return (__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_RCVTYPE_SHIFT) - & INFINIPATH_RHF_RCVTYPE_MASK; -} - -static inline __u32 ips_get_length_in_bytes(const __le32 * rbuf) -{ - return ((__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_LENGTH_SHIFT) - & INFINIPATH_RHF_LENGTH_MASK) << 2; -} - -static inline void *ips_get_first_protocol_header(const __u32 * rbuf) -{ - return (void *)&rbuf[2]; -} - -static inline struct ips_message_header *ips_get_ips_header(const __u32 * - rbuf) -{ - return (struct ips_message_header *)&rbuf[2]; -} - -static inline __u32 ips_get_ipath_ver(__le32 hdrword) -{ - return (__le32_to_cpu(hdrword) >> INFINIPATH_I_VERS_SHIFT) - & INFINIPATH_I_VERS_MASK; -} - -#endif /* IPS_COMMON_H */ From davem at davemloft.net Thu Jun 29 14:50:27 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 14:50:27 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> References: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> Message-ID: <20060629.145027.41636491.davem@davemloft.net> From: Bryan 
O'Sullivan Date: Thu, 29 Jun 2006 14:41:30 -0700 > +/* > + * Copy data. Try not to pollute the dcache with the source data, > + * because we won't be reading it again. > + */ > +#if defined(CONFIG_X86_64) > +void *ipath_memcpy_nc(void *dest, const void *src, size_t n); > +#else > +#define ipath_memcpy_nc(dest, src, n) memcpy(dest, src, n) > +#endif A facility like this doesn't belong in some arbitrary driver layer. It belongs as a generic facility the whole kernel could make use of. Please stop polluting the infiniband drivers with Opteron crap. From davem at davemloft.net Thu Jun 29 14:53:19 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 14:53:19 -0700 (PDT) Subject: [openib-general] [PATCH 38 of 39] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems In-Reply-To: References: Message-ID: <20060629.145319.71091846.davem@davemloft.net> From: Bryan O'Sullivan Date: Thu, 29 Jun 2006 14:41:29 -0700 > ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o > +ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o Again, don't put these kinds of cpu specific functions into the infiniband driver. They are potentially globally useful, not something only Infiniband might want to do. From bos at pathscale.com Thu Jun 29 14:59:37 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:59:37 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.145027.41636491.davem@davemloft.net> References: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> <20060629.145027.41636491.davem@davemloft.net> Message-ID: <1151618377.10886.23.camel@chalcedony.pathscale.com> On Thu, 2006-06-29 at 14:50 -0700, David Miller wrote: > A facility like this doesn't belong in some arbitrary driver layer. > It belongs as a generic facility the whole kernel could make use > of. It could, indeed. In fact, we had that discussion here before I sent this patch in. 
It presumably wants to live in lib/, and acquire a more generic name. What name will capture the uncached-read-but-cached-write semantics in a useful fashion? memcpy_nc? References: <20060629.145319.71091846.davem@davemloft.net> Message-ID: <1151618499.10886.26.camel@chalcedony.pathscale.com> On Thu, 2006-06-29 at 14:53 -0700, David Miller wrote: > From: Bryan O'Sullivan > Date: Thu, 29 Jun 2006 14:41:29 -0700 > > > ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o > > +ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o > > Again, don't put these kinds of cpu specific functions > into the infiniband driver. They are potentially globally > useful, not something only Infiniband might want to do. The support for write combining in the kernel is not in a state where that makes any sense at the moment. Also, this is a single-statement function. References: <20060629.145319.71091846.davem@davemloft.net> <1151618499.10886.26.camel@chalcedony.pathscale.com> Message-ID: <20060629.150417.78710870.davem@davemloft.net> From: Bryan O'Sullivan Date: Thu, 29 Jun 2006 15:01:39 -0700 > The support for write combining in the kernel is not in a state where > that makes any sense at the moment. Please fix the generic code if it doesn't provide the facility you need at the moment. Don't shoe horn it into your driver just to make up for that. From davem at davemloft.net Thu Jun 29 15:03:19 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 15:03:19 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <1151618377.10886.23.camel@chalcedony.pathscale.com> References: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> <20060629.145027.41636491.davem@davemloft.net> <1151618377.10886.23.camel@chalcedony.pathscale.com> Message-ID: <20060629.150319.104035601.davem@davemloft.net> From: Bryan O'Sullivan Date: Thu, 29 Jun 2006 14:59:37 -0700 > It could, indeed. 
In fact, we had that discussion here before I sent > this patch in. It presumably wants to live in lib/, and acquire a more > generic name. What name will capture the uncached-read-but-cached-write > semantics in a useful fashion? memcpy_nc? I'm not good with names :-) Note that there also might be cases where using such a memcpy variant might be the wrong thing to do. For example, for a very tightly coupled CMT cpu implementation which has the memory controller, L2 cache, PCI controller, etc. all on the same die and the PCI controller makes use of the L2 cache just like the cpu threads do, using this kind of memcpy would always be the wrong thing to do. From afriedle at indiana.edu Thu Jun 29 15:43:47 2006 From: afriedle at indiana.edu (Andrew Friedley) Date: Thu, 29 Jun 2006 15:43:47 -0700 Subject: [openib-general] thread safety Message-ID: <44A457A3.60001@indiana.edu> Hello, I'm working with Matt Leininger this summer on developing support for UD in Open MPI, and eventually multicast collectives - he suggested I ask my question here. Is there any documentation available on thread safety (i.e., what is (non-)reentrant) with the openib verbs? I've had trouble finding anything more than vague hints with google. Thanks, Andrew From venkatesh.babu at 3leafnetworks.com Thu Jun 29 16:51:17 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 29 Jun 2006 16:51:17 -0700 Subject: [openib-general] Reloading of partition policy Message-ID: <44A46775.1000507@3leafnetworks.com> I was reviewing partition-config.txt and OpenSM_PKey_Mgr.txt and had the following comment - If we need to add/delete a node to/from a partition we need to update the file /etc/osm-partitions.txt and restart the OpenSM. According to the docs there no way we can do this without restarting the OpenSM. It would be useful to add new feature to reload the partition table after making the changes. 
VBabu From bos at pathscale.com Thu Jun 29 16:34:23 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 16:34:23 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.150319.104035601.davem@davemloft.net> References: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> <20060629.145027.41636491.davem@davemloft.net> <1151618377.10886.23.camel@chalcedony.pathscale.com> <20060629.150319.104035601.davem@davemloft.net> Message-ID: <1151624063.10886.34.camel@chalcedony.pathscale.com> On Thu, 2006-06-29 at 15:03 -0700, David Miller wrote: > I'm not good with names :-) Heh. I'll call it memcpy_nc for now, then, and people can retch all over the name as they please when I submit a more suitably generic patch. > Note that there also might be cases where using such a memcpy > variant might be the wrong thing to do. For example, for a very > tightly coupled CMT cpu implementation which has the memory controller, > L2 cache, PCI controller, etc. all on the same die and the PCI controller > makes use of the L2 cache just like the cpu threads do, using this > kind of memcpy would always be the wrong thing to do. I'm not quite following you, though I assume you're referring to Niagara or Rock :-) Are you saying a memcpy_nc would do worse than plain memcpy, or worse than some other memcpy-like routine? References: <44A457A3.60001@indiana.edu> Message-ID: <44A465FC.4070803@ichips.intel.com> Andrew Friedley wrote: > I'm working with Matt Leininger this summer on developing support for UD > in Open MPI, and eventually multicast collectives - he suggested I ask > my question here. > > Is there any documentation available on thread safety (i.e., what is > (non-)reentrant) with the openib verbs? I've had trouble finding > anything more than vague hints with google. Some kernel information is available in gen2/trunk/src/linux-kernel/docs. See core_locking.txt. 
Some of the information applies to userspace as well, such as all verbs being fully reentrant. - Sean From davem at davemloft.net Thu Jun 29 16:46:23 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 16:46:23 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <1151624063.10886.34.camel@chalcedony.pathscale.com> References: <1151618377.10886.23.camel@chalcedony.pathscale.com> <20060629.150319.104035601.davem@davemloft.net> <1151624063.10886.34.camel@chalcedony.pathscale.com> Message-ID: <20060629.164623.59469884.davem@davemloft.net> From: Bryan O'Sullivan Date: Thu, 29 Jun 2006 16:34:23 -0700 > I'm not quite following you, though I assume you're referring to Niagara > or Rock :-) Are you saying a memcpy_nc would do worse than plain > memcpy, or worse than some other memcpy-like routine? It would do worse than memcpy. If you bypass the L2 cache, it's pointless because the next agent (PCI controller, CPU thread, etc.) is going to need the data in the L2 cache. It's better in that kind of setup to eat the L2 cache miss overhead in memcpy since memcpy can usually prefetch and store buffer in order to absorb some of the L2 miss costs. 
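The "uncached-read-but-cached-write" copy being debated in this thread can be sketched in portable userspace C using GCC's `__builtin_prefetch` with locality 0 (the non-temporal hint). This is purely illustrative — `memcpy_nc_sketch` is a hypothetical name, and it is not the hand-tuned x86_64 assembly the ipath driver actually shipped:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative sketch only: copy while hinting to the CPU that the
 * source has no temporal locality (read once, don't keep it cached),
 * while the destination is written normally and so lands in cache.
 * The third argument of __builtin_prefetch (locality 0) is the
 * non-temporal hint; prefetching past the end of the buffer is safe
 * because prefetch is only a hint and never faults.
 */
static void *memcpy_nc_sketch(void *dest, const void *src, size_t n)
{
	const char *s = src;
	char *d = dest;
	size_t i = 0;

	for (; i + 64 <= n; i += 64) {
		/* hint: fetch the next chunk with no temporal locality */
		__builtin_prefetch(s + i + 64, 0, 0);
		memcpy(d + i, s + i, 64);
	}
	memcpy(d + i, s + i, n - i);	/* 0-63 byte tail */
	return dest;
}
```

On the x86_64 parts of the era the real win came from prefetchnta and streaming stores; on a part whose I/O controller shares the L2, as David Miller notes above, a plain memcpy can be the better choice.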
From ralphc at pathscale.com Thu Jun 29 16:55:41 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 29 Jun 2006 16:55:41 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.164623.59469884.davem@davemloft.net> References: <1151618377.10886.23.camel@chalcedony.pathscale.com> <20060629.150319.104035601.davem@davemloft.net> <1151624063.10886.34.camel@chalcedony.pathscale.com> <20060629.164623.59469884.davem@davemloft.net> Message-ID: <1151625341.4572.133.camel@brick.pathscale.com> This is intended to be an architecture specific function so if the CPU does support HW dma to the CPU's L2 cache, the architecture specific version of memcpy_nc() would not replace the default definition which maps memcpy_nc() to memcpy(). For CPUs like the vast majority currently available, there is a performance benefit by not reading data into the cache that won't be read a second time. On Thu, 2006-06-29 at 16:46 -0700, David Miller wrote: > From: Bryan O'Sullivan > Date: Thu, 29 Jun 2006 16:34:23 -0700 > > > I'm not quite following you, though I assume you're referring to Niagara > > or Rock :-) Are you saying a memcpy_nc would do worse than plain > > memcpy, or worse than some other memcpy-like routine? > > It would do worse than memcpy. > > If you bypass the L2 cache, it's pointless because the next > agent (PCI controller, CPU thread, etc.) is going to need the > data in the L2 cache. > > It's better in that kind of setup to eat the L2 cache miss overhead in > memcpy since memcpy can usually prefetch and store buffer in order to > absorb some of the L2 miss costs. 
> > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From akpm at osdl.org Thu Jun 29 17:02:55 2006 From: akpm at osdl.org (Andrew Morton) Date: Thu, 29 Jun 2006 17:02:55 -0700 Subject: [openib-general] [PATCH 17 of 39] IB/ipath - use more appropriate gfp flags In-Reply-To: <9d943b828776136a2bb7.1151617268@eng-12.pathscale.com> References: <9d943b828776136a2bb7.1151617268@eng-12.pathscale.com> Message-ID: <20060629170255.028d7a90.akpm@osdl.org> "Bryan O'Sullivan" wrote: > > diff -r fd5e733f02ac -r 9d943b828776 drivers/infiniband/hw/ipath/ipath_file_ops.c > --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 > +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 > @@ -705,6 +705,15 @@ static int ipath_create_user_egr(struct > unsigned e, egrcnt, alloced, egrperchunk, chunk, egrsize, egroff; > size_t size; > int ret; > + gfp_t gfp_flags; > + > + /* > + * GFP_USER, but without GFP_FS, so buffer cache can be > + * coalesced (we hope); otherwise, even at order 4, > + * heavy filesystem activity makes these fail, and we can > + * use compound pages. > + */ > + gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP; Yes, GFP_NOFS|_GFP_COMP is reasonably strong - we can do swapout but not file pageout. I expect you'll find that a full GFP_KERNEL is OK here. The ~__GFP_FS is used to prevent the vm scanner from calling into ->writepage() and getting stuck on locks which the __alloc_pages() caller already holds. But ipathfs doesn't even implement ->writepage(), so I don't see any problem with setting __GFP_FS. If you're getting into trouble there then I'd recommend giving it a try - it will make memory reclaim more successful, especially with ext3, where a ->writepage often cleans the page synchronously without doing any IO. 
That being said, order-4 allocations will be fairly reliably unreliable. From akpm at osdl.org Thu Jun 29 17:07:11 2006 From: akpm at osdl.org (Andrew Morton) Date: Thu, 29 Jun 2006 17:07:11 -0700 Subject: [openib-general] [PATCH 28 of 39] IB/ipath - Fixes a bug where our delay for EEPROM no longer works due to compiler reordering In-Reply-To: <5f3c0b2d446d78e3327f.1151617279@eng-12.pathscale.com> References: <5f3c0b2d446d78e3327f.1151617279@eng-12.pathscale.com> Message-ID: <20060629170711.757a97d2.akpm@osdl.org> "Bryan O'Sullivan" wrote: > > The mb() prevents the compiler from reordering on this function, with some versions > of gcc and -Os optimization. The result is random failures in the EEPROM read > without this change. > > > Signed-off-by: Dave Olson > Signed-off-by: Bryan O'Sullivan > > diff -r 7d22a8963bda -r 5f3c0b2d446d drivers/infiniband/hw/ipath/ipath_eeprom.c > --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 > +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 > @@ -186,6 +186,7 @@ bail: > */ > static void i2c_wait_for_writes(struct ipath_devdata *dd) > { > + mb(); > (void)ipath_read_kreg32(dd, dd->ipath_kregs->kr_scratch); > } > That's a bit weird. I wouldn't have expected the compiler to muck around with a readl(). From rick.jones2 at hp.com Thu Jun 29 17:28:50 2006 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 29 Jun 2006 17:28:50 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.164623.59469884.davem@davemloft.net> References: <1151618377.10886.23.camel@chalcedony.pathscale.com> <20060629.150319.104035601.davem@davemloft.net> <1151624063.10886.34.camel@chalcedony.pathscale.com> <20060629.164623.59469884.davem@davemloft.net> Message-ID: <44A47042.8060203@hp.com> > If you bypass the L2 cache, it's pointless because the next > agent (PCI controller, CPU thread, etc.) 
is going to need the > data in the L2 cache. > > It's better in that kind of setup to eat the L2 cache miss overhead in > memcpy since memcpy can usually prefetch and store buffer in order to > absorb some of the L2 miss costs. I thought that most PCI controllers (that is to say the things bridging PCI to the rest of the system) could do prefetching and/or that PCI-X (if not PCI, no idea about PCI-e) cards could issue multiple transactions anyway? rick jones From olson at unixfolk.com Thu Jun 29 17:28:51 2006 From: olson at unixfolk.com (Dave Olson) Date: Thu, 29 Jun 2006 17:28:51 -0700 (PDT) Subject: [openib-general] [PATCH 38 of 39] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems In-Reply-To: References: Message-ID: On Thu, 29 Jun 2006, David Miller wrote: | From: Bryan O'Sullivan | Date: Thu, 29 Jun 2006 14:41:29 -0700 | | > ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o | > +ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o | | Again, don't put these kinds of cpu specific functions | into the infiniband driver. They are potentially globally | useful, not something only Infiniband might want to do. The new code simply sets a flag as to whether instruction-level write barriers need to be used or not; it doesn't contain actual code. The older file (already accepted) does have some setup code, as well as code setting flags, for the reason Bryan mentioned in his reply: this stuff simply doesn't yet exist in a generic form. It's not clear to me that it can ever be made to exist in a generic form that will actually work on multiple architectures (or that there are enough users to be worth trying). We can make the attempt, but so far it's pretty non-generic in its very nature.
Dave Olson olson at unixfolk.com http://www.unixfolk.com/dave From davem at davemloft.net Thu Jun 29 17:32:06 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 17:32:06 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <44A47042.8060203@hp.com> References: <1151624063.10886.34.camel@chalcedony.pathscale.com> <20060629.164623.59469884.davem@davemloft.net> <44A47042.8060203@hp.com> Message-ID: <20060629.173206.48800902.davem@davemloft.net> From: Rick Jones Date: Thu, 29 Jun 2006 17:28:50 -0700 > I thought that most PCI controllers (that is to say the things bridging > PCI to the rest of the system) could do prefetching and/or that PCI-X > (if not PCI, no idea about PCI-e) cards could issue multiple > transactions anyway? People doing deep CMT chips have found out that all of that prefetching and store buffering is unnecessary when everything is so tightly integrated. All of the previous UltraSPARC boxes before Niagara had a streaming cache sitting on the PCI controller. It basically prefetched for reads and collected writes from PCI devices into cacheline sized chunks. The PCI controller in the current Niagara systems has none of that stuff. 
From rick.jones2 at hp.com Thu Jun 29 17:44:05 2006 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 29 Jun 2006 17:44:05 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.173206.48800902.davem@davemloft.net> References: <1151624063.10886.34.camel@chalcedony.pathscale.com> <20060629.164623.59469884.davem@davemloft.net> <44A47042.8060203@hp.com> <20060629.173206.48800902.davem@davemloft.net> Message-ID: <44A473D5.70809@hp.com> David Miller wrote: > From: Rick Jones > Date: Thu, 29 Jun 2006 17:28:50 -0700 > > >>I thought that most PCI controllers (that is to say the things bridging >>PCI to the rest of the system) could do prefetching and/or that PCI-X >>(if not PCI, no idea about PCI-e) cards could issue multiple >>transactions anyway? > > > People doing deep CMT chips have found out that all of that > prefetching and store buffering is unnecessary when everything is so > tightly integrated. Then is prefetching in memcpy really that important to them? (BTW, besides Sun/Niagara, who else is doing "deep CMT"?) > All of the previous UltraSPARC boxes before Niagara had a > streaming cache sitting on the PCI controller. It basically > prefetched for reads and collected writes from PCI devices > into cacheline sized chunks. > > The PCI controller in the current Niagara systems has none of that > stuff. Relying on PCI-X devices to issue multiple requests then?
rick jones From davem at davemloft.net Thu Jun 29 17:47:38 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 17:47:38 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <44A473D5.70809@hp.com> References: <44A47042.8060203@hp.com> <20060629.173206.48800902.davem@davemloft.net> <44A473D5.70809@hp.com> Message-ID: <20060629.174738.85688575.davem@davemloft.net> From: Rick Jones Date: Thu, 29 Jun 2006 17:44:05 -0700 > Then is prefetching in memcpy really that important to them. Not really, the thread just blocks while waiting for memory. On stores they do a cacheline fill optimization similar to the powerpc. > Relying on PCI-X devices to issue multiple requests then? Perhaps :) From chrisw at sous-sol.org Thu Jun 29 18:39:05 2006 From: chrisw at sous-sol.org (Chris Wright) Date: Thu, 29 Jun 2006 18:39:05 -0700 Subject: [openib-general] [stable] [PATCH -stable] IB/mthca: restore missing PCI registers after reset In-Reply-To: <20060628171428.GF19300@mellanox.co.il> References: <20060628171428.GF19300@mellanox.co.il> Message-ID: <20060630013905.GF11588@sequoia.sous-sol.org> * Michael S. Tsirkin (mst at mellanox.co.il) wrote: > Hello, stable team! > The pull of the following fix was requested by Roland Dreier just a couple of > days before 2.6.17 came out, and so it seems it missed 2.6.17 by a narrow > margin: > > http://lkml.org/lkml/2006/6/13/164 Thanks, queued for the next -stable. 
-chris From halr at voltaire.com Thu Jun 29 20:50:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 23:50:21 -0400 Subject: [openib-general] Reloading of partition policy In-Reply-To: <44A46775.1000507@3leafnetworks.com> References: <44A46775.1000507@3leafnetworks.com> Message-ID: <1151639421.4478.746.camel@hal.voltaire.com> On Thu, 2006-06-29 at 19:51, Venkatesh Babu wrote: > I was reviewing partition-config.txt and OpenSM_PKey_Mgr.txt and had the > following comment - > > If we need to add/delete a node to/from a partition we need to update > the file > > /etc/osm-partitions.txt > > and restart the OpenSM. According to the docs there no way we can do > this without restarting the OpenSM. > > It would be useful to add new feature to reload the partition table > after making the changes. Partitions can be deleted and the new partitions enforced via issuing kill -HUP to the OpenSM without restarting now. The document is (already) out of date :-( I will update it shortly. -- Hal > > VBabu > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Fri Jun 30 02:38:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jun 2006 05:38:29 -0400 Subject: [openib-general] [PATCHv2] OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 forLMC > 0 In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891D@mtlexch01.mtl.com> Message-ID: <1151660308.4478.14933.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-06-29 at 15:54, Eitan Zahavi wrote: > Hi Hal, > > I think the check for num lids is so similar it deserves an inline > function. > What do you say? Does the function need to be inlined ? 
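The reload workflow Hal describes, edit the partition file and then signal OpenSM, might look like the fragment below in practice. The partition names, PKeys, and port GUIDs are made-up examples, and the exact path and syntax should be checked against the partition-config.txt shipped with your OpenSM; per Hal's note, kill -HUP <opensm-pid> then re-reads the file without a restart.

```
# /etc/osm-partitions.conf (illustrative only)
# Default partition: all ports are full members, IPoIB enabled
Default=0x7fff, ipoib : ALL=full;

# A restricted partition with two example port GUIDs
Storage=0x8001 : 0x0002c90200001234=full, 0x0002c90200005678=full;
```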
-- Hal > I refer to: > > > + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && > > + ib_switch_info_is_enhanced_port0(p_si)) > > + { > > + num_lids = lmc_num_lids; > > + } > > + else > > + { > > + num_lids = 1; > > + } > > + } > > > > From halr at voltaire.com Fri Jun 30 03:11:52 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jun 2006 06:11:52 -0400 Subject: [openib-general] [PATCHv3] OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <1151662311.4478.16276.camel@hal.voltaire.com> OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Base port 0 is constrained to have an LMC of 0 whereas enhanced switch port 0 is not. Support for enhanced switch port 0 is more like CA and router ports in terms of LMC. Signed-off-by: Hal Rosenstock Index: include/opensm/osm_switch.h =================================================================== --- include/opensm/osm_switch.h (revision 8296) +++ include/opensm/osm_switch.h (working copy) @@ -702,6 +702,42 @@ osm_switch_get_si_ptr( * Switch object *********/ +/****f* OpenSM: Switch/osm_switch_is_sp0_enhanced +* NAME +* osm_switch_is_sp0_enhanced +* +* DESCRIPTION +* Returns whether switch port 0 (SP0) is enhanced or base +* +*/ +static inline uint16_t +osm_switch_is_sp0_enhanced( + IN const osm_switch_t* const p_sw ) +{ + ib_switch_info_t *p_si; + + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + ib_switch_info_is_enhanced_port0(p_si)) + { + return 1; /* enhanced SP0 */ + } + + return 0; /* base SP 0 */ +} +/* +* PARAMETERS +* p_sw +* [in] Pointer to an osm_switch_t object. +* +* RETURN VALUES +* TRUE if SP0 is enhanced. FALSE otherwise.
+* +* NOTES +* +* SEE ALSO +* Switch object +*********/ + /****f* OpenSM: Switch/osm_switch_get_max_block_id * NAME * osm_switch_get_max_block_id Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 8296) +++ opensm/osm_lid_mgr.c (working copy) @@ -94,6 +94,7 @@ #include #include #include +#include #include #include #include @@ -351,6 +352,7 @@ __osm_lid_mgr_init_sweep( osm_lid_mgr_range_t *p_range = NULL; osm_port_t *p_port; cl_qmap_t *p_port_guid_tbl; + osm_switch_t *p_sw; uint8_t lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc); uint16_t lmc_mask; uint16_t req_lid, num_lids; @@ -436,7 +438,19 @@ __osm_lid_mgr_init_sweep( IB_NODE_TYPE_SWITCH ) num_lids = lmc_num_lids; else - num_lids = 1; + { + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (osm_switch_is_sp0_enhanced(p_sw)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } + } if ((num_lids != 1) && (((db_min_lid & lmc_mask) != db_min_lid) || @@ -539,7 +553,17 @@ __osm_lid_mgr_init_sweep( } else { - num_lids = 1; + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (osm_switch_is_sp0_enhanced(p_sw)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } } /* Make sure the lid is aligned */ @@ -798,6 +822,7 @@ __osm_lid_mgr_get_port_lid( uint8_t num_lids = (1 << p_mgr->p_subn->opt.lmc); int lid_changed = 0; uint16_t lmc_mask; + osm_switch_t *p_sw; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_get_port_lid ); @@ -809,10 +834,18 @@ __osm_lid_mgr_get_port_lid( /* get the lid from the guid2lid */ guid = cl_ntoh64( osm_port_get_guid( p_port ) ); - /* if the port is a switch then we only need one lid */ + /* if the port is a switch with base switch port 0 then we only need one lid */ if( osm_node_get_type( 
osm_port_get_parent_node( p_port ) ) == IB_NODE_TYPE_SWITCH ) - num_lids = 1; + { + /* Determine if base switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (!osm_switch_is_sp0_enhanced(p_sw)) + { + num_lids = 1; + } + } /* if the port matches the guid2lid */ if (!osm_db_guid2lid_get( p_mgr->p_g2l, guid, &min_lid, &max_lid)) From eitan at mellanox.co.il Fri Jun 30 03:55:20 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 30 Jun 2006 13:55:20 +0300 Subject: [openib-general] [PATCHv2] OpenSM/osm_lid_mgr.c: Support enhanced switch port 0forLMC > 0 Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302368920@mtlexch01.mtl.com> No not really Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, June 30, 2006 12:38 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [PATCHv2] OpenSM/osm_lid_mgr.c: Support enhanced switch port > 0forLMC > 0 > > Hi Eitan, > > On Thu, 2006-06-29 at 15:54, Eitan Zahavi wrote: > > Hi Hal, > > > > I think the check for num lids is so similar it deserves an inline > > function. > > What do you say? > > Does the function need to be inlined ? 
> > -- Hal > > > I refer to: > > > > > + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && > > > + ib_switch_info_is_enhanced_port0(p_si)) > > > + { > > > + num_lids = lmc_num_lids; > > > + } > > > + else > > > + { > > > + num_lids = 1; > > > + } > > > + } > > > > > > > From svenar at simula.no Fri Jun 30 06:30:22 2006 From: svenar at simula.no (Sven-Arne Reinemo) Date: Fri, 30 Jun 2006 15:30:22 +0200 Subject: [openib-general] A few questions about IBMgtSim In-Reply-To: <4496F9F9.90101@mellanox.co.il> References: <44968BEF.9030401@simula.no> <4496F9F9.90101@mellanox.co.il> Message-ID: <44A5276E.9010001@simula.no> Anno Domini 19-06-2006 21:24, Eitan Zahavi wrote: > Hi Sven, > > Please see my response below: Thanks for your help. I have another question regarding time scales in simulations. When the SM is used with the simulator how do I find the simulated time for events? I.e. if I run a simulation where it takes the SM 1 hour to get to subnet up (wallclock time) how do I find/calculate the time it took according to the simulator clock? Best regards, -- Sven-Arne Reinemo [simula.research laboratory] http://www.simula.no/ ++++ GnuPG public key - http://home.simula.no/~svenar/gpg.asc ++++ From halr at voltaire.com Fri Jun 30 08:10:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jun 2006 11:10:19 -0400 Subject: [openib-general] Reloading of partition policy In-Reply-To: <1151639421.4478.746.camel@hal.voltaire.com> References: <44A46775.1000507@3leafnetworks.com> <1151639421.4478.746.camel@hal.voltaire.com> Message-ID: <1151680218.4478.28538.camel@hal.voltaire.com> On Thu, 2006-06-29 at 23:50, Hal Rosenstock wrote: > On Thu, 2006-06-29 at 19:51, Venkatesh Babu wrote: > > I was reviewing partition-config.txt and OpenSM_PKey_Mgr.txt and had the > > following comment - > > > > If we need to add/delete a node to/from a partition we need to update > > the file > > > > /etc/osm-partitions.txt > > > > and restart the OpenSM. 
According to the docs there no way we can do > > this without restarting the OpenSM. I just looked at those documents and couldn't find what you were referring to. Can you be more specific ? -- Hal > > > > It would be useful to add new feature to reload the partition table > > after making the changes. > > Partitions can be deleted and the new partitions enforced via issuing > kill -HUP to the OpenSM without restarting now. > > The document is (already) out of date :-( I will update it shortly. > > -- Hal > > > > > VBabu > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Fri Jun 30 08:04:35 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 30 Jun 2006 18:04:35 +0300 Subject: [openib-general] A few questions about IBMgtSim Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302368923@mtlexch01.mtl.com> Hi Sven, Currently there is no way to scale simulation time to real time. The main reason is that the time scale is mixed: * OpenSM calculation time is about the same (if you run the simulator on a remote node) * SMA time and packet traversal time is not scaling at all, and the larger the fabric the larger the scaling factor. In real life the hardware handles the packets; in simulation it is a single CPU. EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sven-Arne Reinemo [mailto:svenar at simula.no] > Sent: Friday, June 30, 2006 4:30 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] A few questions about IBMgtSim > > Anno Domini 19-06-2006 21:24, Eitan Zahavi wrote: > > Hi Sven, > > > > Please see my response below: > > Thanks for your help. I have another question regarding time scales in > simulations. When the SM is used with the simulator how do I find the > simulated time for events? I.e. if I run a simulation where it takes the > SM 1 hour to get to subnet up (wallclock time) how do I find/calculate > the time it took according to the simulator clock? > > Best regards, > > -- > Sven-Arne Reinemo > [simula.research laboratory] http://www.simula.no/ > ++++ GnuPG public key - http://home.simula.no/~svenar/gpg.asc ++++ From bos at pathscale.com Fri Jun 30 10:00:31 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 30 Jun 2006 10:00:31 -0700 Subject: [openib-general] [PATCH 0 of 39] ipath - bug fixes, performance enhancements, and portability improvements In-Reply-To: <20060630163108.GA24882@mellanox.co.il> References: <20060630163108.GA24882@mellanox.co.il> Message-ID: <1151686831.2194.7.camel@localhost.localdomain> On Fri, 2006-06-30 at 19:31 +0300, Michael S. Tsirkin wrote: > OK, next week I'll put these into my tree, too. Thanks. The first 37 are in -mm; the last two you can drop until I sort them out. References: <000001c69b9e$86268fd0$8698070a@amr.corp.intel.com> Message-ID: <20060630163345.GB24882@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: ipath patch series a-comin', but no IB maintainer to shepherd them > > >This currently includes a single patch from Venkatesh Babu: > > IB/core: Set alternate port number when initializing QP attributes. > > > >that has been checked into openib svn by Sean. > > Thanks Michael. 
I will assume that you will push this change in through Roland > when he's back. Sure. -- MST From mst at mellanox.co.il Fri Jun 30 09:31:08 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 30 Jun 2006 19:31:08 +0300 Subject: [openib-general] [PATCH 0 of 39] ipath - bug fixes, performance enhancements, and portability improvements In-Reply-To: References: Message-ID: <20060630163108.GA24882@mellanox.co.il> Quoting r. Bryan O'Sullivan : > Subject: [PATCH 0 of 39] ipath - bug fixes, performance enhancements,and portability improvements > > Hi, Andrew - > > These patches bring the ipath driver up to date with a number of bug fixes, > performance improvements, and better PowerPC support. There are a few > whitespace and formatting patches in the series, but they're all self- > contained. The patches have been tested internally, and shouldn't contain > anything controversial. > > My hope is that they'll sit in -mm for a little bit, and make it into > an early 2.6.18 -rc kernel. OK, next week I'll put these into my tree, too. Bryan, as far as I can see there were some comments with regard to patches 38 and 39 in the series. Will you be sending updated revisions of these? -- MST From viswa.krish at gmail.com Fri Jun 30 10:26:11 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Fri, 30 Jun 2006 10:26:11 -0700 Subject: [openib-general] CM and REP handling Message-ID: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> In the current communication manager (CM) implementation how is the REP MAD getting lost handled. When the REP gets lost, the cm_dup_req_handler gets called which currently enters the default condition and does nothing. The client retries the number of timers it is configured to and fails. If the first REP gets lost, the connection never gets established. So what should be the behavior ? -Viswa -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From venkatesh.babu at 3leafnetworks.com Fri Jun 30 11:08:40 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Fri, 30 Jun 2006 11:08:40 -0700 Subject: [openib-general] Reloading of partition policy In-Reply-To: <1151680218.4478.28538.camel@hal.voltaire.com> References: <44A46775.1000507@3leafnetworks.com> <1151639421.4478.746.camel@hal.voltaire.com> <1151680218.4478.28538.camel@hal.voltaire.com> Message-ID: <44A568A8.1020908@3leafnetworks.com> The document doesn't describe the scenario where nodes are added/deleted from the partition table. I raised this issue because it could be an important use case. If this can be achieved without restarting the OpenSM, it is good. Just one more clarification - sending HUP signal doesn't cause OpenSM failover to other standby one right ? VBabu Hal Rosenstock wrote: > On Thu, 2006-06-29 at 23:50, Hal Rosenstock wrote: > >> On Thu, 2006-06-29 at 19:51, Venkatesh Babu wrote: >> >>> I was reviewing partition-config.txt and OpenSM_PKey_Mgr.txt and had the >>> following comment - >>> >>> If we need to add/delete a node to/from a partition we need to update >>> the file >>> >>> /etc/osm-partitions.txt >>> >>> and restart the OpenSM. According to the docs there no way we can do >>> this without restarting the OpenSM. >>> > > I just looked at those documents and couldn't find what you were > referring to. Can you be more specific ? > > -- Hal > > >>> It would be useful to add new feature to reload the partition table >>> after making the changes. >>> >> Partitions can be deleted and the new partitions enforced via issuing >> kill -HUP to the OpenSM without restarting now. >> >> The document is (already) out of date :-( I will update it shortly. 
>> >> -- Hal >> >> >>> VBabu >>> >>> _______________________________________________ >>> openib-general mailing list >>> openib-general at openib.org >>> http://openib.org/mailman/listinfo/openib-general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > > From trimmer at silverstorm.com Fri Jun 30 11:00:44 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 30 Jun 2006 14:00:44 -0400 Subject: [openib-general] CM and REP handling In-Reply-To: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> Message-ID: ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Viswanath Krishnamurthy Sent: Friday, June 30, 2006 1:26 PM To: openib-general at openib.org Subject: [openib-general] CM and REP handling In the current communication manager (CM) implementation how is the REP MAD getting lost handled. When the REP gets lost, the cm_dup_req_handler gets called which currently enters the default condition and does nothing. The client retries the number of timers it is configured to and fails. If the first REP gets lost, the connection never gets established. So what should be the behavior ? -Viswa -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mshefty at ichips.intel.com Fri Jun 30 11:09:06 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 30 Jun 2006 11:09:06 -0700 Subject: [openib-general] CM and REP handling In-Reply-To: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> References: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> Message-ID: <44A568C2.7000502@ichips.intel.com> Viswanath Krishnamurthy wrote: > In the current communication manager (CM) implementation how is the REP MAD > getting lost handled. When the REP gets lost, the cm_dup_req_handler > gets called > which currently enters the default condition and does nothing. The > client retries > the number of timers it is configured to and fails. If the first REP > gets lost, the connection > never gets established. So what should be the behavior ? The REP will be resent until an RTU is received. Repeated REQs can be dropped in cm_dup_req_handler() because the initial REQ has been received and a REP generated. That is, cm_dup_req_handler() is called on the side sending the REP. Are you seeing an issue with the code when the first REP is lost? - Sean From trimmer at silverstorm.com Fri Jun 30 11:12:02 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 30 Jun 2006 14:12:02 -0400 Subject: [openib-general] CM and REP handling In-Reply-To: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> Message-ID: > From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Viswanath Krishnamurthy > Sent: Friday, June 30, 2006 1:26 PM > In the current communication manager (CM) implementation how is the REP MAD > getting lost handled. When the REP gets lost, the cm_dup_req_handler gets called > which currently enters the default condition and does nothing. The client retries > the number of timers it is configured to and fails. If the first REP gets lost, the connection > never gets established. So what should be the behavior ? 
The IBTA standard in section 12.9.7 defines this situation in the state machine. In this case the Active side will have sent a REQ. It will be in REQ Sent state (or REP Wait in passive side sent an MRA). In these states the Active side will have a timer running. If the REP is lost, the Active side will timeout and move to the "Timeout" state. In this state, the active side has the option of resending the REQ or sending a REJ and giving up on the connection attempt. In general it is best for the active side to perform a few retries before it gives up. During this sequence the passive side will think it has sent its REP (eg. the one which was lost) so it will be in the REP Sent state (see 12.9.7.2). In this state if it receives another matching REQ, it is to resend its REP. There is also a timer on the passive side in this state (waiting for the RTU). If the passive side times out it will move to RTU Timeout and has the option to resent its REP or send a REJ and give up the connection attempt. Here too it is best for the passive side to perform a few retries before giving up. Todd Rimmer -------------- next part -------------- An HTML attachment was scrubbed... URL: From trimmer at silverstorm.com Fri Jun 30 11:18:12 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 30 Jun 2006 14:18:12 -0400 Subject: [openib-general] CM and REP handling In-Reply-To: <44A568C2.7000502@ichips.intel.com> Message-ID: > From: Sean Hefty > Sent: Friday, June 30, 2006 2:09 PM > > The REP will be resent until an RTU is received. Repeated REQs can be > dropped > in cm_dup_req_handler() because the initial REQ has been received and a > REP > generated. That is, cm_dup_req_handler() is called on the side sending > the REP. > Shouldn't the cm_dup_req_handler in this case also resend the REP per the IBTA passive side state machine "REP Sent" state? 
Todd Rimmer From mshefty at ichips.intel.com Fri Jun 30 11:28:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 30 Jun 2006 11:28:28 -0700 Subject: [openib-general] CM and REP handling In-Reply-To: References: Message-ID: <44A56D4C.70704@ichips.intel.com> Rimmer, Todd wrote: > Shouldn't the cm_dup_req_handler in this case also resend the REP per > the IBTA passive side state machine "REP Sent" state? The REP is already being retried based on a timeout. It could be resent immediately in response to a duplicate REQ as well, but that shouldn't be necessary, and actually makes things more complex, since coordination must be done between sending based on a timeout, versus receiving a duplicate REQ. - Sean From trimmer at silverstorm.com Fri Jun 30 12:46:40 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 30 Jun 2006 15:46:40 -0400 Subject: [openib-general] CM and REP handling In-Reply-To: <44A56D4C.70704@ichips.intel.com> Message-ID: > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Friday, June 30, 2006 2:28 PM > > Rimmer, Todd wrote: > > Shouldn't the cm_dup_req_handler in this case also resend the REP per > > the IBTA passive side state machine "REP Sent" state? > > The REP is already being retried based on a timeout. It could be resent > immediately in response to a duplicate REQ as well, but that shouldn't be > necessary, and actually makes things more complex, since coordination must > be > done between sending based on a timeout, versus receiving a duplicate REQ. I would recommend implementing the state machine as defined in the spec for the following reasons: 1. it will be necessary to pass any future IBTA CIWG compliance tests for the CM 2. I would need to think about it, but the lost REP case may not be the only situation where a duplicate REQ can be received. 3.
depending on RTU timeout on the passive side as the only means for resending the REP reduces the retries attempted in a "lossy" fabric for REP and RTU loss (eg. if you have 8 RTU timeout retries on passive side, and many REPs are lost followed by many RTUs, you get a total of 8 lost REPs+RTUs before you give up; managing the counters separately will tend to allow for more retries). In our proprietary stack we implemented the defined state machine and have stressed it for 1000s of concurrent connections (including various Chariot SDP connect/disconnect stress tests and Oracle uDAPL stress tests plus our use of the CM to establish connections when running MPI on 1000s of nodes) in various real-world and contrived situations of packet loss and slow responsiveness, and the defined state machine has worked very well for all these situations. Todd Rimmer From rdreier at cisco.com Fri Jun 30 13:56:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 30 Jun 2006 13:56:18 -0700 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them In-Reply-To: <20060629163857.GT19300@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 29 Jun 2006 19:38:57 +0300") References: <20060628171318.7d97d617.akpm@osdl.org> <20060629163857.GT19300@mellanox.co.il> Message-ID: > Further, in the hope that this will help keep things reasonably stable till > Roland comes back, and help everyone see what's being merged, I have > created a git branch for all things infiniband going into 2.6.18. > > You can get at it here: > git://www.mellanox.co.il/~git/infiniband mst-for-2.6.18 Thanks for doing this ... however www.mellanox.co.il doesn't seem to have the git port open: fatal: unable to connect a socket (Connection refused) fetch-pack from 'git://www.mellanox.co.il/~git/infiniband' failed. - R.
From rdreier at cisco.com Fri Jun 30 14:08:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 30 Jun 2006 14:08:19 -0700 Subject: [openib-general] [PATCH 28 of 39] IB/ipath - Fixes a bug where our delay for EEPROM no longer works due to compiler reordering In-Reply-To: <20060629170711.757a97d2.akpm@osdl.org> (Andrew Morton's message of "Thu, 29 Jun 2006 17:07:11 -0700") References: <5f3c0b2d446d78e3327f.1151617279@eng-12.pathscale.com> <20060629170711.757a97d2.akpm@osdl.org> Message-ID: > > static void i2c_wait_for_writes(struct ipath_devdata *dd) > > { > > + mb(); > > (void)ipath_read_kreg32(dd, dd->ipath_kregs->kr_scratch); > > } > That's a bit weird. I wouldn't have expected the compiler to muck around > with a readl(). I never liked this patch. The last time it came up there were conflicting answers about whether it was a code generation bug or a real issue talking to hardware or what. At the least I think this merits a big comment explain what's going on -- and even better would be really understanding the bug that's being fixed so that we're confident it is indeed a real fix. - R. From mshefty at ichips.intel.com Fri Jun 30 14:52:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 30 Jun 2006 14:52:07 -0700 Subject: [openib-general] CM and REP handling In-Reply-To: References: Message-ID: <44A59D07.3060303@ichips.intel.com> Rimmer, Todd wrote: > I would recommend implementing the state machine as defined in the spec > for the following reasons: Technically, I believe that this follows the state machine. After receiving a duplicate REQ, a REP will be resent. The only difference is that there is a delay in resending the REP. > 1. it will be necessary to pass any future IBTA CIWG compliance tests > for the CM I don't believe that a compliance test would detect any issue. > 2. I would need to think about it, but the lost REP case may not be the > only situation where a duplicate REQ can be received. 
Note that the IB CM handles duplicate REQs differently based on the
current state.

> 3. depending on RTU timeout on the passive side as the only means for
> resending the REP reduces the retries attempted in a "lossy" fabric for
> REP and RTU loss (e.g. if you have 8 RTU timeout retries on the passive
> side, and many REPs are lost followed by many RTUs, you get a total of 8
> lost REPs+RTUs before you give up; managing the counters separately will
> tend to allow for more retries).

The number of retries cannot exceed the maximum CM retries that was
specified in the REQ.  Resending a REP immediately after receiving a
duplicate REQ needs to check against this and increment the number of
REPs that have been sent.  The result is that the connection timeout
actually decreases for every duplicate REQ that is received.

- Sean

From sean.hefty at intel.com  Fri Jun 30 20:58:33 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 30 Jun 2006 20:58:33 -0700
Subject: [openib-general] [PATCH] RMPP: add Dual-sided RMPP support
Message-ID: <000001c69cc2$9f7ac710$4e268686@amr.corp.intel.com>

Add support for dual-sided RMPP transfers.

The implementation assumes that any RMPP request that requires a
response uses DS RMPP.  Based on the RMPP start-up scenarios defined by
the spec, this should be a valid assumption.  That is, there is no
start-up scenario defined where an RMPP request is followed by a
non-RMPP response.  By having this assumption, we avoid any API changes.

In order for a node that supports DS RMPP to communicate with one that
does not, RMPP responses assume a new window size of 1 if a DS ACK has
not been received.  (By DS ACK, I'm referring to the ACK of the final
ACK to the request.)  This is a slight spec deviation, but is necessary
to allow communication with nodes that do not generate the DS ACK.  It
also handles the case when a response is sent after the request state
has been discarded.
Signed-off-by: Sean Hefty
---
This was tested by running grmpp between OpenFabric nodes running with
and without DS RMPP support.  Additional testing is desirable before
committing, since it affects all MADs using RMPP.

Index: mad_rmpp.c
===================================================================
--- mad_rmpp.c	(revision 8224)
+++ mad_rmpp.c	(working copy)
@@ -60,6 +60,7 @@ struct mad_rmpp_recv {
 	int last_ack;
 	int seg_num;
 	int newwin;
+	int repwin;
 
 	__be64 tid;
 	u32 src_qp;
@@ -170,6 +171,32 @@ static struct ib_mad_send_buf *alloc_res
 	return msg;
 }
 
+static void ack_ds_ack(struct ib_mad_agent_private *agent,
+		       struct ib_mad_recv_wc *recv_wc)
+{
+	struct ib_mad_send_buf *msg;
+	struct ib_rmpp_mad *rmpp_mad;
+	int ret;
+
+	msg = alloc_response_msg(&agent->agent, recv_wc);
+	if (IS_ERR(msg))
+		return;
+
+	rmpp_mad = msg->mad;
+	memcpy(rmpp_mad, recv_wc->recv_buf.mad, msg->hdr_len);
+
+	rmpp_mad->mad_hdr.method ^= IB_MGMT_METHOD_RESP;
+	ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE);
+	rmpp_mad->rmpp_hdr.seg_num = 0;
+	rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(1);
+
+	ret = ib_post_send_mad(msg, NULL);
+	if (ret) {
+		ib_destroy_ah(msg->ah);
+		ib_free_send_mad(msg);
+	}
+}
+
 void ib_rmpp_send_handler(struct ib_mad_send_wc *mad_send_wc)
 {
 	struct ib_rmpp_mad *rmpp_mad = mad_send_wc->send_buf->mad;
@@ -271,6 +298,7 @@ create_rmpp_recv(struct ib_mad_agent_pri
 	rmpp_recv->newwin = 1;
 	rmpp_recv->seg_num = 1;
 	rmpp_recv->last_ack = 0;
+	rmpp_recv->repwin = 1;
 
 	mad_hdr = &mad_recv_wc->recv_buf.mad->mad_hdr;
 	rmpp_recv->tid = mad_hdr->tid;
@@ -591,6 +619,16 @@ static inline void adjust_last_ack(struc
 		break;
 	}
 
+static void process_ds_ack(struct ib_mad_agent_private *agent,
+			   struct ib_mad_recv_wc *mad_recv_wc, int newwin)
+{
+	struct mad_rmpp_recv *rmpp_recv;
+
+	rmpp_recv = find_rmpp_recv(agent, mad_recv_wc);
+	if (rmpp_recv && rmpp_recv->state == RMPP_STATE_COMPLETE)
+		rmpp_recv->repwin = newwin;
+}
+
 static void process_rmpp_ack(struct ib_mad_agent_private *agent,
 			     struct ib_mad_recv_wc *mad_recv_wc)
 {
@@ -616,8 +654,18 @@ static void process_rmpp_ack(struct ib_m
 	spin_lock_irqsave(&agent->lock, flags);
 	mad_send_wr = ib_find_send_mad(agent, mad_recv_wc);
-	if (!mad_send_wr)
-		goto out;	/* Unmatched ACK */
+	if (!mad_send_wr) {
+		if (!seg_num)
+			process_ds_ack(agent, mad_recv_wc, newwin);
+		goto out;	/* Unmatched or DS RMPP ACK */
+	}
+
+	if ((mad_send_wr->last_ack == mad_send_wr->send_buf.seg_count) &&
+	    (mad_send_wr->timeout)) {
+		spin_unlock_irqrestore(&agent->lock, flags);
+		ack_ds_ack(agent, mad_recv_wc);
+		return;	/* Repeated ACK for DS RMPP transaction */
+	}
 
 	if ((mad_send_wr->last_ack == mad_send_wr->send_buf.seg_count) ||
 	    (!mad_send_wr->timeout) || (mad_send_wr->status != IB_WC_SUCCESS))
@@ -656,6 +704,9 @@ static void process_rmpp_ack(struct ib_m
 		if (mad_send_wr->refcount == 1)
 			ib_reset_mad_timeout(mad_send_wr,
 					     mad_send_wr->send_buf.timeout_ms);
+		spin_unlock_irqrestore(&agent->lock, flags);
+		ack_ds_ack(agent, mad_recv_wc);
+		return;
 	} else if (mad_send_wr->refcount == 1 &&
 		   mad_send_wr->seg_num < mad_send_wr->newwin &&
 		   mad_send_wr->seg_num < mad_send_wr->send_buf.seg_count) {
@@ -772,6 +823,39 @@ out:
 	return NULL;
 }
 
+static int init_newwin(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	struct ib_mad_agent_private *agent = mad_send_wr->mad_agent_priv;
+	struct ib_mad_hdr *mad_hdr = mad_send_wr->send_buf.mad;
+	struct mad_rmpp_recv *rmpp_recv;
+	struct ib_ah_attr ah_attr;
+	unsigned long flags;
+	int newwin = 1;
+
+	if (!(mad_hdr->method & IB_MGMT_METHOD_RESP))
+		goto out;
+
+	spin_lock_irqsave(&agent->lock, flags);
+	list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) {
+		if (rmpp_recv->tid != mad_hdr->tid ||
+		    rmpp_recv->mgmt_class != mad_hdr->mgmt_class ||
+		    rmpp_recv->class_version != mad_hdr->class_version ||
+		    (rmpp_recv->method & IB_MGMT_METHOD_RESP))
+			continue;
+
+		if (ib_query_ah(mad_send_wr->send_buf.ah, &ah_attr))
+			continue;
+
+		if (rmpp_recv->slid == ah_attr.dlid) {
+			newwin = rmpp_recv->repwin;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&agent->lock, flags);
+out:
+	return newwin;
+}
+
 int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr)
 {
 	struct ib_rmpp_mad *rmpp_mad;
@@ -787,7 +871,7 @@ int ib_send_rmpp_mad(struct ib_mad_send_
 		return IB_RMPP_RESULT_INTERNAL;
 	}
 
-	mad_send_wr->newwin = 1;
+	mad_send_wr->newwin = init_newwin(mad_send_wr);
 
 	/* We need to wait for the final ACK even if there isn't a response */
 	mad_send_wr->refcount += (mad_send_wr->timeout == 0);