From jsquyres at cisco.com Thu May 1 10:43:57 2014
From: jsquyres at cisco.com (Jeff Squyres (jsquyres))
Date: Thu, 1 May 2014 17:43:57 +0000
Subject: [Ofiwg-mpi] First!
Message-ID: <5C139014-E03F-4C17-9595-60610AE1D569@cisco.com>

Just to get something on the web archives...

This list is a sub-group of the OpenFabrics Interconnect Working Group (OFIWG). I'm posting a summary of what has happened so far for those who are just joining the discussion.

The OFIWG is designing a new low-level Linux networking API. Think of it as "Verbs 2.0", but without all the verbs / IB-specific baggage (*** see links at the end for more info on the overall effort).

In January, we collected a lot of feedback from the MPI community and presented it to the OFIWG. Attached are the collated requirements / feedback that were presented (*** see links at the end for more info). The OFIWG asked lots of good questions, and much work has progressed since then.

This list was created to host the MPI-centric sub-discussions of the larger OFIWG libfabric discussions.

------

Links for more information
==========================

Here's a brief writeup of the effort, and a good video of the technical lead of the OFIWG, Sean Hefty, describing the overall libfabric effort:

Writeup/InsideHPC:
http://insidehpc.com/2014/04/28/ofa-streaming-verbs-interface/

OFA Workshop video/Sean Hefty:
http://insidehpc.com/2014/04/18/sean-hefty-presents-scalable-fabric-interfaces/

Slides from video:
https://www.openfabrics.org/images/Workshops_2014/DevWorkshop/presos/Monday/pdf/09.30_2014%20_OFA_Workshop_ofa-sfi.pdf

Explanations of the MPI requirements slides:

Blog post overview:
http://blogs.cisco.com/performance/a-fun-thing-happened-on-the-way-to-the-openframeworks-discussion-today/

InsideHPC slidecast:
http://insidehpc.com/2014/01/30/slidecast-mpi-requirements-network-layer/

OFA Workshop video/Nathan Hjelm:
http://insidehpc.com/2014/04/20/video-mpi-requirements-network-layer/

Slides from video:
https://www.openfabrics.org/images/Workshops_2014/DevWorkshop/presos/Monday/pdf/11.15_2014%20_OFA_Workshop_mpi-community-feedback.pdf

--
Jeff Squyres
jsquyres at cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2014-01-28-mpi-community-feedback.pptx
Type: application/vnd.openxmlformats-officedocument.presentationml.presentation
Size: 702188 bytes
Desc: 2014-01-28-mpi-community-feedback.pptx

From jsquyres at cisco.com Thu May 1 11:00:23 2014
From: jsquyres at cisco.com (Jeff Squyres (jsquyres))
Date: Thu, 1 May 2014 18:00:23 +0000
Subject: [Ofiwg-mpi] Fwd: OpenFabrics lib fabric effort: next round of feedback
References: <7DE1BCDD-335E-4072-AE32-B11EE4399BC0@cisco.com>
Message-ID:

Below is an email I BCC'ed to a bunch of MPI community people before we got this listserv, including Sean Hefty's original email and his slides.

I've already received some feedback on Sean's slides; I'll summarize what I've received so far in subsequent emails.

Please feel free to comment / reply with more thoughts to the 3 specific questions asked below. If it helps, we can certainly have a webex for wider discussion.

Begin forwarded message:

From: "Jeff Squyres (jsquyres)"
Subject: OpenFabrics lib fabric effort: next round of feedback
Date: April 23, 2014 6:45:08 AM EDT
To: "Jeff Squyres (jsquyres)"
Cc: Sean Hefty

MPI Community (BCC'ed -- see below):

Per our prior discussions back in January, the libfabric group is getting farther along in developing fabric-agnostic APIs. Sean Hefty, CC'ed, is the main technical leader of this effort.

Sean sent an email last week that contains a PPT and a link to man pages for a bunch of proposed APIs. See his original email, including the PPT and link, below.

This is very early days; these should probably be considered super-early-first-draft-here's-what-we're-thinking-about kinds of APIs. Many of them are similar to / influenced by the existing Linux verbs API. Many others are totally new / not anywhere close to anything in Linux verbs.

There are (at least) 3 questions on the table for the MPI community:

0. How do we want to have a discussion about this stuff? Do we want a listserv to jointly discuss the MPI-specific aspects of these APIs? (vs. me BCC'ing blastograms to you) Do we want to have another webex soon? ...?

** I ask this question because there's a larger / general libfabric API discussion going on that the greater MPI community may or may not care about. These occur (somewhat) on the OpenFabrics IWG listserv, and mostly on the Tuesday webexes (12pm US Eastern). Do we want a place/time for MPI-specific aspects of this conversation?

1. As you can see in the PPT, there's some discussion occurring about how to integrate this new effort into the old Linux verbs API.
   a) Should all the new APIs just be "extensions" to Linux verbs?
   b) Should it be a whole new API?
   c) Should the new API include (by value) the old Linux verbs APIs?
   d) ...?

** These questions are aimed at a) applications (e.g., MPI implementations) forward-porting to libfabric and b) vendors providing both verbs and libfabric support that can be compatible with each other.

2. What do you think of the *general direction* of these APIs? Don't yet harp on specific parameters and functions (expressed as function pointers in the PPT, BTW) -- just comment on whether you think the overall direction / high-level ideas are headed in the right direction. And if not, indicate why not.

-----

For reference, here's who's BCC'ed in no particular order -- anyone missing? (my mail client decided to drop the BCC lists from the mails I sent back in January/February, so this list may not be complete)

Dave Goodell
Cesare Cantù
Reese Faucette
Amin Hassani
Brad Benton
Charles J Archer
Daniel Holmes
Devendar Bureddy
Jeff Hammond
Michael Blocksome
Michael Raymond
Nathan Hjelm
Pavel Shamis
Ron Brightwell
Sayantan Sur
Ryan Grant
Torsten Hoefler
Chulho Kim
Carl Obert
Perry Schmidt
Fab Tillier
Shane/Matthew Farmer
Howard Pritchard
Tony Skjellum

Begin forwarded message:

> From: "Hefty, Sean"
> Subject: [ofiwg] presentation from today's WG meeting
> Date: April 15, 2014 1:22:07 PM EDT
> To: "ofiwg at lists.openfabrics.org"
>
> Attached is the presentation that I went over today. Next week's meeting will solicit responses from anyone regarding the *direction* of the proposal.
>
> As mentioned in today's meeting, the man page formats for the current APIs can be found in the source code or online here:
>
> https://www.openfabrics.org/downloads/OFIWG/API/
>
> This is slightly older documentation and describes the static inline wrapper functions around the operations discussed in the attached slides.
> The *details* of the specific function calls are what would need discussion within the working group and the application developers.
>
> For a merged effort, I would anticipate that in some cases the same set of function pointers could be usable between gen1 and gen2 APIs (e.g. msg_ops, rma_ops), but with differently named wrapper functions (e.g. fi_write versus ibv_write). An example of this was in last week's presentation. In other cases, functions may not easily apply (e.g. tagged_ops) or only the concepts may be transferable (e.g. optimized poll CQ call). The CM functionality and their full integration would be an example of calls that evolve from gen 1 to gen 2.
>
> - Sean
> _______________________________________________
> ofiwg mailing list
> ofiwg at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/ofiwg

--
Jeff Squyres
jsquyres at cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2014-04-14-ofiwg-proposal.pptx
Type: application/vnd.openxmlformats-officedocument.presentationml.presentation
Size: 288039 bytes
Desc: 2014-04-14-ofiwg-proposal.pptx

From sean.hefty at intel.com Mon May 19 20:11:49 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Tue, 20 May 2014 03:11:49 +0000
Subject: [OFIWG-MPI] FI Architecture slides for OFI WG meeting tomorrow
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FCC50@ORSMSX109.amr.corp.intel.com>

The attached slides are a high-level architectural overview of the fabric interfaces that have been discussed. These slides are not intended to define specific API definitions, though general functionality is covered.

- Sean
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2014-05-20-fi-arch.pptx
Type: application/vnd.openxmlformats-officedocument.presentationml.presentation
Size: 493075 bytes
Desc: 2014-05-20-fi-arch.pptx

From atchleyes at ornl.gov Tue May 20 05:17:03 2014
From: atchleyes at ornl.gov (Atchley, Scott)
Date: Tue, 20 May 2014 12:17:03 +0000
Subject: [OFIWG-MPI] [ofiwg] FI Architecture slides for OFI WG meeting tomorrow
In-Reply-To: <1828884A29C6694DAF28B7E6B8A82373992FCC50@ORSMSX109.amr.corp.intel.com>
References: <1828884A29C6694DAF28B7E6B8A82373992FCC50@ORSMSX109.amr.corp.intel.com>
Message-ID:

On May 19, 2014, at 11:11 PM, "Hefty, Sean" wrote:

> The attached slides are a high-level architectural overview of the fabric interfaces that have been discussed. These slides are not intended to define specific API definitions, though general functionality is covered.
>
> - Sean
> <2014-05-20-fi-arch.pptx>_______________________________________________

What are the call details? Can I join?

Scott

-------------
Scott Atchley
HPC Systems Engineer
Center for Computational Sciences
Oak Ridge National Laboratory
atchleyes at ornl.gov

From jsquyres at cisco.com Tue May 20 05:18:34 2014
From: jsquyres at cisco.com (Jeff Squyres (jsquyres))
Date: Tue, 20 May 2014 12:18:34 +0000
Subject: [OFIWG-MPI] FW: OFI WG Webex meeting
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FCD5D@ORSMSX109.amr.corp.intel.com>

The OFI WG invites you to attend this online webex meeting. This replaces the lync meetings that have previously been scheduled. The meeting time remains unchanged – every Tuesday from 9 AM – 10 AM Pacific, noon – 1 PM Eastern, and somewhat inconvenient for anyone overseas. Thanks!

When: Occurs every Tuesday from 12:00 PM to 1:00 PM effective 4/15/2014.
(UTC-05:00) Eastern Time (US & Canada) Where: Webex +~+~+~+~+~+~+~+~+~+ Topic: Webex for libfabric meeting Date: Every Tuesday, from Tuesday, April 15, 2014 to no end date Time: 12:00 pm, Eastern Daylight Time (New York, GMT-04:00) Meeting Number: 206 021 941 Meeting Password: libfabric ------------------------------------------------------- To join the online meeting (Now from mobile devices!) ------------------------------------------------------- 1. Go to https://cisco.webex.com/ciscosales/j.php?MTID=m4be80dd52b5aafabb061680546ee195c 2. Enter your name and email address. 3. Enter the meeting password: libfabric 4. Click “Join Now”. To view in other time zones or languages, please click the link: https://cisco.webex.com/ciscosales/j.php?MTID=m27790b4b4ac2f2c2885f1a5297a36d33 ---------------------------------------------------------------- ALERT – PLEASE READ: DO NOT DIAL THE TOLL FREE NUMBERS FROM WITHIN THE (408) OR (919) AREA CODES ---------------------------------------------------------------- Please dial the local access number for your area from the list below: - San Jose/Milpitas (408) area: 525-6800 - RTP (919) area: 392-3330 Dialing the WebEx toll free numbers from within 408 or 919 area codes is not enabled (non-Cisco phones). “ If you dial the toll-free numbers within the 408 or 919 area codes you will be instructed to hang up and dial the local access number.” Please use the call-back option whenever possible and otherwise dial local numbers only. The affected toll free numbers are: (866) 432-9903 for the San Jose/Milpitas area and (866) 349-3520 for the RTP area. ------------------------------------------------------- To join the teleconference only ------------------------------------------------------- 1. Dial into Cisco WebEx (view all Global Access Numbers at http://cisco.com/en/US/about/doing_business/conferencing/index.html 2. Follow the prompts to enter the Meeting Number (listed above) or Access Code followed by the # sign. 
San Jose, CA: +1.408.525.6800 RTP: +1.919.392.3330 US/Canada: +1.866.432.9903 United Kingdom: +44.20.8824.0117 India: +91.80.4350.1111 Germany: +49.619.6773.9002 Japan: +81.3.5763.9394 China: +86.10.8515.5666 ------------------------------------------------------- For assistance ------------------------------------------------------- 1. Go to https://cisco.webex.com/ciscosales/mc 2. On the left navigation bar, click “Support”. You can contact the meeting organizer at: jsquyres at cisco.com 1-408-525 0971 To add this meeting to your calendar program (for example Microsoft Outlook), click this link: https://cisco.webex.com/ciscosales/j.php?MTID=ma1a5018513e7fd5f0061a8a142429a5b The playback of UCF (Universal Communications Format) rich media files requires appropriate players. To view this type of rich media files in the meeting, please check whether you have the players installed on your computer by going to https://cisco.webex.com/ciscosales/systemdiagnosis.php. http://www.webex.com CCP:+14085256800x206021941# IMPORTANT NOTICE: This WebEx service includes a feature that allows audio and any documents and other materials exchanged or viewed during the session to be recorded. By joining this session, you automatically consent to such recordings. If you do not consent to the recording, discuss your concerns with the meeting host prior to the start of the recording or do not join the session. Please note that any such recordings may be subject to discovery in the event of litigation. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: text/calendar
Size: 5768 bytes
Desc: not available

From sean.hefty at intel.com Tue May 20 05:19:27 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Tue, 20 May 2014 12:19:27 +0000
Subject: [OFIWG-MPI] [ofiwg] FI Architecture slides for OFI WG meeting tomorrow
In-Reply-To:
References: <1828884A29C6694DAF28B7E6B8A82373992FCC50@ORSMSX109.amr.corp.intel.com>
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FCD7A@ORSMSX109.amr.corp.intel.com>

I forwarded the meeting information separately.

Thanks,
Sean

> > The attached slides are a high-level architectural overview of the fabric
> > interfaces that have been discussed. These slides are not intended to
> > define specific API definitions, though general functionality is covered.
> >
> > - Sean
> > <2014-05-20-fi-arch.pptx>_______________________________________________
>
> What are the call details? Can I join?
>
> Scott

From jsquyres at cisco.com Tue May 20 09:16:06 2014
From: jsquyres at cisco.com (Jeff Squyres (jsquyres))
Date: Tue, 20 May 2014 16:16:06 +0000
Subject: [OFIWG-MPI] Fwd: OFI WG Webex meeting
References: <1828884A29C6694DAF28B7E6B8A82373992FCD5D@ORSMSX109.amr.corp.intel.com>
Message-ID: <53AEB4EB-A415-4FAD-B41C-8C3F24151BF9@cisco.com>

FYI -- I keep getting off-list emails (and even phone calls) from people asking about the overall libfabric effort, providing feedback, etc.

If you'd like to join the weekly Tuesday at noon US Eastern webex, the info is listed below (including the webex URL you click to join).

Begin forwarded message:

> From: "Jeff Squyres (jsquyres)"
> Subject: [OFIWG-MPI] FW: OFI WG Webex meeting
> Date: May 20, 2014 8:18:34 AM EDT
> To: "ofiwg-mpi at lists.openfabrics.org", "Atchley, Scott"
> Reply-To: MPI-specific OFIWG info
>
> The OFI WG invites you to attend this online webex meeting. This replaces the lync meetings that have previously been scheduled.
> _______________________________________________
> ofiwg-mpi mailing list
> ofiwg-mpi at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/ofiwg-mpi

--
Jeff Squyres
jsquyres at cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

From sean.hefty at intel.com Tue May 20 18:56:59 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Wed, 21 May 2014 01:56:59 +0000
Subject: [OFIWG-MPI] FI Architecture slides for OFI WG meeting tomorrow
In-Reply-To: <1828884A29C6694DAF28B7E6B8A82373992FCC50@ORSMSX109.amr.corp.intel.com>
References: <1828884A29C6694DAF28B7E6B8A82373992FCC50@ORSMSX109.amr.corp.intel.com>
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FD49A@ORSMSX109.amr.corp.intel.com>

I've attached a version 2 of the architecture slides that I presented today. If anyone has questions or comments, please feel free to chime in on the mail lists.

The changes are:

I added a note clarifying that this is conceptual, and that the objects mentioned may not actually be directly mapped to a specific software structure or class as defined. That level of detail would still be worked out. As just a possible example, it may not make sense for provider interfaces to be derived from a base object.

This version removes the 'interface' object and instead adds that capability as an operation of the base class. This more easily allows any object to be extended.

I also added an SRQ object for sharing buffers among multiple endpoints.

Finally, I defined a new object, an EQ group, which is a collection of EQs. The definition of an EQ group is related to the progress model identified at the end of the slides.

- Sean
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2014-05-27-fi-arch.pptx
Type: application/vnd.openxmlformats-officedocument.presentationml.presentation
Size: 525997 bytes
Desc: 2014-05-27-fi-arch.pptx

From tjea at us.ibm.com Wed May 21 07:30:10 2014
From: tjea at us.ibm.com (Tsai-yang Jea)
Date: Wed, 21 May 2014 10:30:10 -0400
Subject: [OFIWG-MPI] FI Architecture slides for OFI WG meeting tomorrow
In-Reply-To: <1828884A29C6694DAF28B7E6B8A82373992FD49A@ORSMSX109.amr.corp.intel.com>
References: <1828884A29C6694DAF28B7E6B8A82373992FCC50@ORSMSX109.amr.corp.intel.com> <1828884A29C6694DAF28B7E6B8A82373992FD49A@ORSMSX109.amr.corp.intel.com>
Message-ID:

Hi Sean,

Thank you for the presentation. I am using a Linux machine, and OpenOffice does not handle those PowerPoint files well. It can open the file, but some of the slides are not displayed correctly.

Is it possible to generate a "pdf" file along with the .pptx file in the future?

Thanks.

Alan

From: "Hefty, Sean"
To: "ofiwg-mpi at lists.openfabrics.org", "ofiwg at lists.openfabrics.org"
Date: 05/20/2014 09:57 PM
Subject: Re: [OFIWG-MPI] FI Architecture slides for OFI WG meeting tomorrow
Sent by: ofiwg-mpi-bounces at lists.openfabrics.org

I've attached a version 2 of the architecture slides that I presented today. If anyone has questions or comments, please feel free to chime in on the mail lists.

The changes are:

I added a note clarifying that this is conceptual, and that the objects mentioned may not actually be directly mapped to a specific software structure or class as defined. That level of detail would still be worked out. As just a possible example, it may not make sense for provider interfaces to be derived from a base object.

This version removes the 'interface' object and instead adds that capability as an operation of the base class. This more easily allows any object to be extended.

I also added an SRQ object for sharing buffers among multiple endpoints.

Finally, I defined a new object, an EQ group, which is a collection of EQs. The definition of an EQ group is related to the progress model identified at the end of the slides.

- Sean

[attachment "2014-05-27-fi-arch.pptx" deleted by Tsai-yang Jea/Watson/IBM]
_______________________________________________
ofiwg-mpi mailing list
ofiwg-mpi at lists.openfabrics.org
http://lists.openfabrics.org/mailman/listinfo/ofiwg-mpi

-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available

From sean.hefty at intel.com Wed May 21 09:41:07 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Wed, 21 May 2014 16:41:07 +0000
Subject: [OFIWG-MPI] FI Architecture slides for OFI WG meeting tomorrow
In-Reply-To:
References: <1828884A29C6694DAF28B7E6B8A82373992FCC50@ORSMSX109.amr.corp.intel.com> <1828884A29C6694DAF28B7E6B8A82373992FD49A@ORSMSX109.amr.corp.intel.com>
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FD5EF@ORSMSX109.amr.corp.intel.com>

> Is it possible to generate a "pdf" file along with the .pptx file in
> the future?

I will try to remember. :) Feel free to remind me if I forget.

- Sean

From sean.hefty at intel.com Wed May 21 10:09:02 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Wed, 21 May 2014 17:09:02 +0000
Subject: [OFIWG-MPI] FI Architecture slides for OFI WG meeting tomorrow
In-Reply-To:
References: <1828884A29C6694DAF28B7E6B8A82373992FCC50@ORSMSX109.amr.corp.intel.com> <1828884A29C6694DAF28B7E6B8A82373992FD49A@ORSMSX109.amr.corp.intel.com>
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FD63E@ORSMSX109.amr.corp.intel.com>

> Is it possible to generate a "pdf" file along with the .pptx file in
> the future?

PDF version is attached.

- Sean
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2014-05-27-fi-arch.pdf
Type: application/pdf
Size: 1011904 bytes
Desc: 2014-05-27-fi-arch.pdf

From sean.hefty at intel.com Thu May 22 09:43:06 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Thu, 22 May 2014 16:43:06 +0000
Subject: [OFIWG-MPI] Call today
In-Reply-To: <3D8F945A4E59E644AE9205E5CD3708E557D41FD6@MTIDAG01.mtl.com>
References: <3D8F945A4E59E644AE9205E5CD3708E557D3F286@MTIDAG01.mtl.com> <1828884A29C6694DAF28B7E6B8A82373992FD02A@ORSMSX109.amr.corp.intel.com> <3D8F945A4E59E644AE9205E5CD3708E557D3FB6F@MTIDAG01.mtl.com> <1828884A29C6694DAF28B7E6B8A82373992FD64A@ORSMSX109.amr.corp.intel.com> <3D8F945A4E59E644AE9205E5CD3708E557D41FD6@MTIDAG01.mtl.com>
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FDBFE@ORSMSX109.amr.corp.intel.com>

With permission, copying mailing list on side thread that popped up.

I understand MPI has wild card receives. But tagged semantics are useful even when associated with a generic endpoint concept, or a specific address. Note the proposed endpoint concept is not necessarily bound to a specific piece of hardware, though it may be based on the provider implementation. The tagged operations themselves may be implemented by hardware and are not restricted to being purely a software construct.

Tagged interfaces, as well as other interfaces such as message queues, may still exist above the endpoint. But that layering of interfaces seems better suited above the fabric interfaces (e.g. MPI), rather than included with it. This seems more debatable to me though, and we could examine whether a domain or fabric object should have send/receive capabilities.

- Sean

> -----Original Message-----
> From: Richard Graham [mailto:richardg at mellanox.com]
> Sent: Wednesday, May 21, 2014 11:09 AM
> To: Hefty, Sean
> Cc: Paul Grun (grun at cray.com); Liran Liss
> Subject: RE: Call today
>
> Tag matching as it comes to MPI semantics is not local to a given pair of
> processes, e.g. MPI has a wild card receive that can take data from any
> source, and therefore the matching context is broader than just a single
> pair of source and destination.
>
> Rich
>
> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty at intel.com]
> Sent: Wednesday, May 21, 2014 1:13 PM
> To: Richard Graham
> Cc: Paul Grun (grun at cray.com); Liran Liss
> Subject: RE: Call today
>
> Tag matching, RMA, atomics, and message operations are currently associated
> with an endpoint, but the functions are independent of the communication
> protocol in use. Conceptually, it seems reasonable to think of tag
> matching as a merging of message and RMA write operations.
>
> I agree that an endpoint is associated with the data source/sink. There is
> no implied mapping between a process and an endpoint.
>
> > -----Original Message-----
> > From: Richard Graham [mailto:richardg at mellanox.com]
> > Sent: Tuesday, May 20, 2014 9:22 PM
> > To: Hefty, Sean
> > Cc: Paul Grun (grun at cray.com); Liran Liss
> > Subject: RE: Call today
> >
> > I suppose that you could consider tag-matching as part of transport.
> > However, I would argue that such protocols should be independent of
> > whether or not a reliable or unreliable communication protocol is used
> > (at least when it comes to the tag support needed for MPI). Also, I
> > associate an end-point with either the source and/or the sink of data.
> > In MPI tag matching is associated with mpi-level (process,communicator)
> > pair, and therefore the tag-matching context may be associated with
> > many end-points.
> > I would therefore keep tag-matching as a separate concept.
> >
> > Rich
> >
> > -----Original Message-----
> > From: Hefty, Sean [mailto:sean.hefty at intel.com]
> > Sent: Tuesday, May 20, 2014 1:26 PM
> > To: Richard Graham
> > Cc: Paul Grun (grun at cray.com); Liran Liss
> > Subject: RE: Call today
> >
> > Tag-matching is a transport object (protocol), so I do think it makes
> > sense being associated with a transport level object (i.e. endpoint).
> >
> > I thought you were referring to the SRQ, which may or may not be a
> > transport level object. If the sharing of data buffer(s) among
> > multiple connections is not considered a transport object, then I
> > agree, it may make sense to have it be a separate object with its own
> > interfaces. Alternatively, it could also be a property of endpoints to
> > share receive buffers.
> >
> > When the SRQ appears in the transport object (protocol), it may get
> > more complex.
> >
> > For initial thoughts, sharing receive buffers could be handled by:
> >
> > 1. Create an explicit SRQ object as a 'peer' to an endpoint. SRQ
> > would have the ability to associate receive buffers with it.
> > Endpoints would need to be associated with an SRQ to make use of it.
> > 2. Create an SRQ 'endpoint' object. A send-receive endpoint could be
> > created from and inherit the SRQ interfaces.
> > 3. Add an endpoint property to allow sharing data buffers. Shared
> > buffers could be posted to a domain object, or, alternatively, any
> > endpoint.
> >
> > Ultimately, the question becomes a matter of where the 'post receive
> > buffer' operation resides, and the behavior of any 'post receive buffer'
> > call which may reside elsewhere. E.g. SRQ::PostRecv() versus
> > EP::PostRecv(), what is the behavior of EP::PostRecv() if buffer
> > sharing is enabled?
> >
> > These assume SRQ as a non-transport object, or at least one that is
> > not visible to the application.
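
To make the first of the three buffer-sharing options above more concrete, here is a minimal Python sketch of an explicit SRQ object that sits beside the endpoints as a peer. Everything in it is an assumption made for illustration: the names `SharedRecvQueue`, `Endpoint`, `post_recv`, `take`, and `deliver` are hypothetical and do not come from the slides or from any proposed fabric interface, and refusing `post_recv` on an endpoint that is bound to an SRQ is only one possible answer to the "what is the behavior of EP::PostRecv()" question.

```python
from collections import deque


class SharedRecvQueue:
    """Hypothetical explicit SRQ object: it owns the posted receive buffers."""

    def __init__(self):
        self._buffers = deque()

    def post_recv(self, buf):
        # Buffers are posted to the SRQ, not to an individual endpoint.
        self._buffers.append(buf)

    def take(self):
        # Hand out the oldest posted buffer, or None if the pool is empty.
        return self._buffers.popleft() if self._buffers else None


class Endpoint:
    """Toy endpoint: uses a private receive queue unless bound to an SRQ."""

    def __init__(self, name, srq=None):
        self.name = name
        self._srq = srq
        self._private = deque()

    def post_recv(self, buf):
        # Assumed policy (not from the thread): once buffer sharing is
        # enabled, posting through the endpoint is rejected.
        if self._srq is not None:
            raise RuntimeError("endpoint is bound to an SRQ; post to the SRQ")
        self._private.append(buf)

    def deliver(self, payload):
        # Place an incoming message into the next available receive buffer.
        if self._srq is not None:
            buf = self._srq.take()
        else:
            buf = self._private.popleft() if self._private else None
        if buf is None:
            return None  # nothing posted; a real provider would stall or drop
        n = min(len(buf), len(payload))
        buf[:n] = payload[:n]
        return buf


srq = SharedRecvQueue()
ep_a = Endpoint("a", srq)
ep_b = Endpoint("b", srq)
srq.post_recv(bytearray(8))
srq.post_recv(bytearray(8))

first = ep_b.deliver(b"from-b")   # consumes the first shared buffer
second = ep_a.deliver(b"from-a")  # consumes the second shared buffer
third = ep_a.deliver(b"late")     # None: the shared pool is exhausted
```

Options 2 and 3 would move `post_recv` elsewhere (onto an SRQ-derived endpoint, or onto a domain object), which is exactly the placement question the message raises; the sketch does not argue for one over the others.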
> >
> > > Liran mentioned that you wanted me to repeat what I said - my only
> > > comment was that we not couple transport (connection based
> > > transport) with tag-matching (or any other object supported by the
> > > library).
> > > These are two different concepts, and should be kept separate.
> > >
> > > Rich

From sean.hefty at intel.com Thu May 22 12:23:22 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Thu, 22 May 2014 19:23:22 +0000
Subject: [OFIWG-MPI] Call today
In-Reply-To: <3D8F945A4E59E644AE9205E5CD3708E557D43EE8@MTIDAG01.mtl.com>
References: <3D8F945A4E59E644AE9205E5CD3708E557D3F286@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FD02A@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D3FB6F@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FD64A@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D41FD6@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FDBFE@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D43EE8@MTIDAG01.mtl.com>
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FDCEC@ORSMSX109.amr.corp.intel.com>

> [rich] If the attempt here is to provide a building block that will map to
> different use-case scenarios, then we need to have an architecture that
> will map well onto the areas of interest.  MPI is just one such upper level
> service, one that has been called out specifically in the context of the
> proposal you have been presenting.  So, following on this (the precise
> definition of end point is still rather fuzzy at this stage) in general,
> there is no such one-to-one mapping of an endpoint to an MPI matching
> context, but there can be an association of a matching context with one or
> more endpoints.  What I am suggesting here is that we keep data notions
> around data transfer orthogonal to what is done with the data (tag
> matching, in this case).  How the functionality is implemented (hardware
> or not) is separate from how the stack is architected.

Tag matching is an association between data sent over the wire and the
receive buffer into which the data should be placed.  This is different
than transports that simply match sent data with the buffers using a FIFO
ordering scheme.  Conceptually, this seems very similar to RMA writes,
where control data carried in the transfer indicates which target side
buffer the data should be placed into.  Are you suggesting that RMA
operations should be moved outside of the endpoint object as well?

If you have a more specific proposal (and I'll look at what Howard sent
out), it may help.  But associating data transfer operations with
non-endpoint (or whatever you want to call the base data transfer class)
seems unnatural to me.

From sean.hefty at intel.com Thu May 22 12:29:22 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Thu, 22 May 2014 19:29:22 +0000
Subject: [OFIWG-MPI] [ofiwg] Call today
In-Reply-To:
References: <3D8F945A4E59E644AE9205E5CD3708E557D3F286@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FD02A@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D3FB6F@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FD64A@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D41FD6@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FDBFE@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D43EE8@MTIDAG01.mtl.com>
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FDD08@ORSMSX109.amr.corp.intel.com>

Thanks, Howard, this is helpful.

Regarding the 'tag match class' that you mention, would you create an
'rma class' as a peer, with the RMA operations defined in a similar
fashion?  If not, why not?  Would this also extend to all other data
transfer operations?  I.e. message queue (send/receive) and atomics, plus
any others defined in the future?
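The distinction drawn in this thread between FIFO buffer matching and tag matching, including the MPI-style wildcard receive that makes the matching context span more than one source, can be sketched as below. This is a minimal illustration, not a proposed interface: FifoReceiver, TagMatcher, ANY, post_recv, and arrive are all hypothetical names, and buffers are represented by plain strings.

```python
ANY = object()  # hypothetical wildcard, in the spirit of MPI_ANY_SOURCE / MPI_ANY_TAG


class FifoReceiver:
    """Incoming data lands in the oldest posted buffer, whatever its origin."""

    def __init__(self):
        self._posted = []

    def post_recv(self, name):
        self._posted.append(name)

    def arrive(self, source, tag):
        # Source and tag are ignored: posting order alone selects the buffer.
        return self._posted.pop(0) if self._posted else None


class TagMatcher:
    """Control data carried with the message (source, tag) selects the buffer."""

    def __init__(self):
        self._posted = []  # (source, tag, buffer name), kept in posting order

    def post_recv(self, source, tag, name):
        self._posted.append((source, tag, name))

    def arrive(self, source, tag):
        # The first posted receive whose source and tag both match wins.
        for i, (want_src, want_tag, name) in enumerate(self._posted):
            src_ok = want_src is ANY or want_src == source
            tag_ok = want_tag is ANY or want_tag == tag
            if src_ok and tag_ok:
                del self._posted[i]
                return name
        return None  # unexpected message: nothing posted matches
```

Because a receive posted with ANY as its source can be satisfied by a message from any peer, a single TagMatcher in this sketch naturally serves several endpoints at once, which is the association Rich describes.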
> -----Original Message-----
> From: Howard Pritchard [mailto:hppritcha at gmail.com]
> Sent: Thursday, May 22, 2014 11:53 AM
> To: Richard Graham
> Cc: Hefty, Sean; ofiwg at lists.openfabrics.org;
> ofiwg-mpi at lists.openfabrics.org
> Subject: Re: [ofiwg] Call today
>
> Hi Folks,
>
> here is a diagram of a concept that was discussed in a side conversation
> at the last OFA workshop.  I'd thought that a msgq (aka tag matcher class)
> object should be instantiated via a method of the fabric class.
>
> red lines in the diagram indicated the pointee can be associated with the
> class being pointed to by the arrow, using the bind method of the class
> being pointed to.
>
> the search_by_addr method of the msgq is for use with FID_RDM endpoints,
> while search_by_ep method is when the msgq is associated with multiple
> FID_MSG type endpoints.
>
> Note the slide is a little old since the EC class has been divided now
> into an EQ and counter type completion notification mechanisms.
>
> Hoping this will maybe help a little here.
>
> Howard
>
> On Thu, May 22, 2014 at 11:59 AM, Richard Graham wrote:
>
> Please see inline
>
> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty at intel.com]
> Sent: Thursday, May 22, 2014 12:43 PM
> To: Richard Graham; ofiwg at lists.openfabrics.org;
> ofiwg-mpi at lists.openfabrics.org
> Cc: Paul Grun (grun at cray.com); Liran Liss
> Subject: RE: Call today
>
> With permission, copying mailing list on side thread that popped up.
>
> I understand MPI has wild card receives.  But tagged semantics are
> useful even when associated with a generic endpoint concept, or a specific
> address.  Note the proposed endpoint concept is not necessarily bound to a
> specific piece of hardware, though it may be based on the provider
> implementation.  The tagged operations themselves may be implemented by
> hardware and are not restricted to being purely a software construct.
> [rich] If the attempt here is to provide a building block that will
> map to different use-case scenarios, then we need to have an architecture
> that will map well onto the areas of interest.  MPI is just one such upper
> level service, one that has been called out specifically in the context of
> the proposal you have been presenting.  So, following on this (the precise
> definition of end point is still rather fuzzy at this stage) in general,
> there is no such one-to-one mapping of an endpoint to an MPI matching
> context, but there can be an association of a matching context with one or
> more endpoints.  What I am suggesting here is that we keep data notions
> around data transfer orthogonal to what is done with the data (tag
> matching, in this case).  How the functionality is implemented (hardware
> or not) is separate from how the stack is architected.
>
> Tagged interfaces, as well as other interfaces such as message
> queues, may still exist above the endpoint.  But that layering of
> interfaces seems better suited above the fabric interfaces (e.g. MPI),
> rather than included with it.  This seems more debatable to me though, and
> we could examine whether a domain or fabric object should have send/receive
> capabilities.
> [rich] Need to keep separate how data is transferred (perhaps with
> functions that we may call send/recv) from the ULP's use of this data
> (perhaps also using a similar naming scheme of send/recv).
>
> - Sean
>
> > -----Original Message-----
> > From: Richard Graham [mailto:richardg at mellanox.com]
> > Sent: Wednesday, May 21, 2014 11:09 AM
> > To: Hefty, Sean
> > Cc: Paul Grun (grun at cray.com); Liran Liss
> > Subject: RE: Call today
> >
> > Tag matching as it comes to MPI semantics is not local to a given pair
> > of processes, e.g. MPI has a wild card receive that can take data from
> > any source, and therefore the matching context is broader than just a
> > single pair of source and destination.
> >
> > Rich
> >
> > -----Original Message-----
> > From: Hefty, Sean [mailto:sean.hefty at intel.com]
> > Sent: Wednesday, May 21, 2014 1:13 PM
> > To: Richard Graham
> > Cc: Paul Grun (grun at cray.com); Liran Liss
> > Subject: RE: Call today
> >
> > Tag matching, RMA, atomics, and message operations are currently
> > associated with an endpoint, but the functions are independent of the
> > communication protocol in use.  Conceptually, it seems reasonable to
> > think of tag matching as a merging of message and RMA write operations.
> >
> > I agree that an endpoint is associated with the data source/sink.
> > There is no implied mapping between a process and an endpoint.
> >
> > > -----Original Message-----
> > > From: Richard Graham [mailto:richardg at mellanox.com]
> > > Sent: Tuesday, May 20, 2014 9:22 PM
> > > To: Hefty, Sean
> > > Cc: Paul Grun (grun at cray.com); Liran Liss
> > > Subject: RE: Call today
> > >
> > > I suppose that you could consider tag-matching as part of transport.
> > > However, I would argue that such protocols should be independent of
> > > whether or not a reliable or unreliable communication protocol is
> > > used (at least when it comes to the tag support needed for MPI).
> > > Also, I associate an end-point with either the source and/or the sink
> > > of data.  In MPI tag matching is associated with mpi-level
> > > (process,communicator) pair, and therefore the tag-matching context
> > > may be associated with many end-points.
> > > I would therefore keep tag-matching as a separate concept.
> > >
> > > Rich
>
> _______________________________________________
> ofiwg mailing list
> ofiwg at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/ofiwg

From sean.hefty at intel.com Thu May 22 15:01:02 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Thu, 22 May 2014 22:01:02 +0000
Subject: [OFIWG-MPI] [ofiwg] Call today
In-Reply-To:
References: <3D8F945A4E59E644AE9205E5CD3708E557D3F286@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FD02A@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D3FB6F@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FD64A@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D41FD6@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FDBFE@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D43EE8@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FDD08@ORSMSX109.amr.corp.intel.com>
Message-ID: <1828884A29C6694DAF28B7E6B8A82373992FE034@ORSMSX109.amr.corp.intel.com>

If I understand your issue, it's not with the send calls being associated
with an endpoint, only the relationship of a buffer at the target side.  I
would think this applies to all receive buffers and any target buffers of
an RMA or atomic operation.  I didn't follow your asymmetric comment.

RMA target buffers are currently associated with a domain (using libfabric
terms).  Moving that to the fabric level has an implication that a buffer
may be associated with different access keys.  Trying to move receive
buffers above the domain would have a similar issue.

Attempting to share receive buffers above the domain would likely result
in synchronization issues between devices and providers in such a way that
support for zero copy would be compromised.  Even SRQ is restricted to a
single domain.

From an architectural viewpoint, I guess the question is, does it make
sense that receive buffers never be directly associated with endpoints?
This seems to be the general decision point that this discussion is
leading to.

- Sean

> No I would not create an RMA class.  It seems to me like RMA is a transport
> method, and as such should be a method of the ep class.  What I didn't like
> about the receive buffer method (with or without tags actually since the IB
> SRQ capability also falls under this umbrella) was that it seemed
> asymmetric.  For FID_MSG type EP, with the current proposal, there is no
> ability to allow receive buffers to be posted that could "match" incoming
> sends from a set of EPs.  Here I mean match in the general sense.  Only if
> a vendor supported FID_RDM/FID_DGRAM could this be done as the proposal
> currently stands.
>
> I think I'm restating what Rich said.
>
> Howard

From hppritcha at gmail.com Tue May 27 07:43:04 2014
From: hppritcha at gmail.com (Howard Pritchard)
Date: Tue, 27 May 2014 08:43:04 -0600
Subject: [OFIWG-MPI] [ofiwg] Call today
In-Reply-To: <1828884A29C6694DAF28B7E6B8A82373992FE034@ORSMSX109.amr.corp.intel.com>
References: <3D8F945A4E59E644AE9205E5CD3708E557D3F286@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FD02A@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D3FB6F@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FD64A@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D41FD6@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FDBFE@ORSMSX109.amr.corp.intel.com>
 <3D8F945A4E59E644AE9205E5CD3708E557D43EE8@MTIDAG01.mtl.com>
 <1828884A29C6694DAF28B7E6B8A82373992FDD08@ORSMSX109.amr.corp.intel.com>
 <1828884A29C6694DAF28B7E6B8A82373992FE034@ORSMSX109.amr.corp.intel.com>
Message-ID:

Hi Sean,

Some comments inlined, but first about the idea of the msgq being
instantiated via a method in the fabric class.  I was thinking of the most
general case, but I understand the case for having it be instantiated via
a domain class method.
If a vendor supported a protocol that internal to the domain could handle multiple rails, that would solve the problem of tag/rx buffer matching across multiple rails. Perhaps that's how mxm and psm work already? If that's the case, then I'd agree having a msgq object being instantiated via a domain method would be more appropriate. 2014-05-22 16:01 GMT-06:00 Hefty, Sean : > If I understand your issue, it's not with the send calls being associated > with an endpoint, only the relationship of a buffer at the target side. I > would think this applies to all receive buffers and any target buffers of > an RMA or atomic operation. I didn't follow your asymmetric comment. > > RMA target buffers are currently associated with a domain (using libfabric > terms). Moving that to the fabric level has an implication that a buffer > may be associated with different access keys. Trying to move receive > buffers above the domain would have a similar issue. > > Attempting to share receive buffers above the domain would likely result > in synchronization issues between devices and providers in such a way that > support for zero copy would be compromised. Even SRQ is restricted to a > single domain. > Okay, SRQ example is good reason to have the msgq being "child" of domain. > > From an architectural viewpoint, I guess the question is, does it make > sense that receive buffers never be directly associated with endpoints? > This seems to be the general decision point that this discussion is > leading to. > I think this needs more discussion. I'd think users of SRQ feature in ibverbs may have some ideas about pluses and minuses of associating buffers with endpoints vs the way RX buffers are posted to SRQs now with ibverbs. It might also be interesting to have some PSM and/or MXM gurus provide insight here. Both of these API's have a MSGQ like concept. 
Although it seems just from looking at available header files, and, for example, OpenMPI usage of these APIs, that the MSGQ also provides some kind of communication envelope as well. Actually for the PSM provider in libfabric, its pretty clear that the MSGQ is an important concept for that API. Howard > - Sean > > > > No I would not create an RMA class. It seems to me like RMA is a > transport > > method, and as > > such should be a method of the ep class. What I didn't like about the > > receive buffer method > > (with or without tags actually since the IB SRQ capability also falls > under > > this umbrella) was > > that it seemed asymmetric. For FID_MSG type EP, with the current > proposal, > > there is > > no ability to allow receive buffers to be posted that could "match" > > incoming sends from > > a set of EP's. Here I mean match in the general sense. Only if a vendor > > supported > > FID_RDM/FID_DGRAM could this be done as the proposal currently stands. > > > > I think I'm restating what Rich said. > > > > Howard > > > > > > On Thu, May 22, 2014 at 1:29 PM, Hefty, Sean > wrote: > > > > > > Thanks, Howard, this is helpful. > > > > Regarding the 'tag match class' that you mention, would you create > an > > 'rma class' as a peer, with the RMA operations defined in a similar > > fashion? If not, why not? Would this also extend to all other data > > transfer operations? I.e. message queue (send/receive) and atomics, plus > > any others defined in the future? > > > > > > > -----Original Message----- > > > From: Howard Pritchard [mailto:hppritcha at gmail.com] > > > Sent: Thursday, May 22, 2014 11:53 AM > > > To: Richard Graham > > > Cc: Hefty, Sean; ofiwg at lists.openfabrics.org; ofiwg- > > > mpi at lists.openfabrics.org > > > Subject: Re: [ofiwg] Call today > > > > > > Hi Folks, > > > > > > here is a diagram of a concept that was discussed in a side > > conversation at > > > the last OFA workshop. 
I'd thought that a msgq (aka tag matcher > > class) > > > object > > > should be instantiated via a method of the fabric class. > > > > > > red lines in the diagram indicated the pointee can be associated > > with the > > > class > > > being pointed to by the arrow, using the bind method of the class > > being > > > pointed > > > to. > > > > > > the search_by_addr method of the msgq is for use with FID_RDM > > endpoints, > > > while search_by_ep method is when the msgq is associated with > > multipled > > > FID_MSG type endpoints. > > > > > > Note the slide is a little old since the EC class has been > divided > > now into > > > a EQ and counter type completion notification mechanisms. > > > > > > Hoping this will maybe help a little here. > > > > > > Howard > > > > > > > > > > > > On Thu, May 22, 2014 at 11:59 AM, Richard Graham > > > > > wrote: > > > > > > > > > Please see inline > > > > > > -----Original Message----- > > > From: Hefty, Sean [mailto:sean.hefty at intel.com] > > > Sent: Thursday, May 22, 2014 12:43 PM > > > To: Richard Graham; ofiwg at lists.openfabrics.org; ofiwg- > > > mpi at lists.openfabrics.org > > > Cc: Paul Grun (grun at cray.com); Liran Liss > > > Subject: RE: Call today > > > > > > With permission, copying mailing list on side thread that > > popped up. > > > > > > I understand MPI has wild card receives. But tagged > > semantics are > > > useful even when associated with a generic endpoint concept, or a > > specific > > > address. Note the proposed endpoint concept is not necessarily > > bound to a > > > specific piece of hardware, though it may be based on the > provider > > > implementation. The tagged operations themselves may be > > implemented by > > > hardware and are not restricted to being purely a software > > construct. > > > [rich] If the attempt here is to provide a building block > > that will > > > map to different use-case scenarios, then need to have an > > architecture that > > > will map well onto the areas of interest. 
MPI is just one such > > upper level > > > service, one that has been called out specifically in the context > > of the > > > proposal you have been presenting. So, following on this (the > > precise > > > definition of end point is still rather fuzzy at this stage) in > > general, > > > there is no such one-to-one mapping of and endpoint to an MPI > > matching > > > context, but there can be an association of a matching context > with > > one or > > > more endpoints. What I am suggesting here is that we keep data > > notions > > > around data transfer orthogonal to what is done with the data > (tag > > > matching, in this case). How the functionality is implemented > > (hardware > > > or not) is separate from how the stack in architected > > > > > > Tagged interfaces, as well as other interfaces such as > > message > > > queues, may still exist above the endpoint. But that layering of > > > interfaces seems better suited above the fabric interfaces (e.g. > > MPI), > > > rather than included with it. This seems more debatable to me > > though, and > > > we could examine whether a domain or fabric object should have > > send/receive > > > capabilities. > > > [rich] Need to keep separate how data is transferred > (perhaps > > with > > > functions that we may call send/recv) from the ULP's use of this > > data > > > (perhaps also using the a similar naming scheme of send/recv). > > > > > > - Sean > > > > > > > -----Original Message----- > > > > From: Richard Graham [mailto:richardg at mellanox.com] > > > > Sent: Wednesday, May 21, 2014 11:09 AM > > > > To: Hefty, Sean > > > > Cc: Paul Grun (grun at cray.com); Liran Liss > > > > Subject: RE: Call today > > > > > > > > Tag matching as it comes to MPI semantics is not local > to a > > given > > > pair > > > > of processes, e.g. MPI has a wild card receive that can > > take data > > > from > > > > any source, and therefore the matching context is broader > > than just > > > a > > > > single pair of source and destination. 
> > > > > > > > Rich > > > > > > > > -----Original Message----- > > > > From: Hefty, Sean [mailto:sean.hefty at intel.com] > > > > Sent: Wednesday, May 21, 2014 1:13 PM > > > > To: Richard Graham > > > > Cc: Paul Grun (grun at cray.com); Liran Liss > > > > Subject: RE: Call today > > > > > > > > Tag matching, RMA, atomics, and message operations are > > currently > > > > associated with an endpoint, but the functions are > > independent of > > > the > > > > communication protocol in use. Conceptually, it seems > > reasonable > > > to > > > > think of tag matching as a merging of message and RMA > write > > > operations. > > > > > > > > I agree that an endpoint is associated with the data > > source/sink. > > > > There is no implied mapping between a process and an > > endpoint. > > > > > > > > > > > > > -----Original Message----- > > > > > From: Richard Graham [mailto:richardg at mellanox.com] > > > > > Sent: Tuesday, May 20, 2014 9:22 PM > > > > > To: Hefty, Sean > > > > > Cc: Paul Grun (grun at cray.com); Liran Liss > > > > > Subject: RE: Call today > > > > > > > > > > I suppose that you could consider tag-matching as part > of > > > transport. > > > > > However, I would argue that such protocols should be > > independent > > > of > > > > > whether or not a reliable or unreliable communication > > protocol is > > > > > used > > > > (at least > > > > > when it comes to the tag support needed for MPI). > > Also, I > > > associate an > > > > > end-point with either the source and/or the sync of > data. > > In MPI > > > > > tag matching is associated with mpi-level > > (process,communicator) > > > > > pair, and therefore the tag-matching context may be > > associated > > > with > > > > > many end- > > > > points. > > > > > I would therefore keep tag-matching as a separate > > concept. 
> > > > >
> > > > > Rich
> > > > >
> > > > > -----Original Message-----
> > > > > From: Hefty, Sean [mailto:sean.hefty at intel.com]
> > > > > Sent: Tuesday, May 20, 2014 1:26 PM
> > > > > To: Richard Graham
> > > > > Cc: Paul Grun (grun at cray.com); Liran Liss
> > > > > Subject: RE: Call today
> > > > >
> > > > > Tag-matching is a transport object (protocol), so I do think it makes sense for it to be associated with a transport-level object (i.e. endpoint).
> > > > >
> > > > > I thought you were referring to the SRQ, which may or may not be a transport-level object. If the sharing of data buffer(s) among multiple connections is not considered a transport object, then I agree, it may make sense to have it be a separate object with its own interfaces. Alternatively, it could also be a property of endpoints to share receive buffers.
> > > > >
> > > > > When the SRQ appears in the transport object (protocol), it may get more complex.
> > > > >
> > > > > For initial thoughts, sharing receive buffers could be handled by:
> > > > >
> > > > > 1. Create an explicit SRQ object as a 'peer' to an endpoint. The SRQ would have the ability to associate receive buffers with it. Endpoints would need to be associated with an SRQ to make use of it.
> > > > > 2. Create an SRQ 'endpoint' object. A send-receive endpoint could be created from and inherit the SRQ interfaces.
> > > > > 3. Add an endpoint property to allow sharing data buffers. Shared buffers could be posted to a domain object, or, alternatively, any endpoint.
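As a rough illustration of option 1 in the list above, the sketch below models an explicit SRQ object that owns the receive buffers, with endpoints attached to it as peers. Everything here is an assumption made for illustration: the class and method names are hypothetical and do not correspond to any libfabric or verbs interface, and rejecting an endpoint-level post while sharing is enabled is only one of several possible behaviours.

```python
from collections import deque
from typing import Optional


class SharedReceiveQueue:
    """Option 1: a stand-alone object that owns the shared receive buffers."""

    def __init__(self) -> None:
        self._buffers = deque()

    def post_recv(self, buf: bytearray) -> None:
        self._buffers.append(buf)

    def take(self) -> bytearray:
        if not self._buffers:
            raise RuntimeError("no shared receive buffer posted")
        return self._buffers.popleft()


class Endpoint:
    """Uses its own buffers unless it has been associated with an SRQ."""

    def __init__(self, name: str, srq: Optional[SharedReceiveQueue] = None) -> None:
        self.name = name
        self.srq = srq
        self._private = deque()

    def post_recv(self, buf: bytearray) -> None:
        # Assumption: posting directly to an endpoint is an error once sharing
        # is enabled. Forwarding the buffer to the SRQ would be another option.
        if self.srq is not None:
            raise RuntimeError("endpoint shares buffers; post to the SRQ instead")
        self._private.append(buf)

    def on_message(self, payload: bytes):
        # Consume the next buffer from the SRQ if attached, else a private one.
        buf = self.srq.take() if self.srq is not None else self._private.popleft()
        if len(payload) > len(buf):
            raise ValueError("payload does not fit in the posted buffer")
        buf[:len(payload)] = payload
        return self.name, bytes(buf[:len(payload)])
```

In this model two endpoints attached to the same `SharedReceiveQueue` drain one common pool in arrival order, while an endpoint created without an SRQ keeps the familiar per-endpoint posting behaviour.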
> > > > >
> > > > > Ultimately, the question becomes a matter of where the 'post receive buffer' operation resides, and the behavior of any 'post receive buffer' call which may reside elsewhere. E.g., given SRQ::PostRecv() versus EP::PostRecv(), what is the behavior of EP::PostRecv() if buffer sharing is enabled?
> > > > >
> > > > > These assume SRQ as a non-transport object, or at least one that is not visible to the application.
> > > > >
> > > > > > Liran mentioned that you wanted me to repeat what I said - my only comment was that we not couple transport (connection-based transport) with tag-matching (or any other object supported by the library). These are two different concepts, and should be kept separate.
> > > > > >
> > > > > > Rich
> > >
> > > _______________________________________________
> > > ofiwg mailing list
> > > ofiwg at lists.openfabrics.org
> > > http://lists.openfabrics.org/mailman/listinfo/ofiwg

From sean.hefty at intel.com Tue May 27 10:25:37 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Tue, 27 May 2014 17:25:37 +0000
Subject: [OFIWG-MPI] discussion points from 2014-05-27 OFIWG meeting
Message-ID: <1828884A29C6694DAF28B7E6B8A823739930A2B1@ORSMSX109.amr.corp.intel.com>

Please respond if there are other areas that need to be addressed, but these seemed to be the most significant comments from today's meeting.

An ask was made to show how current features/objects/items _could_ map to the proposed architecture. I will target having an initial cut of this for IB and iWarp by June 10th. This will be useful in identifying any gaps in the architecture.

A specific feature request was made to allow receiving all traffic, including multicast traffic - 'sniffer' mode. Architecturally, I think this requires supporting unconnected receives from *any* source, plus interfaces to configure an endpoint for this mode of operation.

A point was raised about abstracting multiple EQs behind a single EQ, for example, supporting both CM and completion events on the same EQ. Architecturally, this _may_ be possible, though the interface details need to be worked through. I would like to revisit this after discussing progress, EQ groups, and the proposed EQ interfaces in more detail, which I think will more clearly highlight the trade-offs.

Thanks!
- Sean

From sean.hefty at intel.com Thu May 29 11:23:39 2014
From: sean.hefty at intel.com (Hefty, Sean)
Date: Thu, 29 May 2014 18:23:39 +0000
Subject: [OFIWG-MPI] OFIWG MPI requirements document
Message-ID: <1828884A29C6694DAF28B7E6B8A823739930D0C2@ORSMSX109.amr.corp.intel.com>

I converted the MPI requirements that were collected and presented by Jeff to the OFIWG into a Word document table, which is attached. Both Word and PDF formats are available from the OFIWG download site:

https://www.openfabrics.org/downloads/OFIWG/OFIWG-requirements.pdf

The document captures each requirement, along with an initial indication as to whether the libfabric proposal meets the requirement or not (on a scale of 0.0 to 1.0). A note column documents how a requirement is being met, or what is needed to meet the requirement.

There is an 'acked' column next to each requirement. This column is intended as an approval by the requestor (i.e. the MPI community in this case). Once a proposed solution meets a requirement and is acceptable to the requestor, it can be acked. This seemed a reasonable way to track each requirement and ensure that feedback was provided into a final solution.

The OFIWG has not formalized any requirement tracking process, so please consider even this process as being open for discussion. But as the OFIWG continues to discuss the higher-level architecture, I think it's worthwhile to keep these lower-level requirements in mind.

In the meantime, I will add other collected application requirements to this document.

- Sean
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OFIWG-requirements.docx
Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Size: 30993 bytes
Desc: OFIWG-requirements.docx
URL: