[ofiwg] OFIWG 8/20/2024 Minutes

Tue Aug 20 11:07:35 PDT 2024

Thanks to Alexia for taking the notes.

8/20/2024

* Participants:

	Alexia Ingerson (Intel)
	Adam Goldman (Intel)
	Alex McKinley (Intel)
	Ben Lynam (Cornelis)
	Charles Shereda (Cornelis)
	Howard Pritchard (LANL)
	Ian Ziemba (HPE)
	Jerome Soumagne
	Jianxin Xiong (Intel)
	John Byrne (HPE)
	Juee Desai (Intel)
	Ken Raffenetti (ANL)
	Nikhil Nanal (Intel)
	Peinan Zhang (Intel)
	Rajalaxmi (Intel)
	Shi Jin (AWS)
	Stephen Oost (Intel)
	Steve Welch (HPE)
	Zach Dworkin (Intel)

* Notes:

	2.0 timeline:
		1.22 released recently
		2.0 a little delayed behind original schedule (original was July/August). 
		New schedule is alpha in late August - no RC. 
		GA also pushed back - RC end of November. 
		Final release mid-December

* 2.0 pending issues:

	* Add option for not supporting any source receive
	- Some providers might want to optimize for only supporting FI_DIRECTED_RECV. 
	   Some applications only use directed recv which might allow providers to optimize.
	- Could add mode bit FI_NO_ANY_SOURCE to allow disabling any-source receive

	* Support FI_MULTI_RECV for tagged messages
	- Current FI_MULTI_RECV only defined for FI_MSG, not FI_TAGGED
	- Would be useful for tagged messages as well. Options: expand FI_MULTI_RECV to include 
	  tagged, or add a separate capability (FI_TMULTI_RECV) that is specific to tagged.
	- Leaning towards adding a capability bit so as not to break providers that currently 
	  support FI_MULTI_RECV for FI_MSG only
	- HPE agrees capability bit would be preferred
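
To illustrate why a separate capability bit avoids breaking existing providers, here is a minimal sketch in C. The EX_* bit values and ex_caps_satisfied() are hypothetical names made up for this example; they are not libfabric's definitions, and EX_TMULTI_RECV only stands in for whatever name the new bit ends up with.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for illustration only; these are NOT the
 * values libfabric assigns to its capability flags. */
#define EX_MSG          (1ULL << 0)  /* stands in for FI_MSG */
#define EX_TAGGED       (1ULL << 1)  /* stands in for FI_TAGGED */
#define EX_MULTI_RECV   (1ULL << 2)  /* stands in for FI_MULTI_RECV */
#define EX_TMULTI_RECV  (1ULL << 3)  /* stands in for the proposed tagged bit */

/* Return true if the provider offers every capability the application
 * requested, i.e. no requested bit is missing from the provided set. */
bool ex_caps_satisfied(uint64_t requested, uint64_t provided)
{
    return (requested & ~provided) == 0;
}
```

With this model, a provider that advertises EX_MSG | EX_TAGGED | EX_MULTI_RECV today still satisfies an application asking for EX_MSG | EX_MULTI_RECV, while an application that needs tagged multi-receive asks for the new bit and is refused until the provider adds it.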

	* Add hints input and caps output to collective join
	- Currently, before doing any collective operation, collective join allows you to join a collective group
	- Could be useful to have information about the type of collective to help the switch optimize
	  the configuration - could be helpful to add hints as input to join to help collective optimization. 
	  Also add caps as output to let the application know what collectives are supported
	- Add fi_collective_join2 or could alter fi_collective_join call directly. In the man page collective 
	  implementation is defined as "experimental" which allows us to modify the API without having 
	  to be backwards compatible
	- Q: What would be an example?
	- A: Hints would list which collectives you want to call (e.g. allreduce, gather). The caps 
	  returned indicate what the provider can support (can be more than what was requested), but 
	  the provider can disable collectives that weren't requested.
	- The hint needed is really just the type of collective, not the size. This topic needs more 
	  discussion because it targets hardware-specific needs
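
The hints-in/caps-out idea can be sketched as a bitmask exchange. This is only an illustration under assumptions: the EX_COLL_* bits, the always_on mask, and ex_join_caps() are hypothetical names invented here, not part of the libfabric API or of the proposal itself.

```c
#include <stdint.h>

/* Hypothetical collective-type bits, for illustration only. */
#define EX_COLL_BARRIER    (1ULL << 0)
#define EX_COLL_BROADCAST  (1ULL << 1)
#define EX_COLL_ALLREDUCE  (1ULL << 2)
#define EX_COLL_GATHER     (1ULL << 3)

/*
 * Compute the caps a provider might report back from join.
 *   supported: collectives the provider is able to offer
 *   always_on: collectives the provider keeps enabled regardless of hints
 *   hints:     collectives the application says it will call
 * Supported collectives that are neither requested nor always-on are
 * left disabled, so the returned caps can exceed the hints but never
 * exceed what is supported.
 */
uint64_t ex_join_caps(uint64_t supported, uint64_t always_on, uint64_t hints)
{
    return supported & (hints | always_on);
}
```

For example, if every collective is supported, barrier is always on, and the application hints only allreduce, the returned caps contain barrier and allreduce while gather and broadcast stay disabled.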

	* Separate FI_DIRECTED_RECV capability for message and tagged messages
	- Sometimes an application needs FI_DIRECTED_RECV for only one of FI_MSG or FI_TAGGED
	- Proposal is to add FI_DIRECTED_TRECV capability bit.
	- Original FI_DIRECTED_RECV still covers both. Only the new capability is restricted to tagged
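
The relationship between the two bits can be sketched as follows. The EX_* values and ex_directed_recv_enabled() are hypothetical and used only to show the intended semantics: the original bit covers both interfaces, the new bit covers tagged only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for illustration; not libfabric's definitions. */
#define EX_DIRECTED_RECV   (1ULL << 0)  /* stands in for FI_DIRECTED_RECV */
#define EX_DIRECTED_TRECV  (1ULL << 1)  /* stands in for proposed FI_DIRECTED_TRECV */

/* Return true if directed receive applies to the given interface.
 * tagged == true selects the tagged interface, false the message one. */
bool ex_directed_recv_enabled(uint64_t caps, bool tagged)
{
    if (caps & EX_DIRECTED_RECV)
        return true;                 /* original bit covers msg and tagged */
    return tagged && (caps & EX_DIRECTED_TRECV) != 0;  /* new bit: tagged only */
}
```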

	* Only allow binding EPs to one CQ
	- Got a lot of feedback that separating CQs is helpful. This proposed change will be dropped 
	  from 2.0. Objections?
	- Concern: makes it more difficult to map application CQ to hardware CQ. Currently have a request
	  from customers to create a 1:1 relationship between OFI CQ and IBV CQ. Having a single application
	  CQ for sends and receives makes the code messy in regard to hardware mapping of resources
		- Allowing separate CQs won't affect that case. It has more to do with the difficulty of 
		  supporting one CQ for multiple uses

	* Allow different inject sizes for FI_MSG and FI_TAGGED
	- Change already added - resolved

	* Redefine FI_HMEM interface 
	- FI_HMEM is only an on/off capability bit but there are many more specific capabilities 
	  (how to copy, async/sync, dmabuf registration)
	- Issue with CUDA calls conflicting with NCCL
		- PSM3 uses the driver API, not the runtime API, and wasn't able to reproduce the issue
	- 2.19 CUDA switched APIs from the driver API to the virtual API - broke AWS customers
	- A lot of these issues seem to be CUDA specific, maybe don't want to expose some issues 
	  targeted more at CUDA (for example which API to use), but could be good to define attributes 
	  to query
	- FI_HMEM uses the same interface for device/host/managed memory and allows applications 
	  to pass in any type of memory. However, some uses are restricted by what type of memory is 
	  used (e.g. RDMA or IPC cannot support host or managed memory).
		- FI_HMEM_DEVICE_ONLY flag exists to communicate to provider that the memory is 
		  not managed or host memory and can be used through RDMA or IPC protocols
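
A rough sketch of how a device-only hint could feed protocol selection, assuming a simplified model. The enums and ex_select_path() are hypothetical; device_only models the application passing FI_HMEM_DEVICE_ONLY, and the fallback path is a placeholder for whatever a provider does when RDMA/IPC cannot be used.

```c
#include <stdbool.h>

/* Hypothetical memory kinds and data paths, for illustration only. */
enum ex_mem_kind { EX_MEM_HOST, EX_MEM_DEVICE, EX_MEM_MANAGED };
enum ex_path     { EX_PATH_FALLBACK, EX_PATH_DIRECT };

/*
 * Pick a data path for a buffer.
 *   device_only: the application asserted the buffer is plain device
 *                memory (neither host nor managed)
 *   kind:        what the provider would otherwise have to determine
 *                by inspecting the buffer
 * The direct path (RDMA or IPC in the notes above) is chosen only for
 * device memory; host and managed memory take the fallback path.
 */
enum ex_path ex_select_path(bool device_only, enum ex_mem_kind kind)
{
    if (device_only || kind == EX_MEM_DEVICE)
        return EX_PATH_DIRECT;
    return EX_PATH_FALLBACK;
}
```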

	* Logging API
	- No more details, need to look into

	* Next meeting - AWS will present on HMEM capabilities

* Summary:

Discussion centered around 2.0 release schedule and pending issues/discussions. 2.0 is a little 
delayed (originally was July/August). New schedule is alpha in late August - no RC. GA is also 
pushed back - RC at end of November. Final release is targeted for mid-December.

Went over the following issues:
* Add option for not supporting any source receive - add mode bit FI_NO_ANY_SOURCE to 
   disable receiving from any source to allow providers to optimize for directed recv.
* Support FI_MULTI_RECV for tagged messages - add capability FI_MULTI_TRECV to add 
    tagged multi receive capability to not break providers that advertise FI_MULTI_RECV 
    and only support regular messaging with multi recv
* Add hints input and caps output to collective join - add input to join to allow applications 
   to specify which collectives they need and add output for provider to indicate which 
   collectives are enabled. This allows a provider to optimize the configuration.
* Separate FI_DIRECTED_RECV capability for message and tagged messages - add 
   FI_DIRECTED_TRECV capability to specify that directed recv is only needed for the tagged interface. 
   Existing FI_DIRECTED_RECV remains untouched and indicates support for both message 
   and tagged interfaces
* Only allow binding EPs to one CQ - got a lot of feedback that separating CQs is helpful. 
   This proposed change will be dropped from 2.0.
* Allow different inject sizes for FI_MSG and FI_TAGGED - this was added and is upstream
* Redefine FI_HMEM interface - FI_HMEM is only an on/off capability bit but there are 
   many more specific capabilities. AWS will give a presentation at the next OFIWG meeting 
   to discuss HMEM capabilities
* Logging API - no more details, need to look into