[ofiwg] 04/01/2025 OFIWG notes
Ingerson, Alexia
alexia.ingerson at intel.com
Tue Apr 1 12:12:10 PDT 2025
04/01/2025
Participants:
Alexia Ingerson (Intel)
Jianxin Xiong (Intel)
Alex McKinley (Intel)
Ben Lynam (Cornelis)
Charles Shereda (Cornelis)
Call-in User_1
Howard Pritchard (LANL)
Ian Ziemba (HPE)
Jerome Soumagne (HPE)
John Byrne (HPE)
Juee Desai (Intel)
Ken Raffenetti (ANL)
Peinan Zhang (Intel)
Rajalaxmi Angadi (Intel)
Stephen Oost (Intel)
Summary:
2.2.0 release targeted for 6/15 - have your big patches in by the end of May. Big feature changes for this release are a new shm architecture and a refactoring of the lnx provider.
Bug scrub: went over newer issues to make sure they are getting addressed.
Discussion regarding setting of environment variables for different client instances within the same process (for example MPI and DAOS). Suggested using domain ops to specify options for different instances.
Notes:
Planning next release (2.2.0)
* RC1 6/1
* GA 6/15 - big patches by end of May
* Big features: new shm, new lnx
Going over open issues:
#10911: Verbs CSWAP fetch result error
* Able to reproduce with OFI. Looks like an endianness or byte placement issue. Will look into it
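
For readers unfamiliar with the two suspected failure modes in #10911, here is a minimal sketch in plain C. It is not the verbs provider code, and both helper names are hypothetical. The first helper shows what a 64-bit fetch result looks like if its byte order is reversed. The second shows how a 4-byte result written at the wrong offset of an 8-byte result buffer reads back as a different value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reverse the byte order of a 64-bit value. */
static uint64_t swap_bytes64(uint64_t v)
{
    uint64_t out = 0;

    for (int i = 0; i < 8; i++) {
        out = (out << 8) | (v & 0xff);  /* move the lowest byte to the top */
        v >>= 8;
    }
    return out;
}

/*
 * Hypothetical helper: write a 4-byte result into a zeroed 8-byte result
 * buffer at `offset`, then read the first 4 bytes back, as a caller that
 * expects the result at offset 0 would.
 */
static uint32_t read_first_word_after_placement(uint32_t result, size_t offset)
{
    unsigned char buf[8] = {0};
    uint32_t seen;

    assert(offset <= 4);
    memcpy(buf + offset, &result, sizeof(result));
    memcpy(&seen, buf, sizeof(seen));
    return seen;
}
```

Comparing a failing fetch result against swap_bytes64() of the expected value, and against the expected value shifted by 4 bytes, is one way to tell the two cases apart.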
#10887: LINKx MPI_Probe segfault
* Waiting for the new linkx architecture to address this, since the code will be so different. Will be fixed in the new linkx
#10881: Reload verbs devices on each getinfo call
* Being addressed in PR
#10880: Possible to specify source port on libfabric RDMA client side
* Anyone have any experience?
#10879: deadlock with mimalloc
* Rbmap insert gets called with mm_lock held. Madvise calls the insert callback again, leading to deadlock
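
The re-entrancy pattern described for #10879 can be modeled in a few lines of plain C. This is a toy sketch, not libfabric code: the lock is a flag that reports re-entry instead of blocking, every name is hypothetical, and it assumes (as the notes suggest) that the allocator's madvise call is what re-enters the locked path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: reports re-entry instead of blocking forever. */
struct toy_lock { bool held; };

static int toy_lock_acquire(struct toy_lock *l)
{
    assert(l);
    if (l->held)
        return -1;          /* a real non-recursive mutex would deadlock here */
    l->held = true;
    return 0;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l && l->held);
    l->held = false;
}

static struct toy_lock mm_lock_model;   /* stand-in for mm_lock */
static int reentry_detected;

/* Stand-in for the hooked path that runs when the allocator calls madvise. */
static void on_madvise_model(void)
{
    if (toy_lock_acquire(&mm_lock_model) != 0) {
        reentry_detected++;
        return;
    }
    toy_lock_release(&mm_lock_model);
}

/* Stand-in for the insert: allocating a node may make the allocator call madvise. */
static void insert_model(bool allocator_calls_madvise)
{
    if (allocator_calls_madvise)
        on_madvise_model();
}

/* Returns the number of re-entries seen while the lock was held. */
static int insert_with_lock_held(bool allocator_calls_madvise)
{
    reentry_detected = 0;
    if (toy_lock_acquire(&mm_lock_model) != 0)
        return -1;
    insert_model(allocator_calls_madvise);
    toy_lock_release(&mm_lock_model);
    return reentry_detected;
}
```

In this model, insert_with_lock_held(true) reports one re-entry, which is the point where a real mutex would hang. Typical ways out are to avoid allocating while the lock is held or to detect the nested call and skip it.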
#10865: OSU segfault on linkx with cxi cuda
* HPE taking a look
#10860: Build error on Perlmutter
* System specific - HPE will take a look
#10852: verbs async events
* PR under review
#10847: missing rxm CQ entry flags
* Original issue fixed but maybe other issue exists, waiting on reply from reporter
#10822: cxi low performance
* HPE will take a look
#10821: shm prov key support
* Not really a bug. A workaround in the util mr map may fix it; the refactor may also fix it
#10823: efa control plane AV operation locking
* Efa issue
#10804: unsafe reads of av_entry_pool because no lock
#10798: improve rdm ep for storage
* Continued discussion
#10785: tcp debug build generates lock in use assert
* Client is using two threads, but the threading model is set to FI_THREAD_DOMAIN, so locks are set to no-ops
* Will look into threading implementation
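
To illustrate why a debug build can flag this in #10785, here is a small model in plain C. It is a sketch under assumptions, not the tcp provider's lock code: all names are hypothetical, and the two "threads" are simulated by calling acquire twice without a release in between. The idea is that a no-op lock provides no exclusion, so the only thing left to notice overlapping use is a debug-only in-use check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the threading model chosen by the application. */
enum model_threading { MODEL_THREAD_SAFE, MODEL_THREAD_DOMAIN };

enum model_result {
    MODEL_OK,                   /* lock taken */
    MODEL_WOULD_BLOCK,          /* real lock: second caller waits */
    MODEL_IN_USE_CHECK_FAILED   /* no-op lock: debug check trips */
};

struct model_lock {
    bool is_noop;   /* true when the app promised serialized access */
    bool in_use;    /* debug-style tracking of whether the lock is taken */
};

static void model_lock_init(struct model_lock *l, enum model_threading t)
{
    assert(l);
    l->is_noop = (t == MODEL_THREAD_DOMAIN);
    l->in_use = false;
}

static enum model_result model_lock_acquire(struct model_lock *l)
{
    assert(l);
    if (l->in_use)
        return l->is_noop ? MODEL_IN_USE_CHECK_FAILED : MODEL_WOULD_BLOCK;
    l->in_use = true;
    return MODEL_OK;
}

static void model_lock_release(struct model_lock *l)
{
    assert(l);
    l->in_use = false;
}
```

With MODEL_THREAD_DOMAIN, a second acquire before the first release returns MODEL_IN_USE_CHECK_FAILED, which corresponds to the assert seen in the debug build. With MODEL_THREAD_SAFE the same sequence would simply make the second caller wait.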
#10762: data race in ibv_req_notify_cq
#10692: device max_cqe not used to set CQ size
#10589: psm3 illegal instruction
* Fixed
#10566: calling fi_connect and fi_eq_sread costs a lot of time
* System specific, no response, closing
Discussion: Setting libfabric environment variables per client in a process
* Separate libraries in user process want to both use OFI with different environment variables
* Could have domain specific op
* Related to issue #10526 (add runtime setting to select MR cache monitor)
* Domain option should be reasonable since client instances will be using different domains
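
The per-domain idea can be sketched as a lookup order: a setting attached to a domain wins, and the process-wide environment variable is only a fallback. The code below is a plain C model, not a proposed libfabric interface. The struct, function, and default value are hypothetical, and the environment value is passed in by the caller (for example from getenv("FI_MR_CACHE_MONITOR"), assuming that is the variable of interest).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-domain settings; NULL means "no override for this domain". */
struct model_domain {
    const char *name;
    const char *monitor_override;
};

/*
 * Resolve the MR cache monitor for one domain:
 *   1. per-domain override, if the client set one;
 *   2. process-wide environment value, if present;
 *   3. a built-in default (hypothetical value).
 */
static const char *resolve_monitor(const struct model_domain *d,
                                   const char *env_value)
{
    assert(d);
    if (d->monitor_override)
        return d->monitor_override;
    if (env_value)
        return env_value;
    return "default";
}
```

For example, with the environment value "memhooks", a domain opened by one library with an override of "userfaultfd" resolves to "userfaultfd", while a domain opened by another library without an override still resolves to "memhooks".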