[ofiwg] OFIWG notes 03/05/2025

Ingerson, Alexia alexia.ingerson at intel.com
Wed Mar 5 09:11:17 PST 2025


Date: 03/04/2025
Participants:
Alexia Ingerson (Intel)
Jianxin Xiong (Intel)
Ben Lynam (Cornelis)
Charles Shereda (Cornelis)
Ian Ziemba (HPE)
Jerome Soumagne (HPE)
Juee Desai (Intel)
Ken Raffenetti (ANL)
Sai Sunku (AWS)
Stephen Oost (Intel)

Summary:
2.1.0 RC1 out, RC2 scheduled for 3/8, GA scheduled for 3/15. Mark any cherry-picks with the new "for-2.1.x" label.

RPC issues with a persistent server and transient clients: a new client sees stale replies intended for an old client. Two PRs target this issue. #10837 updates the documentation to decouple RDM endpoints from error cases (a failure shouldn't close down the RDM endpoint). #10792 adds a new tag format that essentially allows unmatched messages to be ignored in this case. Opinions on #10792 were mixed: its use case seems very limited (and maybe temporary), and we don't want to include something too targeted in the API. Plan is to look into the tcp provider for a provider-specific implementation.
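
To make the stale-reply problem concrete, here is a minimal Python model of tagged-message matching. It is only a sketch, not libfabric code: the TaggedReceiver class and its method names are hypothetical, and the one rule it takes from libfabric is that bits set in the ignore mask are excluded from the tag comparison. The tags 0x1001 and 0x2001 are made-up values standing in for the old and new client's reply tags.

```python
from collections import deque


def tags_match(msg_tag: int, recv_tag: int, ignore: int = 0) -> bool:
    """Bits set in `ignore` are excluded from the comparison; ignore=0 means exact match."""
    return ((msg_tag ^ recv_tag) & ~ignore) == 0


class TaggedReceiver:
    """Hypothetical model of one endpoint's tagged receive path (not a libfabric API)."""

    def __init__(self):
        self.posted = deque()      # (tag, ignore) receives waiting for a message
        self.unexpected = deque()  # (tag, payload) messages with no matching receive
        self.completed = []        # (tag, payload) messages matched to a receive

    def post_recv(self, tag, ignore=0):
        # A newly posted receive first searches the unexpected queue.
        for i, (msg_tag, payload) in enumerate(self.unexpected):
            if tags_match(msg_tag, tag, ignore):
                del self.unexpected[i]
                self.completed.append((msg_tag, payload))
                return
        self.posted.append((tag, ignore))

    def deliver(self, msg_tag, payload):
        # An incoming message searches the posted receives, else it is queued as unexpected.
        for i, (tag, ignore) in enumerate(self.posted):
            if tags_match(msg_tag, tag, ignore):
                del self.posted[i]
                self.completed.append((msg_tag, payload))
                return
        self.unexpected.append((msg_tag, payload))


# The new client only ever posts receives for its own reply tag.
new_client = TaggedReceiver()
new_client.post_recv(tag=0x2001)
new_client.deliver(0x1001, "reply for old client")  # stale reply from the server
new_client.deliver(0x2001, "reply for new client")
```

In this model the stale reply is never matched and stays in the unexpected queue indefinitely, which is the behavior both PRs are trying to address.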

Notes:
Release 2.1.0 update:

  *   RC1 out 3/1/2025
  *   New branch v2.1.x
  *   RC2 scheduled for 3/8/2025
     *   Psm3 update
     *   Bug fixes for other providers
     *   New label "for-2.1.x"
  *   GA scheduled for 3/15/2025

RPC issues:

  *   Persistent server, transient clients
     *   Client should not bring down server
     *   New client sees stale replies intended for old client
        *   Tagged messages for replies
        *   Tag is specific for the reply
        *   Stale reply won't find a match (stuck in unexpected queue)
  *   Q: there were some concerns about returning EAGAIN. Is that still a concern?
     *   That's more of a provider-specific detail. This PR (10837) is just to update documentation
     *   EAGAIN isn't appropriate because we shouldn't retry (client side is already down)
  *   Trying to decouple RDM endpoint from regular error cases - failure shouldn't close down RDM endpoint
  *   Also added a new error type - unreachable EP, for when the client has died
  *   Other PR (10792) proposes new tag format to essentially allow ignoring unmatched messages
     *   Reason for specifying it as a tag format is more to define provider behavior
        *   Should we focus more on application behavior?
        *   IZ: agree we should focus on application behavior and leave it up to the provider to handle that
     *   IZ: What's missing from the PR is that it uses mem tag format but never explains what that means
        *   Exact match vs TAG_BITS to use ignore bits
        *   Mercury uses 32 bits but don't need to impose that for this definition. Don't use mask, match on entire tag
     *   Original ask was to drop unexpected messages but has turned into tag matching definitions. Are these related any more?
        *   We don't need it if we have a better way to handle it but not sure what that would look like
        *   Not really any other usage outside of this use case for tag format
     *   In efa, use timestamp to generate unique connection id so messages for old peers can be easily identified and dropped
        *   The issue seems to be within the provider - not being able to distinguish new connections
     *   Going to revisit so we don't introduce a new tag format for a limited use case and limited time just for one provider - will look into the tcp provider for a provider-specific implementation
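
The connection-id idea mentioned for efa can be sketched in a few lines of Python. This is an illustration of the general technique only, not EFA's actual implementation: the ClientEndpoint class, the new_connection_id helper, and the "client-addr" address are hypothetical, and appending a process-local counter to the timestamp is an assumption made here so that two ids created back to back cannot collide.

```python
import itertools
import time
from dataclasses import dataclass, field

_counter = itertools.count()  # assumption: tie-breaker for identical timestamps


def new_connection_id() -> int:
    """Derive a per-incarnation id from a timestamp plus a local counter."""
    return (time.time_ns() << 16) | (next(_counter) & 0xFFFF)


@dataclass
class ClientEndpoint:
    """Hypothetical client endpoint; each incarnation gets a fresh connection id."""
    address: str
    conn_id: int = field(default_factory=new_connection_id)
    received: list = field(default_factory=list)
    dropped: int = 0

    def on_message(self, dest_conn_id: int, payload: str) -> bool:
        # Drop anything addressed to a previous incarnation at this address.
        if dest_conn_id != self.conn_id:
            self.dropped += 1
            return False
        self.received.append(payload)
        return True


# The server captured the old client's id before that client went away.
old_client = ClientEndpoint("client-addr")
stale_dest_id = old_client.conn_id

# A new client comes up on the same address with a new id.
new_client = ClientEndpoint("client-addr")
new_client.on_message(stale_dest_id, "reply for old client")       # dropped
new_client.on_message(new_client.conn_id, "reply for new client")  # accepted
```

Because the check happens below the tag-matching layer, the stale reply never reaches the unexpected queue, and no new tag format is needed in the API.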
