[Ofmfwg] Questions on the Sunfish response to Agent registration Event

Herrell, Russ W (Senior System Architect) russ.herrell at hpe.com
Wed Aug 16 17:35:40 PDT 2023


A long message, because it is an entire proposal that we should discuss this Friday:

I think we finally have the Sunfish reference repo updated to recognize a new Agent registering itself with the Sunfish Service by sending an event with the appropriate MessageId:
"MessageId": "Manager.1.0.AggregationSourceDiscovered"

Next we need the Sunfish Service to extract the inventory of resources being managed / represented by the new Agent.
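
Before the inventory step can start, Sunfish has to pick the registration record out of an incoming event. A minimal sketch of that check follows. The "Events" array and "OriginOfCondition" link follow the general Redfish Event layout; the function name, the example URIs, and the idea that OriginOfCondition identifies the new Agent are my assumptions for illustration, not what the reference repo necessarily does.

```python
REGISTRATION_MESSAGE_ID = "Manager.1.0.AggregationSourceDiscovered"


def find_new_agents(event_body):
    """Return the OriginOfCondition link of every registration record.

    Hypothetical helper: assumes the event body carries an "Events" list
    and that each registration record points at the new Agent.
    """
    agents = []
    for record in event_body.get("Events", []):
        if record.get("MessageId") != REGISTRATION_MESSAGE_ID:
            continue  # some other event type; not a registration
        origin = record.get("OriginOfCondition", {})
        if "@odata.id" in origin:
            agents.append(origin["@odata.id"])
    return agents


# Illustrative payload only; the URIs are made up.
example_event = {
    "Events": [
        {
            "MessageId": REGISTRATION_MESSAGE_ID,
            "OriginOfCondition": {
                "@odata.id": "/redfish/v1/AggregationService/AggregationSources/agent1"
            },
        },
        {
            "MessageId": "ResourceEvent.1.0.ResourceChanged",
            "OriginOfCondition": {"@odata.id": "/redfish/v1/Systems/1"},
        },
    ]
}

new_agents = find_new_agents(example_event)
```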


The old write-up "OFMF_server_and Agent_interactions_based_on_events" lists two options by which the Sunfish Service might accomplish this:

  1.  Sunfish grabs a Snapshot synchronization from the agent
     *   Sunfish does an acknowledgement of some kind to the Agent, and the Agent sends a subsequent Event with a payload of Redfish Objects that the Agent wants the Sunfish Service to install into its resource tree.
     *   Sunfish unpacks the possibly lengthy Event payload and does the install of each object
     *   I have several questions about the details, but I'm not going to bring them up here.
  2.  Sunfish does Recursive Inspection to crawl-out the Redfish model of the Agent's resources.
     *   Sunfish starts querying the Agent's Redfish resources, and chasing down all linked resources until it has a copy of the Agent's Redfish resource tree.
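
To make option 2 concrete, here is a rough sketch of the crawl: a breadth-first walk that follows every "@odata.id" link it finds until nothing new turns up. The `fetch` callable stands in for an HTTP GET against the Agent, and `FAKE_AGENT` is a made-up four-resource mockup, so treat all names and URIs as hypothetical.

```python
from collections import deque


def find_links(node):
    """Yield every "@odata.id" value nested anywhere in a JSON-like object."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@odata.id" and isinstance(value, str):
                yield value
            else:
                yield from find_links(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_links(item)


def crawl(fetch, start_uri):
    """Breadth-first copy of every resource reachable from start_uri."""
    tree = {}
    queue = deque([start_uri])
    while queue:
        uri = queue.popleft()
        if uri in tree:
            continue  # already copied; also breaks link cycles
        resource = fetch(uri)  # one ordinary single-object GET
        tree[uri] = resource
        for link in find_links(resource):
            if link not in tree:
                queue.append(link)
    return tree


# Made-up Agent content, keyed by URI.
FAKE_AGENT = {
    "/redfish/v1/Fabrics": {
        "@odata.id": "/redfish/v1/Fabrics",
        "Members": [{"@odata.id": "/redfish/v1/Fabrics/F1"}],
    },
    "/redfish/v1/Fabrics/F1": {
        "@odata.id": "/redfish/v1/Fabrics/F1",
        "Endpoints": {"@odata.id": "/redfish/v1/Fabrics/F1/Endpoints"},
    },
    "/redfish/v1/Fabrics/F1/Endpoints": {
        "@odata.id": "/redfish/v1/Fabrics/F1/Endpoints",
        "Members": [{"@odata.id": "/redfish/v1/Fabrics/F1/Endpoints/E1"}],
    },
    "/redfish/v1/Fabrics/F1/Endpoints/E1": {
        "@odata.id": "/redfish/v1/Fabrics/F1/Endpoints/E1",
        "Links": {"Endpoints": [{"@odata.id": "/redfish/v1/Fabrics/F1/Endpoints/E1"}]},
    },
}

tree = crawl(FAKE_AGENT.__getitem__, "/redfish/v1/Fabrics")
```

Each resource is fetched and handled on its own, so a failure on one object can be dealt with by ordinary per-object policy rather than by unwinding a large payload. A real version would also need error handling around `fetch` for links that do not resolve.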

Both options have similar burdens on the Agent to assign Redfish IDs and craft navigation links as required.  I'm not sure that either tactic saves the Agent or Sunfish much work.
The two big differences:

  1.  Snapshot synchronization requires a deep dive into failure cases:  If Sunfish cannot parse one object in a 100 object payload, what do we do to recover?  Recursive Inspection can follow all the rules for single objects; we can apply different policies in Sunfish and not have to alter the Agent - Sunfish interface (it's still just Redfish).
  2.  Recursive Inspection lacks the atomic read feature:  if the Agent's resources change state while Sunfish is issuing multiple GETs, Sunfish and the Agent may not end up in sync without additional policies to guarantee a 'snapshot' result.

I'd like to propose that #2, Recursive Inspection, is the easier technique to implement and is a more robust method when working with different agents and different fabrics.  The need to resolve non-atomic queries is present at runtime anyway, so we may need to solve that problem just once, for all fabrics and all agents and clients.  #1 will require additional 'Events' in the Sunfish - Agent interface. #2 may require that we start tracking change events before Sunfish starts making queries.  The latter seems easier to do at the moment.
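
One way the "track change events first" idea could look, as a sketch only: start recording changes before the first GET, then re-fetch whatever was reported as changed until a round completes with no changes. `FakeAgent`, `drain_changes`, and `synchronized_copy` are hypothetical names; in a real system the dirty set would be fed by change events the Agent sends to Sunfish.

```python
class FakeAgent:
    """Hypothetical stand-in for an Agent that records which URIs changed."""

    def __init__(self, resources):
        self.resources = dict(resources)
        self.dirty = set()

    def get(self, uri):
        return dict(self.resources[uri])

    def update(self, uri, **fields):
        self.resources[uri] = {**self.resources[uri], **fields}
        self.dirty.add(uri)  # stands in for a change event sent to Sunfish

    def drain_changes(self):
        """Return the URIs changed since the last call and reset the record."""
        changed, self.dirty = self.dirty, set()
        return changed


def synchronized_copy(fetch, uris, drain_changes, max_rounds=5):
    """Copy uris, re-fetching anything reported changed, until a quiet round."""
    copy = {}
    pending = set(uris)
    for _ in range(max_rounds):
        for uri in sorted(pending):
            copy[uri] = fetch(uri)
        pending = drain_changes()
        if not pending:
            return copy
    raise RuntimeError("resources kept changing; no consistent copy obtained")


agent = FakeAgent({"/a": {"State": "Enabled"}, "/b": {"State": "Enabled"}})
fired = []


def racing_get(uri):
    # Simulate "/a" changing on the Agent while Sunfish is still reading "/b".
    if uri == "/b" and not fired:
        fired.append(True)
        agent.update("/a", State="Disabled")
    return agent.get(uri)


result = synchronized_copy(racing_get, ["/a", "/b"], agent.drain_changes)
```

The same loop would serve at runtime, which is the sense in which the non-atomic problem only has to be solved once.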

Regardless of which way we go on the above option for Agent inventory discovery by Sunfish, we need to get that block of code written and installed into the Sunfish reference.

My proposal:

  1.  We use Recursive Inspection
  2.  We create a self-invoked task inside Sunfish upon receipt of the agent registration event
  3.  We write a Recursive Inspection routine that starts at the Fabrics Collection of the Agent, and finds all the endpoints of the fabric and all the related objects
     *   The Agent tags each retrieved object as 'clean' when it sends the object in the response, and arms a notification event to Sunfish for any later change to that object
     *   Recursive Inspection creates Systems and Chassis and other collections as required to place all linked fabric endpoints
     *   Recursive Inspection needs some mechanism to make new objects 'not ready for use' until all recursion is finished
  4.  We test this Recursive Inspection process by launching it against a Swordfish Emulator that is sourcing a suitable default storage fabric mockup
     *   We fire up a little testbed which brings up a Sunfish emulator and a Swordfish emulator, and sends the Sunfish emulator an appropriately configured Event registering the Swordfish emulator as the new AggregationSource and triggering the Recursive Inspection
     *   All the GETs from Sunfish to Swordfish should work
     *   The storage mockups should end up in the Sunfish resource tree
  5.  We extend this testbed into an actual Swordfish Appliance Agent by creating the appropriate enhancements to a copy of the Swordfish Emulator
     *   This will define the 'architected rules' that an existing Swordfish API hosted directly on a Swordfish storage appliance would have to follow to allow Sunfish to proxy all API and Events to the end clients.
     *   This will define the role that a shim Swordfish Agent must serve when coupling Sunfish to a Swordfish Appliance API that isn't strictly compliant with Sunfish requirements.

          i.  That is, when the Swordfish API of an existing appliance cannot be modified to register with Sunfish and let Sunfish proxy the API to clients, we need a simple shim layer that fills in the gaps
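
For the 'not ready for use' requirement in step 3, one possible mechanism is to stage crawled objects out of sight of clients and publish them only when the whole recursion has finished. The sketch below shows that idea with an in-memory tree; `ResourceTree`, `stage`, `commit`, and `import_agent` are hypothetical names, not part of the Sunfish reference code.

```python
class ResourceTree:
    """Hypothetical resource store with a staging area clients cannot see."""

    def __init__(self):
        self._live = {}
        self._staged = {}

    def stage(self, uri, resource):
        self._staged[uri] = resource  # not yet visible to clients

    def commit(self):
        self._live.update(self._staged)  # publish everything at once
        self._staged.clear()

    def discard(self):
        self._staged.clear()

    def get(self, uri):
        return self._live.get(uri)


def import_agent(tree, crawled):
    """Stage every (uri, resource) pair; publish only if the crawl completes."""
    try:
        for uri, resource in crawled:
            tree.stage(uri, resource)
    except Exception:
        tree.discard()  # a failed crawl leaves the live tree untouched
        raise
    tree.commit()


def broken_crawl():
    yield "/redfish/v1/Fabrics/F2", {"Name": "partial"}
    raise IOError("Agent stopped responding")


tree = ResourceTree()
import_agent(tree, [("/redfish/v1/Fabrics/F1", {"Name": "storage fabric"})])

try:
    import_agent(tree, broken_crawl())
    crawl_failed = False
except IOError:
    crawl_failed = True
```

The self-invoked task from step 2 would be the caller of something like `import_agent`, fed by the crawl routine.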

Once we have this Recursive Inspection block working, we can mock up multiple fabric Agents by using different fabric mockups on the Agent emulator.
The code for the recursive crawl through a fabric's resources can be leveraged as a client-side resource discovery routine, with appropriate additions or subtractions to properly interpret the sought-after resources (like FAM or NVMeoF or GPUs) or ignore irrelevant resources.

If we can agree on this path forward, then we can negotiate who will write the Recursive Inspection block and build the Sunfish-Agent testbed pieces....

Thoughts?

Russ


