Parent-child hierarchy, allocations starting allocations and failure tolerance

June 5, 2024

Parent-child relations between actors #

The out-of-the-box actor model does not assume any formalism or structure for hierarchical relations between actors (or, in fact, any relations at all), and therefore has no notion of a control plane – which makes it unsuitable on its own for complex computing tasks, which need fault tolerance, error propagation, control flow management, etc.

Borrowing from Offer Networks / Computational model description:

[..] Actor model (as well as the concept of Open-ended intelligence) is fairly abstract and does not define actual mechanisms of communications between actors and formation of their collectives able to solve complex tasks. The issue here is of course introducing the right set of constraints that will make the model more concrete without losing the essential characteristics which we would like to understand (see also the community discussion on the topic). […]

One solution that I am familiar with, and got to know via Lightbend’s Akka platform, is its supervision and monitoring system. We want to use the same principles to build primitives for managing control plane(s) in the decentralized computing framework. The main principles are:

  1. Akka uses the notion of an Actor System, which builds on the actor model’s premise that an actor can be created by another actor and makes it an explicit constraint: an actor can be created only by another actor.

  2. Therefore, when an actor in the Actor System is created, it will always know at least one other actor – the one which created it. This means that every actor, conceptually, has an explicit parent. If we allow an explicit representation of such parent-child relationships between all actors (in NuNet’s case, using the Graph interface / formalism to represent different types of topological links), then we have this principle in implementation.

  3. Using the above, we can now define hierarchical structures between actors in the framework. In Akka and other centralized computing frameworks, the premise is that any Actor System has a single root parent, which creates the whole system / framework. For the decentralized computing framework, we can relax the requirement of a single root parent and allow multiple hierarchies to coexist within the same computational medium (which can otherwise be called a namespace / address space, etc.).

  4. We can now define the hierarchy a bit more clearly as a DAG – Directed Acyclic Graph. Each fully instantiated computing deployment is a DAG of NuNet Actors, which are Nodes and Allocations as per our ontology. Note that this DAG is defined only by parent-child relations between actors and does not prevent other relations between actors (like neighbors, gpu-owners, etc.) from being defined and used as needed. A property graph model with semantically biased graph traversal makes this rather straightforward to implement.

  5. This DAG of actors is constructed in the process of orchestrating a NuNet job. NuNet job definition / semantics should be defined in a way that a job description allows constructing such a DAG – and the currently proposed and discussed Job type allows just that. In fact, a Job definition itself most probably should be a DAG too.

  6. So, conceptually, NuNet job orchestration is nothing more than unwrapping one DAG (the Job struct) into another DAG (the parent-child hierarchy of Nodes and Allocations) and overlaying the control structures of each computing workflow / job on top of the otherwise more complex and messy NuNet platform Graph.

  7. In fact, we can do even more than that – we can define how control structures emerge and change during computation itself, which is directly related to the open-ended decentralized computing model. That would build a basis for load balancing, upwards and downwards scalability on demand, etc. (besides the basic fault tolerance functionality). But first, for both pre-defined DAG deployment and dynamic deployment, we need to figure out how this will be implemented.

Ability of Allocations to create other Allocations #

Conceptually, this can be achieved by the principle, first proposed in nunet ontology and nomenclature, that an allocation should be able to spawn another allocation. Recall that every job is invoked and executed as an Allocation, which extends Actor (so it has an address, mailbox, behavior, etc.); therefore the principle that an Allocation can start another Allocation is perfectly aligned with the Actor model. We may, however, need to distinguish between a Node creating Allocations and an Allocation creating Allocations, as well as how Nodes themselves are created.

  1. Nodes will be created by compute providers / machine owners; therefore each Node could potentially become a ‘root actor’ within a hierarchy;
  2. A Node should be able to create an Allocation, which is the main primitive and first step of any job orchestration;
  3. Further, every Allocation can create ‘child’ Allocations on potentially any fitting Node registered in the platform; this is an essential functionality / property that will allow us to approach the ‘open-ended computing model’ in the NuNet platform.
  4. In order to implement the first principle of the previous section, we will make it a strict requirement that an Allocation can be created only via a message issued from a Node or another Allocation; that message will create the new Allocation and establish a parent-child relation between them.

The remaining question then is where in the platform this logic will be executed.

Native executor #

In principle there are three options (which, most probably, will need to be combined / balanced to achieve a proper implementation of the above principles):

  1. A DMS package / plugin (or just a few functions in the orchestrator package) that is able to execute the logic described above and defined in the implementation proposal (i.e. receive a `createActor()` message, instantiate an Allocation, connect it to other related Jobs, etc.);
  2. An SDK / additional functionality of each Allocation besides the Job that is running on it (note that, as per the current orchestration proposal, an Allocation is defined primarily as an execution environment that runs e.g. the docker container defined in the Job struct);
  3. We may want to consider a sort of ‘jobless allocation’ which would only run orchestration logic. For that purpose it seems reasonable to define and implement a ‘native executor’ on NuNet (using the same Actor, Allocation, Executor, etc. interfaces) which would be an Actor in the framework, but would take care of decentralized orchestration as defined by job requirements. An example where this may already be needed is, e.g., orchestration of a kubernetes job on NuNet, where a group of jobs should be grouped under a single job, but there is no container defined at the head of the hierarchy.
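The third option can be pictured as two implementations of one executor interface: the usual container-backed one, and a ‘native’ one whose only payload is orchestration logic for child allocations. This is a minimal sketch under assumed names (`Executor`, `DockerExecutor`, `NativeExecutor`), not the actual NuNet interfaces:

```go
package main

import "fmt"

// Executor is a minimal stand-in for the executor interface mentioned above.
type Executor interface {
	Run(job string) string
}

// DockerExecutor stands in for the usual container-backed execution path.
type DockerExecutor struct{}

func (DockerExecutor) Run(job string) string {
	return "container running " + job
}

// NativeExecutor runs no container; it only orchestrates child allocations,
// e.g. for a kubernetes-style group of jobs with no container at the head
// of the hierarchy.
type NativeExecutor struct {
	children []string
}

func (n *NativeExecutor) Run(job string) string {
	// In a real system this would issue allocation-creation messages;
	// here we only record the intended children for illustration.
	n.children = append(n.children, job+"/worker-1", job+"/worker-2")
	return fmt.Sprintf("orchestrating %d child allocations", len(n.children))
}

func main() {
	var e Executor = DockerExecutor{}
	fmt.Println(e.Run("job-x")) // container running job-x
	e = &NativeExecutor{}
	fmt.Println(e.Run("group-y")) // orchestrating 2 child allocations
}
```

Since both executors satisfy the same interface, a jobless allocation at the head of a hierarchy would be indistinguishable from any other Allocation to the rest of the framework.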

Finished here 2024-06-05 11:56 CET for the purpose of today’s Mission Control. To be developed further.

Maintainer: Kabir (please tag all edit merge requests accordingly)