Supervision

June 26, 2024

Supervision model #

This post is a first attempt at defining the supervision model for NuNet Actors. It is (of course) inspired by Akka’s supervision model and, more concretely, by the Proto.Actor model and its protoactor-go implementation. Our goal is to adapt those models and implementations to handle supervision between NuNet Actors (i.e. Allocations and Nodes), taking into account the parent-child relationships between them.

A note on Proto.Actor model #

The Proto.Actor supervision model is related to the Akka model and builds on top of the notions of Cluster, Node, ActorSystem and Actor (very superficially). It seems that in this model each actor has to belong to an actor system. Nodes (machines) can share an actor system by belonging to a single cluster, in which case actors can send messages to each other because they share the same address / PID space. Clusters in the Proto.Actor model are managed by … third-party cluster implementations such as Consul.io, etcd, etc. See Chat with Copilot about protoactor-go, and specifically Q5, for the context.

NuNet actor / supervision hierarchy and requirements #

In NuNet we have a slightly different picture, since Nodes (or DMSs) can join the network themselves and there is no process or server that runs the network as a separate object: a network is just a collection of nodes. However, we can have sub-networks (via the private-swarm and private-ip functionality). These cannot be considered actor systems, though.

NuNet Actors relations #

Here are the relations between Network, Node (DMS), Allocation and Executor in the platform. We need to design the supervision model on top of these:

flowchart TD
    Network --> |CanHaveMultiple| Node
    Node --> |CanSpawnMultiple| Allocation
    Node --> |CanSuperviseMultiple| Allocation
    Allocation -- CanSuperviseMultiple --> Allocation
    Node -- can join only one --> Network
    Allocation -->  |CanHaveMultiple| Executors
    Allocation -->  |CanSuperviseMultiple| Executors
  

Example deployment #

Suppose all nodes belong to a single network:

flowchart TD
    Node1 --> |deploysJob1| Allocation11
    Node1 --> |runs| Allocation11
    Allocation11 --> |runs| Executor111
    Allocation11 --> |runs| Executor112
    Node1 --> |deploysJob1| Allocation12
    Node1 --> |deploysJob1| Allocation13
    Node2 --> |runs| Allocation12
    Allocation12 --> |runs| Executor121
    Allocation12 --> |runs| Executor122
    Node3 --> |runs| Allocation13
    Allocation13 --> |runs| Executor131
    Allocation13 --> |runs| Executor132
  

All of the above belongs to a single job initiated by Node1, and can therefore be seen as forming a single parent-child hierarchy:

flowchart TD
    Node1 --> |parentOf| Allocation11
    Allocation11 --> |parentOf| Executor111
    Allocation11 --> |parentOf| Executor112
    Node1 --> |parentOf| Allocation12
    Node1 --> |parentOf| Allocation13
    Allocation12 --> |parentOf| Executor121
    Allocation12 --> |parentOf| Executor122
    Allocation13 --> |parentOf| Executor131
    Allocation13 --> |parentOf| Executor132
  

This simply defines parent-child relations independently of whether an allocation runs on the same node or on another node (by the way, we can also have different types of parent / child relations, as long as we define the heartbeat / healthcheck concepts). In any case, in order to orchestrate the deployment of Job1, Node1 has to observe the whole hierarchy via either heartbeats or healthchecks.
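As a rough illustration of the hierarchy above, here is a minimal Go sketch of a tree in which the hosting node is just an attribute and does not affect who the parent is. The `Entity` type, its fields and the `Descendants` method are hypothetical names chosen for this example, not part of any existing NuNet or protoactor-go API.

```go
package main

import "fmt"

// Entity is a hypothetical entry in the job hierarchy (Node, Allocation or Executor).
// Host records where the entity runs; it plays no role in the parent-child relation.
type Entity struct {
	ID       string
	Host     string
	Children []*Entity
}

// Descendants returns everything below e in depth-first order, i.e. the set
// that the job initiator has to observe.
func (e *Entity) Descendants() []*Entity {
	var out []*Entity
	for _, c := range e.Children {
		out = append(out, c)
		out = append(out, c.Descendants()...)
	}
	return out
}

func main() {
	// Part of the example deployment: Allocation12 runs on Node2 but is still a child of Node1.
	job := &Entity{ID: "Node1", Host: "Node1", Children: []*Entity{
		{ID: "Allocation11", Host: "Node1", Children: []*Entity{
			{ID: "Executor111", Host: "Node1"},
			{ID: "Executor112", Host: "Node1"},
		}},
		{ID: "Allocation12", Host: "Node2", Children: []*Entity{
			{ID: "Executor121", Host: "Node2"},
			{ID: "Executor122", Host: "Node2"},
		}},
	}}
	for _, d := range job.Descendants() {
		fmt.Printf("%s (runs on %s)\n", d.ID, d.Host)
	}
}
```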

Heartbeat #

A heartbeat is just a periodic message that a child actor sends to a parent actor saying ‘I am alive’. The decision about what to do with that message rests with the parent (the Proto.Actor model has a supervisor strategy concept for that; see code).

Important note: only actors can send heartbeat messages to other actors. In our case that means Nodes to Nodes and Allocations to Allocations, but not Executors to Allocations.
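The parent side of a heartbeat can be sketched roughly as follows. This is only an illustration under assumptions: `Heartbeat`, `HeartbeatMonitor`, its methods and `exampleTimeout` are hypothetical names, and the sketch covers the bookkeeping only (who has gone silent), leaving the reaction to the parent's supervisor strategy.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Heartbeat is a hypothetical "I am alive" message sent by a child actor.
type Heartbeat struct {
	ChildID string
	SentAt  time.Time
}

// HeartbeatMonitor is the parent-side bookkeeping: it remembers when each
// child was last heard from and nothing more.
type HeartbeatMonitor struct {
	timeout  time.Duration
	lastSeen map[string]time.Time
}

func NewHeartbeatMonitor(timeout time.Duration) *HeartbeatMonitor {
	return &HeartbeatMonitor{timeout: timeout, lastSeen: make(map[string]time.Time)}
}

// Watch registers a child so that it is reported even if it never sends a heartbeat.
func (m *HeartbeatMonitor) Watch(childID string, now time.Time) {
	m.lastSeen[childID] = now
}

// Receive records a heartbeat from a child.
func (m *HeartbeatMonitor) Receive(hb Heartbeat) {
	m.lastSeen[hb.ChildID] = hb.SentAt
}

// Silent returns, sorted by ID, the children whose last sign of life is older than the timeout.
func (m *HeartbeatMonitor) Silent(now time.Time) []string {
	var silent []string
	for id, seen := range m.lastSeen {
		if now.Sub(seen) > m.timeout {
			silent = append(silent, id)
		}
	}
	sort.Strings(silent)
	return silent
}

// exampleTimeout is an arbitrary value picked for the demo.
const exampleTimeout = 3 * time.Second

func main() {
	start := time.Now()
	m := NewHeartbeatMonitor(exampleTimeout)
	m.Watch("Allocation12", start)
	m.Watch("Allocation13", start)

	// Only Allocation12 reports in; four seconds after start Allocation13 is overdue.
	m.Receive(Heartbeat{ChildID: "Allocation12", SentAt: start.Add(2 * time.Second)})
	fmt.Println("silent children:", m.Silent(start.Add(4*time.Second)))
}
```

The monitor is deliberately passive: it takes timestamps as arguments instead of reading the clock, so the parent decides when to evaluate silence and what to do about it.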

Healthcheck #

Option 1 - healthcheck between actors #

From the actor model perspective, a healthcheck is a two-message process:

  1. A parent sends a message to a child ‘hey, what are you busy with?’ and closes the connection;
  2. A child gets a message and responds with a new message ‘yes, in response to your question, please be informed that I am alive and doing what I am supposed to do’ and closes the connection;

How long the parent waits for the child to answer (considering the child dead if no answer arrives within the TTL established for the healthcheck request), and what to do when the answer does arrive, is entirely the parent’s responsibility.
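The two-message exchange with a TTL could look roughly like the sketch below. Go channels stand in for actor mailboxes here; `HealthRequest`, `HealthResponse`, `runChild`, `CheckHealth` and `exampleTTL` are hypothetical names for this illustration, not an existing API.

```go
package main

import (
	"fmt"
	"time"
)

// HealthResponse is the child's answer to a healthcheck.
type HealthResponse struct {
	ChildID string
	Status  string
}

// HealthRequest is the parent's question; the child answers on ReplyTo.
type HealthRequest struct {
	ReplyTo chan<- HealthResponse
}

// runChild is a stand-in for a child actor's mailbox loop.
// An unresponsive child consumes requests but never answers them.
func runChild(id string, inbox <-chan HealthRequest, responsive bool) {
	for req := range inbox {
		if responsive {
			req.ReplyTo <- HealthResponse{ChildID: id, Status: "alive"}
		}
	}
}

// CheckHealth sends one request and waits at most ttl for the reply.
// The boolean is false when the child did not answer in time.
func CheckHealth(inbox chan<- HealthRequest, ttl time.Duration) (HealthResponse, bool) {
	reply := make(chan HealthResponse, 1) // buffered so a late reply never blocks the child
	timer := time.NewTimer(ttl)
	defer timer.Stop()

	select {
	case inbox <- HealthRequest{ReplyTo: reply}:
	case <-timer.C:
		return HealthResponse{}, false // mailbox full: request could not even be delivered
	}
	select {
	case resp := <-reply:
		return resp, true
	case <-timer.C:
		return HealthResponse{}, false
	}
}

// exampleTTL is an arbitrary value picked for the demo.
const exampleTTL = 200 * time.Millisecond

func main() {
	healthy := make(chan HealthRequest, 1)
	hung := make(chan HealthRequest, 1)
	go runChild("Allocation12", healthy, true)
	go runChild("Allocation13", hung, false)

	_, ok := CheckHealth(healthy, exampleTTL)
	fmt.Println("Allocation12 answered:", ok)
	_, ok = CheckHealth(hung, exampleTTL)
	fmt.Println("Allocation13 answered:", ok)
}
```

`CheckHealth` only reports whether an answer arrived within the TTL; whether a missing answer means restart, stop or escalate stays with the parent.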

Option 2 – healthcheck between Allocation and Executor #

In the case of parent-child relations between an Allocation and an Executor, a healthcheck cannot work as in Option 1, because Executors are not Actors in our model and cannot actively send messages. Therefore, each Allocation will need special code that monitors its executors, runs healthchecks on them and implements a supervisor strategy outside the Actor model.
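One possible shape for that special code is a probe that the Allocation calls directly, plus a small rule for what to do after repeated failures. Everything in this sketch is an assumption made for illustration: the `Executor` interface, the `ExecutorMonitor` type, the `Keep` / `Restart` actions and the "restart after N consecutive failures" rule are hypothetical and not taken from the NuNet codebase.

```go
package main

import "fmt"

// Executor is a hypothetical view of an executor as seen by its Allocation.
// It is not an actor: the Allocation calls Healthy() directly.
type Executor interface {
	ID() string
	Healthy() bool
}

// Action is what the Allocation decides after a probe.
type Action int

const (
	Keep Action = iota
	Restart
)

func (a Action) String() string {
	if a == Restart {
		return "restart"
	}
	return "keep"
}

// ExecutorMonitor asks for a restart after maxFailures consecutive failed probes.
type ExecutorMonitor struct {
	maxFailures int
	failures    map[string]int
}

func NewExecutorMonitor(maxFailures int) *ExecutorMonitor {
	return &ExecutorMonitor{maxFailures: maxFailures, failures: make(map[string]int)}
}

// Probe runs one healthcheck and returns the resulting action.
func (m *ExecutorMonitor) Probe(e Executor) Action {
	if e.Healthy() {
		m.failures[e.ID()] = 0
		return Keep
	}
	m.failures[e.ID()]++
	if m.failures[e.ID()] >= m.maxFailures {
		m.failures[e.ID()] = 0 // start counting afresh after the restart
		return Restart
	}
	return Keep
}

// scriptedExecutor is a test double that replays canned probe results.
type scriptedExecutor struct {
	id      string
	results []bool
}

func (s *scriptedExecutor) ID() string { return s.id }

func (s *scriptedExecutor) Healthy() bool {
	if len(s.results) == 0 {
		return true
	}
	r := s.results[0]
	s.results = s.results[1:]
	return r
}

func main() {
	exec := &scriptedExecutor{id: "Executor121", results: []bool{true, false, false}}
	monitor := NewExecutorMonitor(2)
	// In a real Allocation the probes would be driven by a timer, not a loop.
	for i := 0; i < 3; i++ {
		fmt.Println(exec.ID(), monitor.Probe(exec))
	}
}
```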

Extension for Nodes to supervise all Allocations #

It can happen (and will be quite a frequent case) that a Node runs Allocations belonging to a job that was initiated by another node; in our hierarchy these Allocations are not supervised by the node that runs them, which may be suboptimal. In this case we can have two supervisor strategies: one that each node uses to handle all allocations running locally (regardless of which job they belong to), and another that a node uses to supervise each job it initiated (for the latter I suggest having separate allocations and allowing allocations to supervise other allocations, possibly in the future).
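To make the two strategies concrete, the sketch below works out which strategy each node would apply to each allocation of the example deployment. The `Allocation` fields, the `Role` values and `RolesFor` are hypothetical names for this illustration; the sketch also assumes that both strategies apply when a node runs an allocation of its own job, which the discussion above leaves open.

```go
package main

import "fmt"

// Allocation carries the two facts that determine who supervises it.
type Allocation struct {
	ID       string
	JobOwner string // node that initiated the job
	RunsOn   string // node that hosts the allocation
}

// Role names the supervisor strategy a node applies to an allocation.
type Role string

const (
	LocalSupervisor Role = "local" // strategy for everything running on this node
	JobSupervisor   Role = "job"   // strategy for jobs initiated by this node
)

// RolesFor returns the strategies nodeID applies to a; empty means "not my concern".
func RolesFor(nodeID string, a Allocation) []Role {
	var roles []Role
	if a.RunsOn == nodeID {
		roles = append(roles, LocalSupervisor)
	}
	if a.JobOwner == nodeID {
		roles = append(roles, JobSupervisor)
	}
	return roles
}

func main() {
	// Job1 from the example deployment: initiated by Node1, spread over three nodes.
	allocations := []Allocation{
		{ID: "Allocation11", JobOwner: "Node1", RunsOn: "Node1"},
		{ID: "Allocation12", JobOwner: "Node1", RunsOn: "Node2"},
		{ID: "Allocation13", JobOwner: "Node1", RunsOn: "Node3"},
	}
	for _, node := range []string{"Node1", "Node2", "Node3"} {
		for _, a := range allocations {
			if roles := RolesFor(node, a); len(roles) > 0 {
				fmt.Printf("%s supervises %s as %v\n", node, a.ID, roles)
			}
		}
	}
}
```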

Proposed issues based on the above discussion: #

  1. SupervisorStrategy implementation for Actors;
  2. Heartbeat implementation for Actors (node and allocation);
  3. Healthcheck implementation for actors (as explained above);
  4. Healthcheck implementation for allocation->executor relations (as explained above);
  5. Relations:
    • job definitions should include heartbeat / healthcheck requirements;
  6. Possible issues:
    • What happens if a job description requires an executor to be healthchecked by an allocation that runs on another node? How are we going to implement this?