Flexible scheduling #204

Closed
35 of 36 tasks
Tracked by #3065
jpbruinsslot opened this issue Jan 2, 2023 · 2 comments · Fixed by #2786 · May be fixed by minvws/nl-kat-mula#32
Labels
mula Issues related to the scheduler

jpbruinsslot commented Jan 2, 2023

Context

The scheduler currently uses the /random endpoint from octopoes to get OOIs for rescheduling tasks (ooi * available boefjes = tasks). Keeping track of recurring tasks within the scheduler itself allows for:

  • a more deterministic approach to scheduling and rescheduling tasks, so we don't rely on the randomness of the objects returned by octopoes
  • more fine-grained control over the scheduling of specific tasks, with the potential to schedule jobs independently of signals from octopoes, katalogus, and the octopoes random endpoint
  • scheduling other kinds of tasks, such as reports
  • scheduling tasks without an associated OOI, i.e. independently run boefjes
  • scheduling normalizer tasks, e.g. running a specific normalizer task at an interval regardless of a signal from bytes
  • better insight into which tasks are scheduled, at what time, and how many times a specific task has been executed
  • new ranking possibilities

Proposed solution

Changes

Change 1: Schedule model

Create a new model, Schedule, which contains the information needed to create a new Task. A Schedule has a one-to-many relationship with its Task instances. The updated entity relationship diagram for the scheduler:

erDiagram
task {
    id uuid PK
    scheduler_id str
    schedule_id uuid FK
    hash str
    priority int
    status taskstatus
    data jsonb
    created_at timestamp
    modified_at timestamp
}

schedule {
    id uuid PK
    scheduler_id str
    hash str
    data jsonb
    enabled bool
    schedule str
    tasks list[task]
    deadline_at timestamp
    created_at timestamp
    modified_at timestamp
}

task }o--|| schedule: ""
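
As a rough illustration of the one-to-many relationship in the diagram, the sketch below recreates both tables in an in-memory SQLite database. The column types are SQLite approximations (the diagram's uuid, jsonb and timestamp types are stored as TEXT here) and the inserted values are placeholders, so the actual scheduler schema may well differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

# `tasks list[task]` from the diagram is the relationship itself, not a column,
# so it is expressed through task.schedule_id instead.
conn.executescript(
    """
    CREATE TABLE schedule (
        id           TEXT PRIMARY KEY,
        scheduler_id TEXT NOT NULL,
        hash         TEXT,
        data         TEXT,
        enabled      INTEGER NOT NULL DEFAULT 1,
        schedule     TEXT,
        deadline_at  TEXT,
        created_at   TEXT,
        modified_at  TEXT
    );
    CREATE TABLE task (
        id           TEXT PRIMARY KEY,
        scheduler_id TEXT NOT NULL,
        schedule_id  TEXT REFERENCES schedule (id),
        hash         TEXT,
        priority     INTEGER,
        status       TEXT,
        data         TEXT,
        created_at   TEXT,
        modified_at  TEXT
    );
    """
)

# One schedule, two task instances created from it (placeholder values).
conn.execute(
    "INSERT INTO schedule (id, scheduler_id, hash) VALUES (?, ?, ?)",
    ("schedule-1", "boefje-org0", "1234"),
)
conn.executemany(
    "INSERT INTO task (id, scheduler_id, schedule_id, hash, priority, status) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("task-1", "boefje-org0", "schedule-1", "1234", 1, "COMPLETED"),
        ("task-2", "boefje-org0", "schedule-1", "1234", 1, "QUEUED"),
    ],
)

task_count = conn.execute(
    "SELECT COUNT(*) FROM task WHERE schedule_id = ?", ("schedule-1",)
).fetchone()[0]
print(task_count)
```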

Both models look as follows:

import uuid
from datetime import datetime, timezone

from pydantic import BaseModel, ConfigDict, Field


class TaskSchema(BaseModel):
    model_config = ConfigDict(from_attributes=True, validate_assignment=True)

    id: uuid.UUID = Field(default_factory=uuid.uuid4)
    scheduler_id: str
    hash: str | None = Field(None, max_length=32)
    data: dict | None = {}
    enabled: bool = True
    schedule: str | None = None
    tasks: list[Task] = []  # Task is the model shown further down
    deadline_at: datetime | None = None
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    modified_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

[!NOTE] This model is in the process of being renamed to Schedule.

class Task(BaseModel):
    model_config = ConfigDict(from_attributes=True)

    id: uuid.UUID = Field(default_factory=uuid.uuid4)
    scheduler_id: str
    schema_id: uuid.UUID | None = None
    # schema: TaskSchema  ## FIXME: naming conflict with pydantic .schema()
    priority: int | None = 0
    status: TaskStatus = TaskStatus.PENDING
    hash: str | None = Field(None, max_length=32)
    data: dict | None = {}
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    modified_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

The data field contains the data that a task runner needs to execute its task. In the case of a BoefjeScheduler this is a BoefjeTask:

{
  "id": "",
  "boefje": {},
  "input_ooi":  "",
  "organization": ""
}
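
To make the relation between the data payload and the 32-character hash field more concrete, here is a minimal sketch. The issue does not say how the hash is computed, so everything here is an assumption: the helper names make_boefje_task and task_hash are hypothetical, the content of the boefje object is reduced to a made-up id key, and an MD5 hex digest is used only because it happens to be 32 characters long.

```python
import hashlib
import json
import uuid


def make_boefje_task(boefje_id: str, input_ooi: str, organization: str) -> dict:
    """Build a payload with the same top-level keys as the BoefjeTask above."""
    return {
        "id": uuid.uuid4().hex,
        "boefje": {"id": boefje_id},  # assumption: the real object holds more
        "input_ooi": input_ooi,
        "organization": organization,
    }


def task_hash(data: dict) -> str:
    """Hypothetical hash: identical work yields an identical 32-char value."""
    # Leave out the per-instance id so that repeated runs of the same
    # boefje/ooi/organization combination map to the same hash.
    identity = {key: value for key, value in data.items() if key != "id"}
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


first = make_boefje_task("dns-records", "Hostname|internet|example.com", "org0")
second = make_boefje_task("dns-records", "Hostname|internet|example.com", "org0")
print(task_hash(first) == task_hash(second))
```

With a stable hash like this, a Schedule and the Task instances created from it could share the same hash value, as in the example tables in this issue.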

Change 2: Removal of the PrioritizedItem table and model

This change removes the PrioritizedItem model and table, which substantially reduces data duplication within the scheduler. The queue is now 'materialized' by filtering the task table on the status of a Task, i.e. all tasks with the status QUEUED. In practice, when a task is created (by manual entry, enabling of boefjes, rescheduling, etc.), it is added to the task table, and for every task that gets created a Schedule is created as well.

Below is an example of the Schedule and Task tables, in which a Schedule has already scheduled one Task in the past and has QUEUED another for the future.

Schedule table:

| id | scheduler_id | hash | data | manual | enabled | schedule | deadline_at | created_at | modified_at |
|---|---|---|---|---|---|---|---|---|---|
| 6f904 | boefje-org0 | 1234 | {} | true | true | 0 12 * * 1 | 2020-01-01 | 2019-01-01 | 2020-01-01 |

[!NOTE] A Schedule can be defined manually with a cron expression.

Task table:

| id | scheduler_id | schedule_id | hash | priority | status | data | created_at | updated_at |
|---|---|---|---|---|---|---|---|---|
| 59cb0 | boefje-org0 | 6f904 | 1234 | 1 | COMPLETED | {} | 2019-01-01 | 2019-01-01 |
| 122e2 | boefje-org0 | 6f904 | 1234 | 1 | QUEUED | {} | 2020-01-01 | 2020-01-01 |
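
The schedule value 0 12 * * 1 in the Schedule row above is a standard five-field cron expression: minute 0, hour 12, any day of the month, any month, day of the week 1 (Monday). The sketch below computes the next matching moment for this one expression with the standard library only. It is not a general cron parser, the function name is made up for the example, and the scheduler itself would more likely rely on a cron library.

```python
from datetime import datetime, timedelta, timezone


def next_monday_noon(after: datetime) -> datetime:
    """Next occurrence of `0 12 * * 1` (Monday 12:00) strictly after `after`."""
    candidate = after.replace(hour=12, minute=0, second=0, microsecond=0)
    # datetime.weekday() returns 0 for Monday; move forward to the next Monday
    # (or stay on the same day if it already is one).
    candidate += timedelta(days=(0 - candidate.weekday()) % 7)
    if candidate <= after:
        # Already at or past this week's slot, so take next week's.
        candidate += timedelta(days=7)
    return candidate


# 2020-01-01 was a Wednesday, so the next slot is Monday 2020-01-06 at 12:00.
print(next_monday_noon(datetime(2020, 1, 1, tzinfo=timezone.utc)))
# 2020-01-06 12:00:00+00:00
```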

When the task with the highest priority needs to be popped from the queue, the following code is executed:

def pop(self, scheduler_id: str, filters: FilterRequest | None = None) -> models.Task | None:
    with self.dbconn.session.begin() as session:
        # The queue is the set of QUEUED tasks of this scheduler, ordered by
        # priority first and creation time second.
        query = (
            session.query(models.TaskDB)
            .filter(models.TaskDB.status == models.TaskStatus.QUEUED)
            .order_by(models.TaskDB.priority.asc())
            .order_by(models.TaskDB.created_at.asc())
            .filter(models.TaskDB.scheduler_id == scheduler_id)
        )

        if filters is not None:
            query = apply_filter(models.TaskDB, query, filters)

        item_orm = query.first()
        if item_orm is None:
            return None

        return models.Task.model_validate(item_orm)

Rescheduling works as follows: the BoefjeScheduler queries the Schedule table for schedules whose deadline has passed, meaning a task needs to be scheduled now and pushed onto the queue. For each of these, a Task is created and inserted into the Task table. A task runner can then retrieve the task through the server API, run it, and update its status when it has finished.
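
The rescheduling step described above can be sketched with plain in-memory objects. This is only an illustration under stated assumptions: the dataclasses are stripped-down stand-ins for the real models, the reschedule function is hypothetical, and the new deadline is computed from a fixed interval because the issue does not specify how the next deadline is derived (for schedules with a cron expression it would presumably come from that expression).

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Schedule:
    id: str
    scheduler_id: str
    data: dict
    deadline_at: datetime
    enabled: bool = True


@dataclass
class Task:
    scheduler_id: str
    schedule_id: str
    data: dict
    status: str = "QUEUED"
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def reschedule(schedules: list[Schedule], now: datetime, interval: timedelta) -> list[Task]:
    """Create a QUEUED task for every enabled schedule whose deadline has passed."""
    new_tasks = []
    for schedule in schedules:
        if not schedule.enabled or schedule.deadline_at > now:
            continue
        new_tasks.append(
            Task(scheduler_id=schedule.scheduler_id, schedule_id=schedule.id, data=schedule.data)
        )
        # Assumption: push the deadline forward by a fixed interval.
        schedule.deadline_at = now + interval
    return new_tasks


now = datetime(2020, 1, 1, tzinfo=timezone.utc)
schedules = [
    Schedule("due", "boefje-org0", {}, now - timedelta(hours=1)),
    Schedule("later", "boefje-org0", {}, now + timedelta(hours=1)),
    Schedule("off", "boefje-org0", {}, now - timedelta(hours=1), enabled=False),
]
queued = reschedule(schedules, now, timedelta(days=1))
print([task.schedule_id for task in queued])  # ['due']
```

Calling reschedule again with the same now returns nothing, because the deadline of the due schedule has moved into the future.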

[Image: drawio diagram]

Tasklist

Pull Request

Associated issues

@jpbruinsslot jpbruinsslot linked a pull request Jan 2, 2023 that will close this issue
@jpbruinsslot jpbruinsslot self-assigned this Jan 5, 2023
@TwistMeister TwistMeister added enhancement New feature or request and removed feature request labels Feb 6, 2023
@dekkers dekkers added the mula Issues related to the scheduler label Feb 15, 2023
@dekkers dekkers transferred this issue from minvws/nl-kat-mula Feb 15, 2023
@jpbruinsslot jpbruinsslot changed the title [Feature] Recurring / rescheduling tasks [Mula] Recurring / rescheduling tasks Feb 16, 2023
@jpbruinsslot jpbruinsslot changed the title [Mula] Recurring / rescheduling tasks [Mula] Recurring / rescheduling jobs Feb 16, 2023
@jpbruinsslot jpbruinsslot changed the title [Mula] Recurring / rescheduling jobs [Mula] Scheduled jobs Feb 16, 2023
@jpbruinsslot jpbruinsslot linked a pull request Feb 16, 2023 that will close this issue
@jpbruinsslot jpbruinsslot changed the title [Mula] Scheduled jobs Recurring tasks May 31, 2023
@jpbruinsslot jpbruinsslot changed the title Recurring tasks Rescheduling recurring tasks May 31, 2023
@jpbruinsslot jpbruinsslot moved this from Backlog / Refined tasks to In Progress in KAT Jun 5, 2023
@jpbruinsslot jpbruinsslot moved this from In Progress to Backlog / Refined tasks in KAT Jun 5, 2023
@jpbruinsslot jpbruinsslot changed the title Rescheduling recurring tasks Rescheduling tasks Jun 5, 2023
@jpbruinsslot jpbruinsslot moved this from Backlog / Refined tasks to Todo (In this sprint) in KAT Jun 5, 2023
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to Backlog / Refined tasks in KAT Jun 5, 2023
@jpbruinsslot jpbruinsslot moved this from Backlog / Refined tasks to In Progress in KAT Jun 13, 2023
@jpbruinsslot jpbruinsslot moved this from In Progress to Todo (In this sprint) in KAT Jun 14, 2023
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to In Progress in KAT Jun 19, 2023
@jpbruinsslot jpbruinsslot removed a link to a pull request Jun 21, 2023
@jpbruinsslot jpbruinsslot linked a pull request Jun 21, 2023 that will close this issue
@jpbruinsslot jpbruinsslot moved this from In Progress to Todo (In this sprint) in KAT Jun 29, 2023
@jpbruinsslot jpbruinsslot moved this from In Progress to Todo (In this sprint) in KAT Jan 9, 2024
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to Backlog / Refined tasks in KAT Jan 11, 2024
@jpbruinsslot jpbruinsslot removed the enhancement New feature or request label Jan 18, 2024
@jpbruinsslot jpbruinsslot moved this from Backlog / Refined tasks to Todo (In this sprint) in KAT Jan 18, 2024
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to Backlog / Refined tasks in KAT Jan 31, 2024
@jpbruinsslot jpbruinsslot moved this from Backlog / Refined tasks to In Progress in KAT Mar 13, 2024
@jpbruinsslot jpbruinsslot moved this from In Progress to Backlog / Refined tasks in KAT Mar 13, 2024
@jpbruinsslot jpbruinsslot removed a link to a pull request Mar 14, 2024
@jpbruinsslot jpbruinsslot linked a pull request Mar 14, 2024 that will close this issue
@jpbruinsslot jpbruinsslot removed a link to a pull request Apr 9, 2024
@jpbruinsslot jpbruinsslot linked a pull request Apr 9, 2024 that will close this issue
@jpbruinsslot jpbruinsslot moved this from Backlog / Refined tasks to To be discussed in KAT Apr 11, 2024
@jpbruinsslot jpbruinsslot moved this from To be discussed to Todo (In this Sprint) in KAT Apr 18, 2024
@jpbruinsslot jpbruinsslot moved this from Todo (In this Sprint) to Backlog / Refined tasks in KAT Apr 18, 2024
@jpbruinsslot jpbruinsslot mentioned this issue Apr 25, 2024
@jpbruinsslot jpbruinsslot moved this to To be discussed in KAT May 20, 2024
@jpbruinsslot jpbruinsslot moved this from To be discussed to Backlog / To do in KAT May 21, 2024
@jpbruinsslot jpbruinsslot moved this from Backlog / To do to To be discussed in KAT May 22, 2024
@jpbruinsslot

The design has been agreed upon, with the advice to make sure that database normalization remains possible in the future.

Projects
Archived in project
5 participants