Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible scheduling #2786

Merged
merged 201 commits into from
Aug 13, 2024
Merged

Flexible scheduling #2786

merged 201 commits into from
Aug 13, 2024

Conversation

jpbruinsslot
Copy link
Contributor

@jpbruinsslot jpbruinsslot commented Apr 4, 2024

Please refer to #204 for more background on this PR

Changes

Since this is a big PR with substantial changes, I've added the following for more guidance on review and qa. It is helpful to read the above-linked issue to get more context. And whenever you need me I can give you this as well in a call.

The main changes are:

  • Addition of a new Schedule model to support a more deterministic approach to (re)scheduling of tasks
  • Removal of the PrioritizedItem model and its database table. The queue is not a separate table anymore, a 'view' of the database filtered on status queued is the current state of the queued items of a scheduler.

Specific code changes of interest:

  • Model changes

    • Schedule model added to models/schedule.py
      • migrations added ( take specific note of the p_item change to data in the task table)
      • store added to storage/schedule_store.py
    • Removed PrioritizedItem
      • migration added for removal of items table
  • BoefjeScheduler changes

    • added a new process that adds new tasks to be picked up by checking the schedule of prior tasks: push_tasks_for_rescheduling()
  • NormalizerScheduler changes

    • minor changes made to support the new setup
  • Abstract base class Scheduler changes - some methods have been moved around so below are the main points of interest:

    • minor changes made to support the new setup
    • The post_push() method creates a Schedule for when a task has been added to the queue.
  • API server changes

    • Added endpoints for Schedule
    • Updated Task endpoints to support the new setup
  • Boefje Runner changes to support the new setup

  • Rocky changes to support the new setup

QA notes

  • Since this pr is now a replacement for the rescheduling of tasks within the scheduler the functionality of KAT should remain the same. So the main issues that could arise is from starting, (automatic) rescheduling of scans. All other actions of scanning that would touch the scheduler.
    • To check if schedule objects have been created for the boefje tasks, you can run through the onboarding, do some scans and check the schedules curl http://localhost:8004/schedules
  • Migrations, since there are substantial data migrations with this pr, care should be given on checking whether migrations from data prior to this pr will applied correctly with this pr.
    • Check out main
    • make reset
    • Walk through the onboarding, do some scans
    • Check out feature/mula/refactor-queue
    • docker compose kill shutdown all the containers
    • make kat (we don't want to reset we want the migration to be applied on the initial run)

Code Checklist

  • All the commits in this PR are properly PGP-signed and verified.
  • This PR only contains functionality relevant to the issue.
  • I have written unit tests for the changes or fixes I made.
  • I have checked the documentation and made changes where necessary.
  • I have performed a self-review of my code and refactored it to the best of my abilities.
  • Tickets have been created for newly discovered issues.
  • For any non-trivial functionality, I have added integration and/or end-to-end tests.
  • I have informed others of any required .env changes files if required and changed the .env-dist accordingly.
  • I have included comments in the code to elaborate on what is not self-evident from the code itself, including references to issues and discussions online, or implicit behavior of an interface.

Checklist for code reviewers:

Copy-paste the checklist from the docs/source/templates folder into your comment.


Checklist for QA:

Copy-paste the checklist from the docs/source/templates folder into your comment.

* main: (21 commits)
  Add IPv6 config to docker-compose.yml (#2256)
  Remove IPs with zero vulnerabilities (#2319)
  Translations update from Hosted Weblate (#2311)
  Chore/update pr template with comments (#2305)
  Sector report summary - Best and worst scoring security checks (#2312)
  Remove icons from compliance issue table (#2340)
  More ulimits for buggy celery (#2338)
  Remove smartphone from bug report template (#2334)
  add meta / cache hash for rpki boefje to raw output (#2255)
  Add max fds ulimit to octopoes api worker (#2327)
  Fix multiple Debian issues (#2283)
  Fix/upgrade jinja2 (#2326)
  Remove preselection from multireport flow (#2318)
  Updated template file to respect environment prefixes in docs (#2317)
  fix zero division (#2298)
  List item behaviour (#2281)
  Translations update from Hosted Weblate (#2279)
  Fixed invalid type usage in `get_rabbit_channel` and `close_rabbit_channel` (#2280)
  Translations update from Hosted Weblate (#2261)
  Fix export buttons report (#2259)
  ...
* main:
  Fix/394 Introduce clearance level control for objects imported by CSV (#2390)
  Update RabbitMQ to the latest version (#2392)
  Show created at and data from in reports (#2370)
  Rename invalid rpki finding to expired (#2377)
  Convert `docker-compose` to `docker compose` (#2341)
  Remove uWSGI (#2366)
  Prevent double github actions (#2374)
* main:
  Fix/394 Introduce clearance level control for objects imported by CSV (#2390)
  Update RabbitMQ to the latest version (#2392)
  Show created at and data from in reports (#2370)
  Rename invalid rpki finding to expired (#2377)
  Convert `docker-compose` to `docker compose` (#2341)
  Remove uWSGI (#2366)
  Prevent double github actions (#2374)
  Fix WEASYPRINT_BASEURL default value and change ports in docker-compose.yml (#2373)
  Remove debian11 packages (#2358)
  Error handling for Generate Report (#2274)
  Update dependencies (#2348)
  Add token authentication (#2349)
  Adds CAA records to the model, boefje, normalizer, adds a check bit and a finding (#2315)
  Check for sudo in install and update script (#2360)
  Feat/normalizer mimetype upload deeplink (#2220)
  Add hrefs to Basic Security overview (#2330)
  Render dicts and list ooi attrs as jsonfield (#2355)
…lexible-scheduling

* chore/mula/update-architecture-doc:
  Fix
  Fix diagram010.svg
  Update docs and restructure some code
  Update documentation
  Add diagrams
  Update architecture documentation
* main:
  Fix new boefjes issue for scheduler (#3297)
  Add audit logging to CRUD actions using Django signals (#3314)
  fix: Button height (#3316)
  Add user id to OOI (#3305)
  Translations update from Hosted Weblate (#3179)
  Restructure scheduler storage module (#3294)
  Fix CSRF error in API with token auth (#3313)
* main:
  Restructure scheduler server module (#3295)
* main:
  Restructure scheduler development scripts (#3293)
  Change report flow to POST requests (#3174)
Copy link
Contributor

@Donnype Donnype left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff @jpbruinsslot! Do you have a todo list or timeline on when we can merge this and make the switch?

rocky/tests/conftest.py Outdated Show resolved Hide resolved
boefjes/boefjes/clients/bytes_client.py Outdated Show resolved Hide resolved
mula/scheduler/server/handlers/queues.py Show resolved Hide resolved
mula/scheduler/server/handlers/queues.py Outdated Show resolved Hide resolved
Comment on lines 407 to 422
# We at least delay a job by the grace period
minimum = self.ctx.config.pq_grace_period
deadline = datetime.now(timezone.utc) + timedelta(seconds=minimum)

# We want to delay the job by a random amount of time, in a range of 5 hours
jitter_range_seconds = 5 * 60 * 60
jitter_offset = timedelta(
seconds=random.uniform(-jitter_range_seconds, jitter_range_seconds) # noqa
)

return True
# Check if the adjusted time is earlier than the minimum, and
# ensure that the adjusted time is not earlier than the deadline
adjusted_time = deadline + jitter_offset
adjusted_time = max(adjusted_time, deadline)

def is_item_on_queue_by_hash(self, item_hash: str) -> bool:
return self.queue.is_item_on_queue_by_hash(item_hash)
return adjusted_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that, if I read this correctly, this create a uniform distribution between "(minimal) deadline - 5 hours" and "deadline + 5 hours", and then sets everything below the deadline equal to the deadline. In practice this will converge to a situation where 50% of the time the adjusted_time equals the deadline and the other 50% is uniformly distributed over the 5 hours after the deadline. I think that's not what's intended? Perhaps the jitter offset should be:

jitter_offset = timedelta(seconds=random.uniform(0, 2*jitter_range_seconds))

To spread it out uniformly over the 10 hours after the (minimal) deadline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Donnype I've updated it, would you mind checking?

mula/scheduler/storage/schedule_store.py Show resolved Hide resolved
depends_on = None


def upgrade():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I challenge you to test it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, tested it with a minimal set. But would be good to fully test it

Copy link
Contributor

@ammar92 ammar92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just a few remarks. We've already discussed the PR in a high level overview offline

mula/Makefile Outdated Show resolved Hide resolved
mula/scheduler/context/context.py Outdated Show resolved Hide resolved
mula/scheduler/schedulers/normalizer.py Outdated Show resolved Hide resolved
mula/scheduler/server/serializers/schedule.py Outdated Show resolved Hide resolved
mula/scheduler/server/serializers/task.py Outdated Show resolved Hide resolved
from croniter import croniter


def next_run(expression: str, start_time: datetime = datetime.now(timezone.utc)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't assign datetime.now(timezone.utc) in the signature, otherwise it will always have the same value (based on the module import time). The Pythonic way is to assign a None as default, and assign a new datetime in the function body if the passed value is indeed None.

@jpbruinsslot
Copy link
Contributor Author

Good stuff @jpbruinsslot! Do you have a todo list or timeline on when we can merge this and make the switch?

Thanks! And thank you for your suggestions. I wanted to get the reviews out first. For the next steps, I'll implement the suggestions, get QA to look at it and focus on the migration for a bit to make sure that works out fine. When merged I'd like to have it deployed in order to test to uncover edge cases, and how it will fare with more throughput/load.

* main:
  Basic audit trails via logging (#3317)
  Raw upload with Scan OOIS (#3169)
  Fix Garbage collection and disappearing ports issue (#3214)
  Updated `Django` and `opentelemetry` packages (#3324)
@stephanie0x00
Copy link
Contributor

stephanie0x00 commented Aug 12, 2024

Checklist for QA:

  • I have checked out this branch, and successfully ran a fresh make reset.
  • I confirmed that there are no unintended functional regressions in this branch:
    • I have managed to pass the onboarding flow
    • Objects and Findings are created properly
    • Tasks are created and completed properly
  • I confirmed that the PR's advertised feature or hotfix works as intended.
  • I checked the logs for errors and/or warnings and made issues where necessary

What works:

General scheduling works on a 'make reset', tasks can be rescheduled, links in the UI to/from the task list seem to work. Findings are created.
Migrating also seems to work. Can still reschedule tasks, perform new scanning activities. Data of tasks list is still visible. Looks good 👍

What doesn't work:

When migrating (followed the steps as described in the QA notes), the task list history disappears and an error message is shown in the UI. The task statistics table does show that multiple tasks are present.
n/a

Bug or feature?:

n/a

@jpbruinsslot
Copy link
Contributor Author

Updated and amended qa instructions, for additional testing

@underdarknl underdarknl merged commit df6f526 into main Aug 13, 2024
21 checks passed
@underdarknl underdarknl deleted the feature/mula/refactor-queue branch August 13, 2024 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
mula Issues related to the scheduler
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Flexible scheduling
5 participants