Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline improvement prerequisites #427

Merged
merged 7 commits into from
Feb 28, 2022

Conversation

johnbaldwin
Copy link
Contributor

@johnbaldwin johnbaldwin commented Feb 24, 2022

This PR contains updates to Figures as prerequisites to adding the second switchable workflow

Each PR has it's own description and context. So please read the changes in calendar order so you see the details to the story.

Why this PR exists and high level story

  1. We need to improve Figures daily metrics collection job (the "daily pipeline") performance for one of our standalone deployments
  2. The existing daily job is working on Tahoe and we don't want to break it
  3. So we add a second daily job workflow
  4. We enable the deployment to choose which workflow to run on a setting in Figures server-vars
  5. After we prove the robustness of the second workflow, we will deprecate and then remove the original workflow

What's the scope of this PR

  1. This PR contains code refactoring and changes needed to support the new workflow
  2. This PR does NOT contain the actual second workflow code
  3. These are broken up into two PRs for readability

What's IN this PR?

  1. Added a Figures ENV_TOKENS flag to choose an override task to kick of the daily Celery top level task or use the default one: 926cd7c
  2. Minor code cleanup found along the way that is opportune to include in this PR: 83687ca
  3. Because it can be confusing to get the specific calendar day (24 hour period starting at 00:00 UTC) for "yesterdays" data collection, added a convenience function to figures.helpers: 42ce02d
  4. Found a bug in Figures devsite settings when I needed to create a migration, so fixed that here: 5a5ff31
  5. And the most important part of this PR, updated metrics collection handling to encapsulate and abstract "per enrollment" progress collection here: c6693fe

It's the last PR that is most important (I repeat myself because it is the most important. There, I've done it again). Anyway, with the new workflow, we will invert the relationship of LearnerCourseGradeMetrics (LCGM) and EnrollmentData where we write first to EnrollmentData (let's call him "ed") and use ed as the primary for data retrieval in the API (we started that already when we introduced ed as a model in figures). LCGM gets pushed back to be a historical record to track the time series per-enrollment performance

But we're not there yet, lots of refactoring to do to get there. What no. 5 commit above does is abstract this relationship from the rest of Figures and isolate the progress retrieval to make Figures safer, faster and more fun to program on

This commit adds ENV_TOKENS['FIGURES']['DAILY_TASK'] override to define
which Celery task is called by CeleryBeat to run the daily data update
figures.sites.default_site() is returning the Site object matching
`settings.SITE_ID` if we are running in single site mode and that is
exactly what is called in single site mode for
`get_site_for_course(course_id)`, so just making it that the source of
truth is the same code
This is a convenience function to get the previous calendar day from th
ecurrent UTC time. This is the date for which the pipeline uses to
collect, save data and aggregate metrics for the daily jobs
Fixes:

* Apparently devsite/settings.py broke with the Celery routing addition.
This didn't impact `test_settings` but would fail to run devsite
`manage.py` which relies on `devsite/settings.py`. This commit fixes
that
* Added `update_celery_routes` to `devsite/settings.py` as part of the
Celery fix to make it  to prepare for a
better devsite Celery story to make Figures Celery task development a
better experience

Improvements:

* Updated devsite's default Open edX version to Juniper
This commit contains changes to prepare for adding new daily metrics
workflow to speed up daily job execution

* Added new EnrollmentData model manager method to handle and
encapsulate EnrollmentData/LCGM update logic given a `CourseEnrollment`
record as parameter. Added unit tests for this new method.
* Adds `collect_elapsed` field to EnrollmentData and LCGM to track how long it takes to collect the progress data for that enrollment. The big cost here is currently to `CourseGradeFactory()`
* Added convenience method to StudentModuleFactory to create one form a
`CourseEnrollment` record instead of having to pass in two variables
(user and course_id) to construct a test StudentModule record to match
an already constructed CourseEnrollment record
@OmarIthawi
Copy link
Contributor

Thanks John! I'll review it today.

We primarily use this for the daily metrics collection. However, it proves
handy as a convenience function for exploring data in the Django shell
"""
return prev_day(datetime.datetime.utcnow().date())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My standard advice anytime I see datetime.now() or an equivalent, is that you should allow a now parameter to be passed in in place of it. Ie,

def utc_yesterday(now=None):
    if not now:
        now = datetime.datetime.utcnow()
    return prev_day(now.date())

This makes it considerably easier to test, particularly for the weird edge cases that come up around dates (daylight savings, leap years, etc.) and eliminate the flakey tests that can come up when tests potentially behave differently depending on when they are run (what happens when this code is run exactly at midnight?).

Code that calls utc_yesterday() also ought to allow their "now" to be overridden for the same reasons, though they would then only need to pass that along to utc_yesterday(). That also tends to make it simpler to implement things like backfill operations, since you can run some code as if it were run at some different time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You raise an excellent point make a good argument. I started to draft a reply, but then the reply started to grow longer. So I moved that off for later follow up so I can focus on getting part 2 in PR and come back to this topic. I just wanted to thank you now for reviewing so soon after I posted the PR before the weekend

Copy link
Contributor

@OmarIthawi OmarIthawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @johnbaldwin! Looks good to me.

I've added one question about timeit and tests for utc_yesterday but none are blockers.

@@ -5,6 +5,7 @@

from __future__ import absolute_import
from datetime import date
import timeit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you timeit safe for production use? I don't know. I've checked https://docs.python.org/3/library/timeit.html and it shows a GCP related note:

Note By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. The disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string. For example

I think it's only related to the following usage timeit.Timer('for i in range(10): oct(i)', 'gc.enable()').timeit().

I think it's safe, but want to double check if we've missed something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Omar! To be on the safe side, I'll just capture start and end times without timeit

@@ -188,6 +194,15 @@ def prev_day(val):
return days_from(val, -1)


def utc_yesterday():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind adding a test case for this function? One happy scenario should be enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

* Replaced `timeit` with `time` to capture EnrollmentData collection
timing
* Added basic test for `figures.helpers.utc_yesterday`
@johnbaldwin johnbaldwin merged commit 9b7cee6 into main Feb 28, 2022
@johnbaldwin johnbaldwin deleted the john/pipeline-perf-improvement-prep branch February 28, 2022 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants