Pipeline improvement prerequisites #427

johnbaldwin · 2022-02-24T19:13:48Z

This PR contains updates to Figures as prerequisites to adding the second switchable workflow

Each PR has it's own description and context. So please read the changes in calendar order so you see the details to the story.

Why this PR exists and high level story

We need to improve Figures daily metrics collection job (the "daily pipeline") performance for one of our standalone deployments
The existing daily job is working on Tahoe and we don't want to break it
So we add a second daily job workflow
We enable the deployment to choose which workflow to run on a setting in Figures server-vars
After we prove the robustness of the second workflow, we will deprecate and then remove the original workflow

What's the scope of this PR

This PR contains code refactoring and changes needed to support the new workflow
This PR does NOT contain the actual second workflow code
These are broken up into two PRs for readability

What's IN this PR?

Added a Figures ENV_TOKENS flag to choose an override task to kick of the daily Celery top level task or use the default one: 926cd7c
Minor code cleanup found along the way that is opportune to include in this PR: 83687ca
Because it can be confusing to get the specific calendar day (24 hour period starting at 00:00 UTC) for "yesterdays" data collection, added a convenience function to figures.helpers: 42ce02d
Found a bug in Figures devsite settings when I needed to create a migration, so fixed that here: 5a5ff31
And the most important part of this PR, updated metrics collection handling to encapsulate and abstract "per enrollment" progress collection here: c6693fe

It's the last PR that is most important (I repeat myself because it is the most important. There, I've done it again). Anyway, with the new workflow, we will invert the relationship of LearnerCourseGradeMetrics (LCGM) and EnrollmentData where we write first to EnrollmentData (let's call him "ed") and use ed as the primary for data retrieval in the API (we started that already when we introduced ed as a model in figures). LCGM gets pushed back to be a historical record to track the time series per-enrollment performance

But we're not there yet, lots of refactoring to do to get there. What no. 5 commit above does is abstract this relationship from the rest of Figures and isolate the progress retrieval to make Figures safer, faster and more fun to program on

This commit adds ENV_TOKENS['FIGURES']['DAILY_TASK'] override to define which Celery task is called by CeleryBeat to run the daily data update

figures.sites.default_site() is returning the Site object matching `settings.SITE_ID` if we are running in single site mode and that is exactly what is called in single site mode for `get_site_for_course(course_id)`, so just making it that the source of truth is the same code

This is a convenience function to get the previous calendar day from th ecurrent UTC time. This is the date for which the pipeline uses to collect, save data and aggregate metrics for the daily jobs

Fixes: * Apparently devsite/settings.py broke with the Celery routing addition. This didn't impact `test_settings` but would fail to run devsite `manage.py` which relies on `devsite/settings.py`. This commit fixes that * Added `update_celery_routes` to `devsite/settings.py` as part of the Celery fix to make it to prepare for a better devsite Celery story to make Figures Celery task development a better experience Improvements: * Updated devsite's default Open edX version to Juniper

This commit contains changes to prepare for adding new daily metrics workflow to speed up daily job execution * Added new EnrollmentData model manager method to handle and encapsulate EnrollmentData/LCGM update logic given a `CourseEnrollment` record as parameter. Added unit tests for this new method. * Adds `collect_elapsed` field to EnrollmentData and LCGM to track how long it takes to collect the progress data for that enrollment. The big cost here is currently to `CourseGradeFactory()` * Added convenience method to StudentModuleFactory to create one form a `CourseEnrollment` record instead of having to pass in two variables (user and course_id) to construct a test StudentModule record to match an already constructed CourseEnrollment record

OmarIthawi · 2022-02-25T09:32:14Z

Thanks John! I'll review it today.

thraxil · 2022-02-25T10:04:05Z

figures/helpers.py

+    We primarily use this for the daily metrics collection. However, it proves
+    handy as a convenience function for exploring data in the Django shell
+    """
+    return prev_day(datetime.datetime.utcnow().date())


My standard advice anytime I see datetime.now() or an equivalent, is that you should allow a now parameter to be passed in in place of it. Ie,

def utc_yesterday(now=None): if not now: now = datetime.datetime.utcnow() return prev_day(now.date())

This makes it considerably easier to test, particularly for the weird edge cases that come up around dates (daylight savings, leap years, etc.) and eliminate the flakey tests that can come up when tests potentially behave differently depending on when they are run (what happens when this code is run exactly at midnight?).

Code that calls utc_yesterday() also ought to allow their "now" to be overridden for the same reasons, though they would then only need to pass that along to utc_yesterday(). That also tends to make it simpler to implement things like backfill operations, since you can run some code as if it were run at some different time.

You raise an excellent point make a good argument. I started to draft a reply, but then the reply started to grow longer. So I moved that off for later follow up so I can focus on getting part 2 in PR and come back to this topic. I just wanted to thank you now for reviewing so soon after I posted the PR before the weekend

OmarIthawi

Thanks @johnbaldwin! Looks good to me.

I've added one question about timeit and tests for utc_yesterday but none are blockers.

OmarIthawi · 2022-02-25T13:16:29Z

figures/models.py

@@ -5,6 +5,7 @@

 from __future__ import absolute_import
 from datetime import date
+import timeit


Do you timeit safe for production use? I don't know. I've checked https://docs.python.org/3/library/timeit.html and it shows a GCP related note:

Note By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. The disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string. For example

I think it's only related to the following usage timeit.Timer('for i in range(10): oct(i)', 'gc.enable()').timeit().

I think it's safe, but want to double check if we've missed something here.

Thank you Omar! To be on the safe side, I'll just capture start and end times without timeit

OmarIthawi · 2022-02-25T13:19:14Z

figures/helpers.py

@@ -188,6 +194,15 @@ def prev_day(val):
    return days_from(val, -1)


+def utc_yesterday():


Would you mind adding a test case for this function? One happy scenario should be enough.

* Replaced `timeit` with `time` to capture EnrollmentData collection timing * Added basic test for `figures.helpers.utc_yesterday`

johnbaldwin added 5 commits February 17, 2022 14:41

Enables alternate Celery daily task for Figures

926cd7c

This commit adds ENV_TOKENS['FIGURES']['DAILY_TASK'] override to define which Celery task is called by CeleryBeat to run the daily data update

Add utc_yesterday helper to figures.helpers

42ce02d

This is a convenience function to get the previous calendar day from th ecurrent UTC time. This is the date for which the pipeline uses to collect, save data and aggregate metrics for the daily jobs

johnbaldwin requested review from OmarIthawi, melvinsoft, shadinaif, estherjsuh and bryanlandia as code owners February 24, 2022 19:13

Merge branch 'main' into john/pipeline-perf-improvement-prep

906e5da

This was referenced Feb 25, 2022

[INFO] Update enrollment data handling - "Enrollment Data Next" #428

Open

Figures pipeline performance improvement #429

Merged

thraxil reviewed Feb 25, 2022

View reviewed changes

OmarIthawi approved these changes Feb 25, 2022

View reviewed changes

Address PR comment for #427

82b4c28

* Replaced `timeit` with `time` to capture EnrollmentData collection timing * Added basic test for `figures.helpers.utc_yesterday`

johnbaldwin merged commit 9b7cee6 into main Feb 28, 2022

johnbaldwin deleted the john/pipeline-perf-improvement-prep branch February 28, 2022 19:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pipeline improvement prerequisites #427

Pipeline improvement prerequisites #427

johnbaldwin commented Feb 24, 2022 •

edited

Loading

OmarIthawi commented Feb 25, 2022

thraxil Feb 25, 2022

johnbaldwin Feb 25, 2022

OmarIthawi left a comment

OmarIthawi Feb 25, 2022

johnbaldwin Feb 25, 2022

OmarIthawi Feb 25, 2022

johnbaldwin Feb 25, 2022

		@@ -188,6 +194,15 @@ def prev_day(val):
		return days_from(val, -1)


		def utc_yesterday():

Pipeline improvement prerequisites #427

Pipeline improvement prerequisites #427

Conversation

johnbaldwin commented Feb 24, 2022 • edited Loading

Why this PR exists and high level story

What's the scope of this PR

What's IN this PR?

OmarIthawi commented Feb 25, 2022

thraxil Feb 25, 2022

Choose a reason for hiding this comment

johnbaldwin Feb 25, 2022

Choose a reason for hiding this comment

OmarIthawi left a comment

Choose a reason for hiding this comment

OmarIthawi Feb 25, 2022

Choose a reason for hiding this comment

johnbaldwin Feb 25, 2022

Choose a reason for hiding this comment

OmarIthawi Feb 25, 2022

Choose a reason for hiding this comment

johnbaldwin Feb 25, 2022

Choose a reason for hiding this comment

johnbaldwin commented Feb 24, 2022 •

edited

Loading