Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline improvement prerequisites #427

Merged
merged 7 commits into from
Feb 28, 2022
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions devsite/.env.example
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,8 @@ DATABASE_URL=mysql://figures_user:[email protected]:3306/figures-db
# Set which expected Open edX release mocks for devsite to use.
# Valid options are: "GINKGO", "HAWTHORN", "JUNIPER"
#
# If not specified here, then "HAWTHORN" is used
OPENEDX_RELEASE=JUNIPER
# If not specified here, then "JUNIPER" is used
# OPENEDX_RELEASE=JUNIPER

# Enable/disable Figures multisite mode in devsite
# This also requires
Expand Down
13 changes: 11 additions & 2 deletions devsite/devsite/settings.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
from figures.settings.lms_production import (
update_webpack_loader,
update_celerybeat_schedule,
update_celery_routes,
)

DEVSITE_BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
Expand All @@ -22,7 +23,7 @@
env = environ.Env(
DEBUG=(bool, True),
ALLOWED_HOSTS=(list, []),
OPENEDX_RELEASE=(str, 'HAWTHORN'),
OPENEDX_RELEASE=(str, 'JUNIPER'),
FIGURES_IS_MULTISITE=(bool, False),
ENABLE_DEVSITE_CELERY=(bool, True),
ENABLE_OPENAPI_DOCS=(bool, False),
Expand Down Expand Up @@ -240,8 +241,16 @@
# We have an empty dict here to replicate behavior in the LMS
ENV_TOKENS = {}

PRJ_SETTINGS = {
'CELERY_ROUTES': "app.celery.routes"
}

FIGURES_PIPELINE_TASKS_ROUTING_KEY = ""

update_webpack_loader(WEBPACK_LOADER, ENV_TOKENS)
update_celerybeat_schedule(CELERYBEAT_SCHEDULE, ENV_TOKENS)
update_celerybeat_schedule(CELERYBEAT_SCHEDULE, ENV_TOKENS, FIGURES_PIPELINE_TASKS_ROUTING_KEY)
update_celery_routes(PRJ_SETTINGS, ENV_TOKENS, FIGURES_PIPELINE_TASKS_ROUTING_KEY)


# Used by Django Debug Toolbar
INTERNAL_IPS = [
Expand Down
2 changes: 1 addition & 1 deletion devsite/devsite/test_settings.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ def root(*args):


env = environ.Env(
OPENEDX_RELEASE=(str, 'HAWTHORN'),
OPENEDX_RELEASE=(str, 'JUNIPER'),
)

environ.Env.read_env(join(dirname(dirname(__file__)), '.env'))
Expand Down
17 changes: 16 additions & 1 deletion figures/helpers.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
"""Helper functions to make data handling and conversions easier

# Figures 0.3.13 - Defining scope of this module
# Figures 0.4.x - Yep, this module is still handy and the scope hasn't exploded

The purpose of this module is to provide conveniece methods around commonly
executed statements. These convenience methods should serve as shorthand for
Expand Down Expand Up @@ -34,6 +35,8 @@

## What does not belong here?

* Most importantly, if you have to import from another figures module, it does
not belong here!
* "Feature" functionality does not belong here
* Long functions do not belong here
* Code that communicates outside of Figures does not belong here. No database,
Expand All @@ -42,7 +45,10 @@
This is not an exhaustive list. We'll grow it as needed.

An important point is that we should not expect this module to be a permanent
home for functionality.
home for the functionality included. As we evolve Figures, we may find functions
here that have a stronger context with another module. For example, we've got
a decent set of date oriented functions that are candidatdes for a datetime and
date handling module.
"""

from __future__ import absolute_import
Expand Down Expand Up @@ -188,6 +194,15 @@ def prev_day(val):
return days_from(val, -1)


def utc_yesterday():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind adding a test case for this function? One happy scenario should be enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

"""Get "yesterday" form the utc datetime

We primarily use this for the daily metrics collection. However, it proves
handy as a convenience function for exploring data in the Django shell
"""
return prev_day(datetime.datetime.utcnow().date())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My standard advice anytime I see datetime.now() or an equivalent, is that you should allow a now parameter to be passed in in place of it. Ie,

def utc_yesterday(now=None):
    if not now:
        now = datetime.datetime.utcnow()
    return prev_day(now.date())

This makes it considerably easier to test, particularly for the weird edge cases that come up around dates (daylight savings, leap years, etc.) and eliminate the flakey tests that can come up when tests potentially behave differently depending on when they are run (what happens when this code is run exactly at midnight?).

Code that calls utc_yesterday() also ought to allow their "now" to be overridden for the same reasons, though they would then only need to pass that along to utc_yesterday(). That also tends to make it simpler to implement things like backfill operations, since you can run some code as if it were run at some different time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You raise an excellent point make a good argument. I started to draft a reply, but then the reply started to grow longer. So I moved that off for later follow up so I can focus on getting part 2 in PR and come back to this topic. I just wanted to thank you now for reviewing so soon after I posted the PR before the weekend



def days_in_month(month_for):
_, num_days_in_month = calendar.monthrange(month_for.year, month_for.month)
return num_days_in_month
Expand Down
24 changes: 24 additions & 0 deletions figures/migrations/0016_add_collect_elapsed_to_ed_and_lcgm.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('figures', '0015_add_enrollment_data_model'),
]

operations = [
migrations.AddField(
model_name='enrollmentdata',
name='collect_elapsed',
field=models.FloatField(null=True),
),
migrations.AddField(
model_name='learnercoursegrademetrics',
name='collect_elapsed',
field=models.FloatField(null=True),
),
]
82 changes: 77 additions & 5 deletions figures/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@

from __future__ import absolute_import
from datetime import date
import timeit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you timeit safe for production use? I don't know. I've checked https://docs.python.org/3/library/timeit.html and it shows a GCP related note:

Note By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. The disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string. For example

I think it's only related to the following usage timeit.Timer('for i in range(10): oct(i)', 'gc.enable()').timeit().

I think it's safe, but want to double check if we've missed something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Omar! To be on the safe side, I'll just capture start and end times without timeit

from django.conf import settings
from django.contrib.sites.models import Site
from django.core.validators import MaxValueValidator, MinValueValidator
Expand All @@ -17,7 +18,7 @@
from model_utils.models import TimeStampedModel

from figures.compat import CourseEnrollment
from figures.helpers import as_course_key
from figures.helpers import as_course_key, utc_yesterday
from figures.progress import EnrollmentProgress


Expand Down Expand Up @@ -203,7 +204,7 @@ class EnrollmentDataManager(models.Manager):
EnrollmentData instances.

"""
def set_enrollment_data(self, site, user, course_id, course_enrollment=False):
def set_enrollment_data(self, site, user, course_id, course_enrollment=None):
"""
This is an expensive call as it needs to call CourseGradeFactory if
there is not already a LearnerCourseGradeMetrics record for the learner
Expand Down Expand Up @@ -259,6 +260,71 @@ def set_enrollment_data(self, site, user, course_id, course_enrollment=False):
defaults=defaults)
return obj, created

def update_metrics(self, site, course_enrollment, force_update=False):
"""
This is an expensive call as it needs to call CourseGradeFactory if
there is not already a LearnerCourseGradeMetrics record for the learner

"""
date_for = utc_yesterday()

# check if we already have a record for the date for EnrollmentData

# Alternately, we could use a try/except on a 'get' call, however, this
# would be much slower for a bunch of new enrollments

# Should only be one even if we don't include the site in the query
# because course_id should be globally unique
# If course id is ever NOT globally unique, then we need to add site
# to the query
ed_recs = EnrollmentData.objects.filter(
user_id=course_enrollment.user_id,
course_id=str(course_enrollment.course_id))

if not ed_recs or ed_recs[0].date_for < date_for or force_update:
# We do the update
start_time = timeit.default_timer()
# get the progress data
ep = EnrollmentProgress(user=course_enrollment.user,
course_id=str(course_enrollment.course_id))
defaults = dict(
date_for=date_for,
is_completed=ep.is_completed(),
progress_percent=ep.progress_percent(),
points_possible=ep.progress.get('points_possible', 0),
points_earned=ep.progress.get('points_earned', 0),
sections_possible=ep.progress.get('sections_possible', 0),
sections_worked=ep.progress.get('sections_worked', 0),
is_enrolled=course_enrollment.is_active,
date_enrolled=course_enrollment.created,
)
elapsed = timeit.default_timer() - start_time
defaults['collect_elapsed'] = elapsed

ed_rec, created = self.update_or_create(
site=site,
user=course_enrollment.user,
course_id=str(course_enrollment.course_id),
defaults=defaults)
# create a new LCGM record
# if it already exists for the day
LearnerCourseGradeMetrics.objects.update_or_create(
site=ed_rec.site,
user=ed_rec.user,
course_id=ed_rec.course_id,
date_for=date_for,
defaults=dict(
points_possible=ed_rec.points_possible,
points_earned=ed_rec.points_earned,
sections_worked=ed_rec.sections_worked,
sections_possible=ed_rec.sections_possible,
collect_elapsed=elapsed
)
)
return ed_rec, created
else:
return ed_recs[0], False


@python_2_unicode_compatible
class EnrollmentData(TimeStampedModel):
Expand Down Expand Up @@ -301,6 +367,9 @@ class EnrollmentData(TimeStampedModel):
sections_worked = models.IntegerField()
sections_possible = models.IntegerField()

# seconds it took to collect progress data
collect_elapsed = models.FloatField(null=True)

objects = EnrollmentDataManager()

class Meta:
Expand All @@ -324,7 +393,7 @@ def progress_details(self):


class LearnerCourseGradeMetricsManager(models.Manager):
"""Custom model manager for LearnerCourseGrades model
"""Custom model manager for LearnerCourseGradeMetrics model
"""
def latest_lcgm(self, user, course_id):
"""Gets the most recent record for the given user and course
Expand Down Expand Up @@ -414,9 +483,9 @@ class LearnerCourseGradeMetrics(TimeStampedModel):
Purpose is primarliy to improve performance for the front end. In addition,
data collected can be used for course progress over time

We're capturing data from figures.metrics.LearnerCourseGrades
We're capturing data from figures.progress.EnrollmentProgress

Note: We're probably going to move ``LearnerCourseGrades`` to figures.pipeline
Note: We're probably going to move ``EnrollmentProgress`` to figures.pipeline
since that class will only be needed by the pipeline

Even though this is for a course enrollment, we're mapping to the user
Expand Down Expand Up @@ -452,6 +521,9 @@ class LearnerCourseGradeMetrics(TimeStampedModel):
sections_worked = models.IntegerField()
sections_possible = models.IntegerField()

# seconds it took to collect progress data
collect_elapsed = models.FloatField(null=True)

objects = LearnerCourseGradeMetricsManager()

class Meta:
Expand Down
6 changes: 5 additions & 1 deletion figures/settings/lms_production.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@
import os
from celery.schedules import crontab

FIGURES_DEFAULT_DAILY_TASK = 'figures.tasks.populate_daily_metrics'


class FiguresRouter(object):

Expand Down Expand Up @@ -64,8 +66,10 @@ def update_celerybeat_schedule(
https://stackoverflow.com/questions/51631455/how-to-route-tasks-to-different-queues-with-celery-and-django
"""
if figures_env_tokens.get('ENABLE_DAILY_METRICS_IMPORT', True):
figures_daily_task = figures_env_tokens.get('DAILY_TASK',
FIGURES_DEFAULT_DAILY_TASK)
celerybeat_schedule_settings['figures-populate-daily-metrics'] = {
'task': 'figures.tasks.populate_daily_metrics',
'task': figures_daily_task,
'schedule': crontab(
hour=figures_env_tokens.get('DAILY_METRICS_IMPORT_HOUR', 2),
minute=figures_env_tokens.get('DAILY_METRICS_IMPORT_MINUTE', 0),
Expand Down
2 changes: 1 addition & 1 deletion figures/sites.py
Original file line number Diff line number Diff line change
Expand Up @@ -134,7 +134,7 @@ def get_site_for_course(course_id):
site = None
else:
# Operating in single site / standalone mode, return the default site
site = Site.objects.get(id=settings.SITE_ID)
site = default_site()
return site


Expand Down
14 changes: 14 additions & 0 deletions tests/factories.py
Original file line number Diff line number Diff line change
Expand Up @@ -221,6 +221,20 @@ class Meta:
2018,2,2, tzinfo=factory.compat.UTC))


@classmethod
def from_course_enrollment(cls, course_enrollment, **kwargs):
"""Contruct a StudentModule for the given CourseEnrollment

kwargs provides for additional optional parameters if you need to
override the default factory assignment
"""
kwargs.update({
'student': course_enrollment.user,
'course_id': course_enrollment.course_id,
})
return cls(**kwargs)


if OPENEDX_RELEASE == GINKGO:
class CourseEnrollmentFactory(DjangoModelFactory):
class Meta:
Expand Down
Loading