You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The purpose of this issue is to briefly describe an identified performance issue with collecting per-enrollment data and a high level of Figures next iteration for handling per-enrollment data collection and management.
The problem
Figures daily metrics job gets slower as the number of StudentModule records gets bigger. In the daily job, Figures checks StudentModule records for any changes since the last time we saved a LearnerCourseGradeMetrics record for the enrollment. We check the modified data of the latest StudentModule for the enrollment against the latest date_for of the enrollment's LearnerCourseGradeMetrics records. If the StudentModule is newer, we update Figures data store for the enrollment. Aside: I'm sorry, that model name is awful and it is my fault and I will eventually make this right.
In brief, Figures does this on each enrollment in a course. It's a simple architecture, has worked really well. However, while this woks fine for small data sets or for beefy databases, it runs into performance issues with larger data sets and less beefy databases. Why? Because MySQL needs to do dynamic sorting on StudentModule on modified with filtering on the user id and course id. And edx-platform appears to not have indexed this.
We need a solution "like yesterday" and we want a portable solution for the community too. So that ruled out adding indexes to StudentModule. There may be a way to speed up the queries with raw SQL, but we'd prefer to keep with vanilla Django queryset API.
The solution we implement
We are implementing a second workflow that can be enabled by setting a flag in Figures ENV_TOKEN settings. This flag allows existing deployments to continue to use the current workflow while we deploy and validate correctness and performance of the new workflow in a production environment with the same Figures release. After we've validated it, we will then make the new workflow the default, deprecating the original, then remove the original workflow from Figures codebase. So, what's the new workflow?
The new workflow, aka "Enrollment Data Next"
First, for each course we process, we query for StudentModule records created or modified on the previous calendar day. This is the twenty four hour period starting at 00:00 UTC. Since we don't need to sort them to find the latest, this is a pretty fast query. We then get the distinct enrollments matching these StudentModule records and iterate over each of these CourseEnrollment records. This removes the need to query StudentMdule for each enrollment as we iterate over enrollments in the course to collect the previous day's metrics.
Second, we can then simply query CourseGradeFactory() to retrieve the section/subsection tree structure and collect the progress data
johnbaldwin
changed the title
Update enrollment data handling - "Enrollment Data Next"
[INFO]] Update enrollment data handling - "Enrollment Data Next"
Mar 7, 2022
johnbaldwin
changed the title
[INFO]] Update enrollment data handling - "Enrollment Data Next"
[INFO] Update enrollment data handling - "Enrollment Data Next"
Mar 7, 2022
The purpose of this issue is to briefly describe an identified performance issue with collecting per-enrollment data and a high level of Figures next iteration for handling per-enrollment data collection and management.
The problem
Figures daily metrics job gets slower as the number of
StudentModule
records gets bigger. In the daily job, Figures checksStudentModule
records for any changes since the last time we saved aLearnerCourseGradeMetrics
record for the enrollment. We check themodified
data of the latestStudentModule
for the enrollment against the latestdate_for
of the enrollment'sLearnerCourseGradeMetrics
records. If theStudentModule
is newer, we update Figures data store for the enrollment. Aside: I'm sorry, that model name is awful and it is my fault and I will eventually make this right.In brief, Figures does this on each enrollment in a course. It's a simple architecture, has worked really well. However, while this woks fine for small data sets or for beefy databases, it runs into performance issues with larger data sets and less beefy databases. Why? Because MySQL needs to do dynamic sorting on
StudentModule
onmodified
with filtering on the user id and course id. Andedx-platform
appears to not have indexed this.We need a solution "like yesterday" and we want a portable solution for the community too. So that ruled out adding indexes to
StudentModule
. There may be a way to speed up the queries with raw SQL, but we'd prefer to keep with vanilla Django queryset API.The solution we implement
We are implementing a second workflow that can be enabled by setting a flag in Figures
ENV_TOKEN
settings. This flag allows existing deployments to continue to use the current workflow while we deploy and validate correctness and performance of the new workflow in a production environment with the same Figures release. After we've validated it, we will then make the new workflow the default, deprecating the original, then remove the original workflow from Figures codebase. So, what's the new workflow?The new workflow, aka "Enrollment Data Next"
StudentModule
records created or modified on the previous calendar day. This is the twenty four hour period starting at 00:00 UTC. Since we don't need to sort them to find the latest, this is a pretty fast query. We then get the distinct enrollments matching theseStudentModule
records and iterate over each of theseCourseEnrollment
records. This removes the need to queryStudentMdule
for each enrollment as we iterate over enrollments in the course to collect the previous day's metrics.CourseGradeFactory()
to retrieve the section/subsection tree structure and collect the progress dataSee related PRs
The text was updated successfully, but these errors were encountered: