Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data models for Component-level Versioning Only #28

Merged
merged 33 commits into from
Feb 10, 2023

Conversation

ormsbee
Copy link
Contributor

@ormsbee ormsbee commented Feb 6, 2023

High Level Overview

This is a pared down version of the last experimental pull request, with Unit-level work and top level versioning of the entire LearningContext stripped out. A lot of other changes as well.

The load_components management command is functional and should pull in the XBlock contents for HTML, Problem, and Video types. However it doesn't yet pull in all the references files (e.g. HTML files, subtitle files, etc.) yet. It will pull in regex matched static asset references from HTML and Capa, but not things where it has to understand conventions based on XML attribute values (like how some video subtitles are specified).

Apps

There are two apps at work here:

  • openedx_learning.core.publishing has just enough to tie content together (LearningPackage) and to track when things get published (PublishLogEntry).
  • openedx_learning.core.components has the models needed to track Components, their versioning, the raw data they address, and the publishing status of those models. It depends on the publishing app.

Versioning

Versioning happens at the level of each Component (e.g. an individual block, like a problem or video) via the ComponentVersion model.

The Content model just holds raw bytes with some very minimal metadata (size, MIME type). Content has a M:M relationship with ComponentVersion via the custom ComponentVersionContent model (which provides an identifier that we use like a filename). That allows us to have the same Content associated with multiple ContentVersions if we want, making identifier renames cheap. The load_components management command will associate static assets with the appropriate ContentVersions.

The latest published version of a component is tracked in PublishedComponent, with a historical log available in ComponentPublishLogEntry which is M:1 with the publish app's PublishLogEntry. This allow us to model "a bunch of ComponentVersions were published at the same time".

Overall, the shift to versioning at a more granular level has made the modeling simpler, faster, and with less per-version overhead. It also makes diffs easier. That comes at a cost of making "point-in-time" snapshots more difficult to construct (you'd have to look at ComponentPublishLogEntry (joined with PublishLogEntry) to look for the most recent component_version before a certain time for each component. Certainly possible, but more expensive.

Status

  • there is no data.py defined
  • there is no api.py layer at the moment, only direct model access.

@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 6, 2023

FYI @bradenmacdonald

@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 6, 2023

FYI @feanil, @kdmccormick

David Ormsbee added 4 commits February 6, 2023 18:50
These were placeholders, and they're cluttering the PR at the moment.
There might be some useful scraps to pull from them, but they exist in
another branch we can look at later.
Copy link
Member

@kdmccormick kdmccormick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ormsbee @feanil You guys were right, reading the description made this all very clear.

I like that you've scoped this down to just Content and Component for now. I think that models here are extremely reasonable and will be a solid base as we work our way back up the hierarchy.

I had been resisting Component-level versioning for a while, but in retrospect trying to do Segment- or Unit-level versioning just made things more complex. I'm not sure why I felt so strongly--maybe because allowing library components to be versioned separately sounds like a UX nightmare? Even so, the Library Authoring system could mandate that library components are versioned together, without that restriction having to appear at this level.

Just some nits/thoughts below but nothing blocking.

Comment on lines +108 to +112
contents = models.ManyToManyField(
"Content",
through="ComponentVersionContent",
related_name="component_versions",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should every ComponentVersion have at least one contents?

If so, consider noting that with a comment (or a db constraint, but I assume that's not possible).

Otherwise, consider setting blank=True here, as that would let us construct a content-less CV in django admin.

Suggested change
contents = models.ManyToManyField(
"Content",
through="ComponentVersionContent",
related_name="component_versions",
)
contents = models.ManyToManyField(
"Content",
through="ComponentVersionContent",
related_name="component_versions",
blank=True,
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh... I guess it's possible that we'd make one without Content (like just a title). I'd rather make the admin read-only altogether though, because ComponentVersion creation should include adding Publish entries rather than just direct model manipulation (I guess I really should do the api.py pieces of this soon).

openedx_learning/core/components/models.py Show resolved Hide resolved
openedx_learning/core/components/models.py Outdated Show resolved Hide resolved
openedx_learning/core/components/models.py Outdated Show resolved Hide resolved
openedx_learning/core/components/models.py Outdated Show resolved Hide resolved
Comment on lines +48 to +52
# type is a way to help sub-divide namespace if that's convenient. This
# field cannot be null, but it can be blank if it's not necessary. For an
# XBlock, type corresponds to tag, e.g. "video". It's also the block_type in
# the UsageKey.
type = models.CharField(max_length=100, null=False, blank=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't love type as a name, but trying to think about this field outside the context of XBlocks is so abstract and hypothetical that I have no better name to suggest.

Copy link
Contributor

@bradenmacdonald bradenmacdonald Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we expect that the only actual usage of this field will be for subdividing the ID namespace for XBlocks to match the historical behavior, then we could conceivably just call this field block_type. I don't think it should be used outside of XBlocks, as having two levels of namespacing is not ideal.

Personally, I would probably remove this field and combine the block_type and block_id into the identifier field with a special character string separator, and leave it up to the "identifier <=> usage key" logic to sort it out. But I'm completely fine with this approach here, so not suggesting we change this.

@bradenmacdonald
Copy link
Contributor

bradenmacdonald commented Feb 8, 2023

Hah, I was just thinking about this recently and I ended up concluding to myself that component-level versioning seemed like the way to go. Glad to see you trying it out and to hear it seems to be proving out so far from what you've tried.

Ran out of time to review today, will check out more tomorrow or Fri :)


Unit
This is a list of one or more Segments that is displayed to the user on one page. A Unit may be stitched together using content that comes from multiple sources, such as content libraries. Units do not have to be strictly instructional content, as things like upgrade offers and error messages may also be injected.
A Unit is an ordered list of one or more Components. A Unit is addressable in the browser at some URL, and is displayed together. A common use case might be to display some introductory Text, a Video, and some followup Problem (all separate Components). An individual Component in a Unit may or may not make sense when taken outside of that Unit–e.g. a Video may be reusable elsewhere, but the Problem referencing the video might not be.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Unit is addressable in the browser at some URL

Does this imply that a Component is not accessible at some URL in general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was a leftover from when Segments were not going to necessarily be separately addressable.

I'm honestly not sure how I feel about Components being separately rendered vs. having implicit Units with single Components in them as the thing being rendered. I like the idea of having one type of thing being rendered at that level, and the idea that some Components aren't meant to be rendered in isolation. But that's super-hand-wavy at this point, so I'll just remove this wording and kick that decision down the road a little.

@ormsbee ormsbee self-assigned this Feb 9, 2023
@ormsbee
Copy link
Contributor Author

ormsbee commented Feb 10, 2023

@kdmccormick, @feanil, @bradenmacdonald: I think I addressed all the comments. Also added a couple small conveniences to the Django admin, a sprinkling of indexes, and some more fleshed out comments.

Copy link
Contributor

@bradenmacdonald bradenmacdonald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ormsbee This is excellent and I really like the direction. I don't have any major feedback, just a couple little questions. Nothing blocking. Looking forward to building on this!

about scalability issues. For instance, video files should not be stored
here directly. There is a 10 MB limit set for the moment, to accomodate
things like PDF files and images, but the itention is for the vast majority
of rows to be much smaller than that.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should data files over 10 MB be stored?

I would almost prefer a much lower limit like 250 kB and an integrated way to store and reference larger Content instances as objects on S3. I suspect you may already have implemented or planned something that but I forgot or don't see it yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I've punted on this issue. I came to the same conclusion that you did: S3 objects (probably being django-storages) + database references, probably through an optional FileField on Content.

Captured this in #29

Architecture Guidelines
-----------------------

* We're keeping nearly unlimited history, so per-version metadata (i.e. the space/time cost of making a new version) must be kept low.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't suppose there's any way to tell which versions are in use or not with this model, because any application may or may not have foreign keys to specific ComponentVersions?

(I'm wondering if we can auto-prune unused versions.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can using Django ORM introspection or a mixin, but it wouldn't be perfect. I'm honestly hoping the per-version overhead is low enough now where it's okay not to prune, because that just makes content lifecycle management so much easier.

@ormsbee ormsbee merged commit 867ea8d into openedx:main Feb 10, 2023
@ormsbee ormsbee deleted the denver2 branch February 10, 2023 21:18
@ormsbee ormsbee mentioned this pull request Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

4 participants