
Content data model should use File Storage. #29

Closed
2 tasks done
ormsbee opened this issue Feb 10, 2023 · 6 comments

ormsbee commented Feb 10, 2023

All files are currently stored as a BinaryField. That's okay for now, but it brings scalability problems over the long term:

  • On AWS, S3 storage is significantly cheaper than RDS.
  • Reading large blobs from the database will put unnecessary load on it.

On the other hand, there are conveniences with having this data in the database directly:

  • Imports will obey transactional semantics, so we don't have to worry about failed partial imports leaving data behind.
  • Database access is generally much more predictable in terms of latency and throughput.
  • Database access will be much faster in general, particularly when reading across multiple rows.
  • The "authenticated download" use case is more straightforward.

We can probably implement this via an optional FileField on Content, preserving the basic metadata in that model (mime_type, size, hash), and storing the data as hash-named files in a directory that represents a LearningPackage (using the UUID).
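A minimal sketch of what the hash-named file layout could look like, assuming a URL-safe, unpadded base64 SHA-1 digest (to match the example hash that appears later in this thread) and a hypothetical `content/{learning_package_uuid}/{hash}` directory scheme; none of these names are from the actual models:

```python
import base64
import hashlib
from uuid import UUID


def create_hash_digest(data: bytes) -> str:
    """URL-safe, unpadded base64 SHA-1 digest of the content bytes.

    Produces 27-character names like "odpzXni_YLt76s7e-loBTWq5LSQ".
    """
    return base64.urlsafe_b64encode(hashlib.sha1(data).digest()).rstrip(b"=").decode("ascii")


def content_file_path(learning_package_uuid: UUID, data: bytes) -> str:
    """Hypothetical upload path for a Content FileField: one directory per
    LearningPackage (keyed by its UUID), files named by hash."""
    return f"content/{learning_package_uuid}/{create_hash_digest(data)}"
```

With this layout, identical bytes uploaded twice map to the same path, so deduplication within a LearningPackage falls out for free.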

The big question is: how do we determine whether something is offloaded or not? We can decide purely by size, or we can flag it by type (with some size limits). For instance, we might say that image files are always put onto the file system, while the OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit. It might actually be best to leave this logic completely up to apps, since usage of the same type of file (JSON, XML) might vary from one application to another.

PRs related to this:

@bradenmacdonald (Contributor) commented:

For discussion purposes: What do you think about having a high-level API for reading/writing these blobs which transparently chooses FileField or BinaryField for you automatically? (Can we use asyncio APIs yet?) Many use cases may not need to distinguish between these.

As for OLX:

  • "OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit" - I think this would be fine, but it would mean there is a limit on how much inline HTML a given html or problem block can have. We can either be happy with that, or allow blocks to include an external .html file that contains the details.
  • Or we could treat it the same as everything else, with a strict limit on what can be in BinaryField vs. FileField; large OLX would be saved to S3, not the DB. Since there is some other table that extracts and stores the metadata from the OLX at publish time, this would only affect latency when the entire OLX is needed for authoring/rendering of that exact component, and would be no different in performance than the case of allowing the block to reference an external HTML file.
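For illustration only, the transparent read API suggested here might look like the following; the `Content` stand-in and the injected `open_file` callable are made up for this sketch, not taken from the real models:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Content:
    """Stand-in for the real model: exactly one of these would be populated."""
    data: Optional[bytes] = None       # BinaryField contents (inline in the DB)
    file_path: Optional[str] = None    # FileField path (e.g. an S3 key)


def read_content(content: Content, open_file: Callable[[str], bytes]) -> bytes:
    """Return the raw bytes regardless of where they live.

    `open_file` is whatever fetches file-backed data (e.g. a storage
    backend's open/read); it is injected so the sketch stays self-contained.
    Callers never see which field was used.
    """
    if content.data is not None:
        return content.data
    return open_file(content.file_path)
```

The appeal is that most read paths stay oblivious to the storage decision; the counterargument (raised in the next comment) is that hiding a 10-100X latency difference may be a disservice to callers.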

@ormsbee ormsbee changed the title Content should use File Storage Content data model should use File Storage. Feb 10, 2023

ormsbee commented Feb 11, 2023

> For discussion purposes: What do you think about having a high-level API for reading/writing these blobs which transparently chooses FileField or BinaryField for you automatically?

I think the access characteristics are different enough (10-100X latency) that the apps should be made to understand the tradeoffs and make a conscious choice. So in this case, whatever XBlock-related apps/runtime we have that use these backing models should hold the logic for what goes where.

> (Can we use asyncio APIs yet?)

We can probably make the library use it, but in practice people are going to use it synchronously anyway...?

> OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit

> this would be fine I think but would mean there is a limit on how much inline HTML a given html or problem block can have. We can either be happy with that or allow blocks to include an external .html file that contains the details.

Yeah, I think we just set a limit, say 100K to start, and see if it breaks things. The largest thing I see in my current data set is about 50K, for a particularly crazy ProblemBlock. I have definitely seen HTML ones that are much larger, but that was because they copy-pasted from a Word doc that included large base64-encoded images inline on the page.

My stab at a guideline would be that if a Content is going to be used by server-side code during the rendering of the page, it should be stored in the BinaryField in the database. So that would include HTML files that are referenced by the OLX of the HTMLBlock. If the Content is only going to be delivered to the client (images, videos, subtitles, etc.), it should go in file storage with a pointer in the Content.

Some advantages of that approach:

  1. It would mean that we don't have to worry about serving static assets to browsers directly out of the database and the various operational issues that can cause. That will also simplify the overall asset-serving story since there'd be only one code path for it.
  2. We'd have better low-latency guarantees for things like Unit rendering or even just basic XBlock rendering.
  3. We'd save money since the bulk of things would be moved to S3.

Disadvantages:

  1. It will likely break some content to enforce these limits.
  2. Imports will be slower, particularly when there are many assets, and/or the code will have to get more convoluted in order to do S3 uploads in parallel batches.
  3. There will be weird edge cases that we need to figure out what to do with, like python_lib.zip.

Random thoughts:

  • There just might be a use case for having data on both sides, e.g. subtitles that are searchable via the database but also served directly to clients.
  • Maybe it should be a TextField instead of a BinaryField, if all the things being stored in it are text anyway, with binary data going to the file store. I'm not sure how I feel about this one, but it would make the data more easily browsable.
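The guideline above (server-rendered data in the database, client-delivered media in file storage, with a size cap on the inline side) could be sketched as a tiny policy function. The function name and the 100K cap are illustrative values from this discussion, not a settled API:

```python
# Hypothetical storage policy, per the guideline above. Apps would decide
# whether a given Content is read by server-side code during rendering.
MAX_INLINE_SIZE = 100_000  # the ~100K starting cap floated above


def use_file_storage(used_in_server_rendering: bool, size: int) -> bool:
    """True if this Content should be offloaded to file storage (e.g. S3).

    Client-delivered media (images, videos, subtitles) always goes to file
    storage; server-rendered content (OLX, inline HTML) stays in the
    database unless it exceeds the size cap.
    """
    if not used_in_server_rendering:
        return True
    return size > MAX_INLINE_SIZE
```

Keeping the decision keyed on "is this read during server-side rendering" rather than on mime type matches the point earlier in the thread that the same file type (JSON, XML) may be used differently by different apps.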

@ormsbee ormsbee moved this from To do to In Progress in Modular Learning - Tagging and Libraries Feb 12, 2023

ormsbee commented Feb 12, 2023

Some more uncollected thoughts because I started hacking on this over the weekend...

The File Serving Solution I Wish I Had

Actually building one of these is hard, but what I wish I had was a server that I could instruct to serve this hash-identified file at this URL to this client. I would give each ContentVersion its own virtual space of sorts, so that the URLs could look like https://{learning_package_identifier}.{safe-domain}/components/{component_identifier}/{version_num}/{specified_path}

So an example might be:

https://sandbox_openedx_101.openedx.io/components/welcome_video/v3/tracks/en-US.vtt

But it would really know to map that to the file named after its hash (e.g. odpzXni_YLt76s7e-loBTWq5LSQ) from an S3 bucket.

The nice thing about having versioning done in the URL at the ComponentVersion level is that relative links will continue to work. If there's a piece of HTML that's referencing tracks/en-US.vtt, we don't have to worry that the file is really odpzXni_YLt76s7e-loBTWq5LSQ in S3.
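A toy sketch of the mapping that wished-for server would perform, from a versioned, human-readable request path to the hash-named object in a bucket. The dict stands in for whatever ComponentVersion tables would really hold the mapping; all names are for illustration:

```python
# Toy resolver: (component, version, path) -> hash-named object.
# In a real system this index would come from the ComponentVersion data
# model; a literal dict keeps the sketch self-contained.
FILE_INDEX = {
    ("welcome_video", 3, "tracks/en-US.vtt"): "odpzXni_YLt76s7e-loBTWq5LSQ",
}


def resolve_object_key(learning_package: str, component: str,
                       version: int, path: str) -> str:
    """Map a request like /components/welcome_video/v3/tracks/en-US.vtt
    to the hash-named S3 object it should actually serve."""
    file_hash = FILE_INDEX[(component, version, path)]
    return f"{learning_package}/{file_hash}"
```

Because the lookup is keyed on the version, two ComponentVersions whose files share a hash resolve to the same stored object, while the client-facing path stays stable and relative links keep working.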

On the downside, this would mean that browsers wouldn't properly cache things across versions: when the video component is updated, it's very likely that this subtitle file didn't change, but it would look like a new file to the browser. I think that's acceptable, because content doesn't change that often relative to the people consuming it. It's the per-XBlock JS and CSS code files that are important to keep properly cached for client-side performance, and those would be accessed elsewhere. This would only be for user-uploaded media files.

Since I'm making wishes, it would also have support for authenticated requests, range-requests, great caching, and a pony.


ormsbee commented Feb 12, 2023

... okay, this is a side thing, and probably completely impractical, but I'm totally going to try to hack that in starlette now.


ormsbee commented Feb 13, 2023

Said hack is:


ormsbee commented Feb 16, 2023

Updated approach:

@ormsbee ormsbee added arch Architecture data model Anything relating to the relational models or more abstract "model" concepts around Learning Core. labels Apr 2, 2023
@ormsbee ormsbee closed this as completed May 9, 2023
@ormsbee ormsbee moved this from In Progress to Done in Modular Learning - Tagging and Libraries May 12, 2023