Allow including all thumbnails in import #10780

Merged
rtibbles merged 5 commits into learningequality:develop from the all-thumbnails branch
Jun 9, 2023

Conversation

dbnicholson
Contributor

Summary

Add an option to include all thumbnails in import/export data and wire it up to be used when importing via the API or CLI. The idea is that you have a subset of a channel's nodes that you actually want to import but you want all the thumbnails so that you can display the nodes that aren't available in a visually appealing way.

Note that I only wired up the import side, but I don't think there's any reason it can't be wired up on the export side.

References

#10770

Reviewer guidance

This can be tested with the CLI. Here's an example using the "How to get started with Kolibri" channel:

kolibri manage importchannel network 624e09bb5eeb4d20aa8de62e7b4778a0
kolibri manage importcontent --node_ids 8593e4eec90a4cc3a4f2967c0c7dfd03 --include-unrenderable-content --all-thumbnails --fail-on-error network 624e09bb5eeb4d20aa8de62e7b4778a0

After that you should find that all thumbnails in the channel are available locally. You can check with something like this from the shell:

from kolibri.core.content.models import File
assert File.objects.filter(contentnode__channel_id="624e09bb5eeb4d20aa8de62e7b4778a0", thumbnail=True, local_file__available=False).count() == 0

Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@github-actions github-actions bot added APP: Device Re: Device App (content import/export, facility-syncing, user permissions, etc.) DEV: backend Python, databases, networking, filesystem... SIZE: medium labels Jun 2, 2023
@dbnicholson
Contributor Author

Also, if you prefer, I can squash the last 4 commits together.

"fields": {
"available": true,
"extension": "png",
"file_size": null
Member

My only concern is with setting the file size of these to null. Our current fixture is terrible and already has some completely unrealistic data in it, so my thought is not to make it any more unrealistic by having files with null size.

I assume you maybe tried with a non-null file size and lots of tests broke?

Contributor Author

I just followed what was already in use. I can try setting them to 1 and see what happens. I did run into a few other hard-to-debug issues touching this fixture, though. It's actually how I ran into #8255, since it seems some of the tests depend not only on real network requests but also on the resources existing on Studio.

Member

Ah - in that case, we can leave as is, it's been a long time since I manually edited this fixture!

Contributor Author

I changed them to 1 and it just changed the test expectations.

@rtibbles
Member

rtibbles commented Jun 5, 2023

My only other question here is that it seems like you need to import a specific resource in order to import all the thumbnails? Is this the desired behaviour? It seems like just being able to import all the thumbnails at the same time the channel metadata is imported seems more desired? So would adding this to the import channel task/management command be better for your purposes?

This helps exercise more of the `get_import_export_data` functionality
that was previously doing nothing since there were no thumbnails in the
test data. A few tests needed to be updated to account for the added
files.
@dbnicholson
Contributor Author

My only other question here is that it seems like you need to import a specific resource in order to import all the thumbnails? Is this the desired behaviour? It seems like just being able to import all the thumbnails at the same time the channel metadata is imported seems more desired? So would adding this to the import channel task/management command be better for your purposes?

If you specify an empty node_ids list then you'll just get the thumbnails for the channel. I added a test for that and it worked as expected. It would be nice to expose that in the CLI in a more friendly way than --node_ids="" --all-thumbnails, but I don't have any great ideas.
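
To illustrate why `--node_ids="" --all-thumbnails` can mean "thumbnails only", here is a rough sketch of how a parser could distinguish an omitted `--node_ids` from an empty one. This is not Kolibri's actual argument parser: `parse_node_ids`, `build_parser`, and the `None` default are assumptions made for the example.

```python
import argparse


def parse_node_ids(value):
    # Hypothetical converter: an empty string yields an empty list,
    # otherwise the value is split on commas.
    return [part.strip() for part in value.split(",") if part.strip()]


def build_parser():
    # Hypothetical stand-in for the importcontent options discussed here.
    parser = argparse.ArgumentParser(prog="importcontent-sketch")
    parser.add_argument("--node_ids", type=parse_node_ids, default=None)
    parser.add_argument("--all-thumbnails", action="store_true", default=False)
    return parser


# The shell passes --node_ids="" to the program as the single token "--node_ids=".
thumbnails_only = build_parser().parse_args(["--node_ids=", "--all-thumbnails"])
# With no options at all, node_ids stays None, which this sketch reads as "no filter".
no_filter = build_parser().parse_args([])
```

In this sketch an empty list and `None` are deliberately different values, which is what lets the empty list request no content nodes while `--all-thumbnails` still pulls in the thumbnails.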

Add an `all_thumbnails` keyword argument to `get_import_export_data` to
request that all thumbnails in a channel should be included and not just
those associated with the desired nodes and their associated topics.
This can be useful for the frontend to display content that is available
to download in a richer way. The option defaults to `False` to preserve
the existing behavior.
The CLI option is `--all-thumbnails` and defaults to `False`.
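
A minimal sketch of the selection rule described in this commit message, assuming a simplified in-memory model rather than Kolibri's actual `get_import_export_data` implementation. `FileRecord` and `select_checksums` are hypothetical names, treating `node_ids=None` as "all nodes" is an assumption of the sketch, and the existing topic thumbnail handling is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    # Hypothetical, flattened stand-in for a file attached to a content node.
    node_id: str
    checksum: str
    thumbnail: bool = False


def select_checksums(files, node_ids=None, all_thumbnails=False):
    """Return the sorted checksums that would be transferred."""
    wanted = None if node_ids is None else set(node_ids)
    selected = set()
    for record in files:
        in_requested_node = wanted is None or record.node_id in wanted
        # With all_thumbnails set, thumbnails are kept even for nodes
        # that were not requested.
        if in_requested_node or (all_thumbnails and record.thumbnail):
            selected.add(record.checksum)
    return sorted(selected)


CHANNEL_FILES = [
    FileRecord("node-a", "video-a"),
    FileRecord("node-a", "thumb-a", thumbnail=True),
    FileRecord("node-b", "doc-b"),
    FileRecord("node-b", "thumb-b", thumbnail=True),
]

default_selection = select_checksums(CHANNEL_FILES, node_ids=["node-a"])
with_thumbnails = select_checksums(
    CHANNEL_FILES, node_ids=["node-a"], all_thumbnails=True
)
thumbnails_only = select_checksums(CHANNEL_FILES, node_ids=[], all_thumbnails=True)
```

Here `with_thumbnails` adds `thumb-b` to the default selection without adding `doc-b`, and `thumbnails_only` contains just the two thumbnails.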
@dbnicholson
Contributor Author

My only other question here is that it seems like you need to import a specific resource in order to import all the thumbnails? Is this the desired behaviour? It seems like just being able to import all the thumbnails at the same time the channel metadata is imported seems more desired? So would adding this to the import channel task/management command be better for your purposes?

Re-reading what you said, I see what you mean now. It's more a property of the channel than of the channel content. I still feel like this is the better way to go in the short term, as the channel import code is currently very simple and adding some content import on the side would make it more complicated.

What I think would be better long term is if the thumbnails weren't separately downloaded files but all inlined in the database like the channel thumbnail. To maintain the deduplication, I think a separate Thumbnail model with references from ContentNode would suffice. That would bloat the channel database but it would drop the need for handling the thumbnails specially (both all_thumbnails and the existing topic_thumbnails options). It would also have the nice property that any File referencing a ContentNode would be purely content. But that all seems like it would be a fairly big project since the channel databases on studio need to be backwards compatible.
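
As a rough illustration of the idea above, here is a sketch of inlined, deduplicated thumbnails using an in-memory SQLite database. The table layout, the column names, and the use of a SHA-256 checksum as the key are all assumptions made for the example; they are not Kolibri's schema or a proposed migration.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables: one row per distinct image, referenced by nodes.
    CREATE TABLE thumbnail (
        checksum TEXT PRIMARY KEY,
        data BLOB NOT NULL
    );
    CREATE TABLE contentnode (
        id TEXT PRIMARY KEY,
        title TEXT NOT NULL,
        thumbnail_checksum TEXT REFERENCES thumbnail(checksum)
    );
    """
)


def add_node(connection, node_id, title, image_bytes):
    """Insert a node, storing its thumbnail bytes only once per checksum."""
    checksum = hashlib.sha256(image_bytes).hexdigest()
    # INSERT OR IGNORE keeps the first copy when two nodes share an image.
    connection.execute(
        "INSERT OR IGNORE INTO thumbnail (checksum, data) VALUES (?, ?)",
        (checksum, image_bytes),
    )
    connection.execute(
        "INSERT INTO contentnode (id, title, thumbnail_checksum) VALUES (?, ?, ?)",
        (node_id, title, checksum),
    )


add_node(conn, "n1", "Lesson 1", b"shared-image-bytes")
add_node(conn, "n2", "Lesson 2", b"shared-image-bytes")
add_node(conn, "n3", "Lesson 3", b"other-image-bytes")

thumbnail_rows = conn.execute("SELECT COUNT(*) FROM thumbnail").fetchone()[0]
node_rows = conn.execute("SELECT COUNT(*) FROM contentnode").fetchone()[0]
```

Three nodes end up referencing two stored images, which is the deduplication property the separate model is meant to keep. The image bytes still live inside the database file, which is where the bloat concern comes from.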

@rtibbles
Member

rtibbles commented Jun 9, 2023

What I think would be better long term is if the thumbnails weren't separately downloaded files but all inlined in the database like the channel thumbnail. To maintain the deduplication, I think a separate Thumbnail model with references from ContentNode would suffice. That would bloat the channel database but it would drop the need for handling the thumbnails specially (both all_thumbnails and the existing topic_thumbnails options). It would also have the nice property that any File referencing a ContentNode would be purely content. But that all seems like it would be a fairly big project since the channel databases on studio need to be backwards compatible.

We do this currently for the channel thumbnails specifically - so it would not be a huge technical challenge - however, as you say, the backwards compatibility would be a big issue, and I do worry particularly about the channel database bloat. Given that a large channel can already have ~100MB of metadata in its database, loading in several thousand image files seems like it would be a big change.

As an example, the thumbnail images for Khan Academy English total about 600MB, which would make the total channel database size now ~750MB. Given we've already had significant issues raised from the size of the current channel databases, I am not inclined to think this is a good change.

I think there may be a place for embedding thumbnails alongside metadata, but only in the context of more granular metadata preview and import as we are pushing on with 0.16.

@rtibbles rtibbles merged commit 0e2dc3a into learningequality:develop Jun 9, 2023
@dbnicholson dbnicholson deleted the all-thumbnails branch June 9, 2023 21:50
@dbnicholson
Contributor Author

We do this currently for the channel thumbnails specifically - so it would not be a huge technical challenge - however, as you say, the backwards compatibility would be a big issue, and I do worry particularly about the channel database bloat. Given that a large channel can already have ~100MB of metadata in its database, loading in several thousand image files seems like it would be a big change.

I did some research on this the other day over here. Khan Academy has by far the most thumbnails. Most of the channels we're looking at are under 20 MB of total thumbnails, though. Repeating that data here:

Total 1.7 GB (1844833202)
c9d7f950ab6b5a1199e3d6c10d7f0103 (Khan Academy (English - US curriculum)) 1.1 GB (1186440792)
7aca54975a2c415c888d5fe73e0e8163 (हिन्दी) 166.5 MB (174574651)
59b8deeb90f544da923187e77c8d3820 (wikiHow) 88.1 MB (92409113)
914fee213ee146de869016c287116b23 (Chapter Books) 55.2 MB (57849018)
000409f81dbe5d1ba67101cb9fed4530 (Touchable Earth (en)) 50.4 MB (52894914)
bbb4ea407a3c450cb18cbaa76f2d75cd (CSpathshala (English)) 47.5 MB (49830241)
08897e003ea9489eb3d86fc94ba08c21 (Українською) 22.6 MB (23665950)
74f36493bb475b62935fa8705ed59fed (Thoughtful Learning) 20.8 MB (21826123)
f061fce103ff5d4e9b8433e67802e666 (Arts & Crafts) 20.3 MB (21326248)
79cd09863eed51e98576c35ede6f9c9d (Cooking) 16.0 MB (16797114)
fc47aee82e0153e2a30197d3fdee1128 (Open Stax) 15.4 MB (16113723)
2f95235c3709511fa12d007f31ed6a7b (STEAM) 9.3 MB (9803758)
efcc464be5a85ba5a58d1636b00313fc (Gardening) 9.1 MB (9556010)
f5f6729f95b55753badeaa066fa6e986 (Healthy Body) 7.6 MB (7921762)
e9d0d54d209344849e9bed0aa8c222ad (Sikana DIY) 7.4 MB (7737800)
3fcffebc58d15175b948b140434ef6e6 (Sports) 7.2 MB (7531679)
0418cc231e9c5513af0fff9f227f7172 (Free English with Hello Channel) 7.0 MB (7367609)
97111903de564de49483a9705d41a8ac (Career Girls) 6.1 MB (6359663)
ee52db4a62a94e9683599af8782f2d03 (The SciGirls Collection (en español)) 5.5 MB (5807639)
1b1fc9bd453a4c52bb5628d9ae804ede (The SciGirls Collection) 5.5 MB (5782572)
92e96efc082e5c62b0aac3847bdcdb33 (Staff Playlist) 4.7 MB (4940529)
e11462f71c6f5472b113311c69071b05 (Dance) 4.7 MB (4934302)
197934f144305350b5820c7c4dd8e194 (PhET Interactive Simulations (English)) 4.3 MB (4508692)
1520f018610256549c98ca0140cceebe (Virtual Field Trips) 4.0 MB (4198784)
359e048230974c8f80db1a95dc80d544 (EiE Familias) 3.9 MB (4092851)
9c33eb395508447d96c96682cb18c57a (Techbridge Girls @ Home) 3.6 MB (3802707)
f1ada7abc4194ff48a958337a31972c7 (EiE Families) 3.6 MB (3749048)
bcc6e12a0ddf4a17a8b600c6b880e3ed (Common Sense Student Resources) 3.3 MB (3499386)
2091ca47ff544c96b4ae02b3a92346e1 (TED-Ed) 3.1 MB (3298810)
bf0260ed911f44cda27a263db93a8512 (49ers EDU Digital Playbook) 2.6 MB (2697563)
4968191fba07548c9592fc174a70b5d6 (Beauty) 2.5 MB (2610982)
57e23812e0dc562581958e39acedd717 (Games & Gaming) 2.5 MB (2573844)
e409b964366a59219c148f2aaa741f43 (Blockly Games) 2.2 MB (2260272)
4e413158eac55422a5343af9fcfa8d59 (Healthy Mind) 2.1 MB (2162902)
2b43973f53f1538bad5ece63ad847606 (Financial Literacy) 2.0 MB (2143450)
3160899a73564d8a8467284d9219b91c (Terminal Two) 2.0 MB (2124581)
057f871caa405ec29d62ba0523c193d7 (Music) 2.0 MB (2072904)
bf36d8e7e1ee56b194fe52cafbfd9db3 (Fashion) 1.8 MB (1863063)
a8e6591f1afa426d859318a0a29d1237 (SAMHSA) 1.5 MB (1587918)
eb4373b5da054c07879d0c969dc1976a (Virtual Science Teachers) 1.2 MB (1281591)
b40491d1ef8b5506b8c6ae861372e9de (Jewelry Making) 1.1 MB (1191929)
79a50be66bad5eb686c42617c914fd45 (Careers) 908.4 KB (930183)
85b42a40745f4e2392ed62e72d4dad6e (OceanX) 616.0 KB (630786)
f62db29be20453c4a267132e93a9e602 (Wikipedia) 77.9 KB (79746)

It's definitely a non-trivial amount of growth but not that excessive for most channels. Anyways, some other time. I'm pretty sure this should fill the gap for now.
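
For reading the list above, here is a small sketch of how the human-readable sizes relate to the raw byte counts. It assumes binary units (1 KB = 1024 bytes), which matches the figures shown. `format_size` is a hypothetical helper, not the script that produced the list, and only a few rows are repeated here.

```python
def format_size(num_bytes):
    """Format a byte count with binary units and one decimal place."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        # Stop at the first unit that fits, or at GB as the largest unit.
        if size < 1024 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1024


# A few byte counts copied from the list above.
TOTAL_BYTES = 1844833202
SAMPLE_SIZES = {
    "Khan Academy (English - US curriculum)": 1186440792,
    "wikiHow": 92409113,
    "Cooking": 16797114,
    "Careers": 930183,
    "Wikipedia": 79746,
}

formatted = {name: format_size(size) for name, size in SAMPLE_SIZES.items()}

# Share of the listed total taken up by the largest channel.
khan_share = SAMPLE_SIZES["Khan Academy (English - US curriculum)"] / TOTAL_BYTES
```

By these figures, the Khan Academy channel alone accounts for well over half of the listed total.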

Labels
APP: Device Re: Device App (content import/export, facility-syncing, user permissions, etc.) DEV: backend Python, databases, networking, filesystem... SIZE: medium

2 participants