Import from other channels search optimized #3399

Merged · 32 commits merged into learningequality:unstable on Oct 7, 2022

Conversation

@vkWeb (Member) commented Jun 2, 2022

Summary

This PR introduces full-text search backed by GIN indexes, making the "import from other channels" and content library searches significantly faster.

The initial investigation was done with @ozer550.
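
At a high level, the approach is to store a precomputed tsvector on each node and index it with GIN. A minimal sketch of what that looks like in Django (the field, model, and index names here are illustrative assumptions, not necessarily the PR's exact code):

  from django.contrib.postgres.indexes import GinIndex
  from django.contrib.postgres.search import SearchVectorField
  from django.db import models

  class ContentNode(models.Model):
      title = models.CharField(max_length=400)
      description = models.TextField(blank=True)
      # Precomputed tsvector, refreshed on publish; the GIN index makes
      # full-text matches fast even on very large tables.
      search_vector = SearchVectorField(null=True)

      class Meta:
          indexes = [GinIndex(fields=["search_vector"], name="node_search_vector_gin_idx")]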

Reviewer guidance

Here are some manual tests that MUST be performed to verify everything works as expected:

  • After a channel is published, it should be searchable via the "import from other channels" search.
  • After a channel is published, we should be able to search on updated fields via the "import from other channels" search. E.g., if a node's updated title is "New Title", then searching for "new" should return that node.
  • The "import from other channels" search only returns published channels.
  • When searching the content library, searching for a channel by name, description, or UUID should return the channel.
  • When searching the content library, searching for nodes inside a channel should return the channel as well. E.g., if a channel has a node titled "Math exercise", then searching "math" should return that channel.
  • Administration search should not be affected and should work as it did before this PR, i.e. it should still be able to search unpublished channels.

References

Closes #3186
Closes #2934


Contributor's Checklist

PR process:

  • If this is an important user-facing change, the CHANGELOG label has been added to this PR or the related issue. Note: items with this label will be added to the CHANGELOG at a later time
  • If this includes an internal dependency change, a link to the diff is provided
  • The docs label has been added if this introduces a change that needs to be updated in the user docs
  • If any Python requirements have changed, the updated requirements.txt files are also included in this PR
  • Opportunities for using Google Analytics here are noted
  • Migrations are safe for a large db

Studio-specific:

  • All user-facing strings are translated properly
  • The notranslate class has been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)
  • All UI components are LTR and RTL compliant
  • Views are organized into pages, components, and layouts directories as described in the docs
  • Users' storage used is recalculated properly on any changes to main tree files
  • If this uses user data in new ways that need to be factored into our Privacy Policy, it has been noted

Testing:

  • Code is clean and well-commented
  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Any new interactions have been added to the QA Sheet
  • Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

This section is for reviewers to fill out.

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@vkWeb vkWeb requested review from rtibbles and bjester June 2, 2022 11:49
@vkWeb (Member Author) commented Jul 3, 2022

Update:

  • location_ids will not be queried.
  • The initial queryset will be reduced to just the nodes accessible to the user.
  • Sorting will only be done on the final result set.
  • The channel_id annotation will be done only on the nodes accessible to the user.

We need to test our changes against our gcloud develop SQL instance to verify the perf improvement. Once that's done I'll push commits here and it'll be ready to merge. (A rough sketch of the queryset changes above follows below.)
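
The reductions listed above might look roughly like this as a queryset (a sketch only; accessible_tree_ids, search_term, and the ordering are assumptions, not the PR's code):

  from django.contrib.postgres.search import SearchQuery
  from django.db.models import OuterRef, Subquery

  # Start from only the nodes the user can access, not the whole table.
  accessible_nodes = ContentNode.objects.filter(tree_id__in=accessible_tree_ids)

  # Filter by the search term; location_ids are never queried.
  matches = accessible_nodes.filter(search_vector=SearchQuery(search_term))

  # Annotate channel_id only on the accessible, matching nodes.
  matches = matches.annotate(
      channel_id=Subquery(
          Channel.objects.filter(main_tree__tree_id=OuterRef("tree_id")).values("pk")[:1]
      )
  )

  # Sort only the final result set.
  results = matches.order_by("title")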

  thumbnail_checksum=Subquery(thumbnails.values("checksum")[:1]),
  thumbnail_extension=Subquery(
      thumbnails.values("file_format__extension")[:1]
  ),
  content_tags=NotNullMapArrayAgg("tags__tag_name"),
- original_channel_name=original_channel_name,
+ original_channel_name=channel_name,
Member:

This seems like a change that introduces a regression. The query producing this annotation has changed.

Member Author:

This indeed introduces a regression, and the regression is intended. I have removed the logic for producing unique content nodes, so copied nodes (nodes with the same content_id but in different channels) will also be present in the result set. So the channel name should be the name of the channel the content node is in right now.

We do not need to tell the user the name of the channel it was copied from; we just need to let them know the channel it is in right now.

Does this make sense @bjester? What do you think?

Member:

This changes the meaning of original_channel_id. I think if we need to remove this field for performance reasons, then we should remove it, rather than update it to something inaccurate. We should check with @jtamiace on how best to handle this in the user interface if this is a strong performance concern.

Member Author:

Right now, we should keep this on hold until we evaluate the performance of the new full-text search. I am hoping our search will be efficient enough to allow us to run de-duplication of content_ids, in which case we will be able to keep the original_channel_name field intact.

Member:

@vkWeb To clarify, what do you mean by "we should keep this on hold"?

Member Author:

I have updated the queryset to annotate the original channel name.

But I have not de-duplicated the query, because two content nodes with the same content_id can have entirely different content inside; we don't know which node the user wants to import, so we probably should not de-duplicate. E.g., suppose an exercise was imported from a published channel and we completely changed the questions inside it but kept the metadata as is. Upon search, both nodes will show up, so how do we decide which node to show and which to discard when they have completely different content?

Instead, in a future PR we should rank the results appropriately and display only the top 5000 results, maybe? That would be more helpful from the user's perspective in my opinion, sir.

What are your thoughts, sir?
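
For reference, the ranking idea could use Postgres full-text ranking through Django; a minimal sketch, assuming the stored search_vector field (the limit and ordering are illustrative):

  from django.contrib.postgres.search import SearchQuery, SearchRank
  from django.db.models import F

  query = SearchQuery("math")
  ranked = (
      ContentNode.objects.filter(search_vector=query)
      .annotate(rank=SearchRank(F("search_vector"), query))
      .order_by("-rank")[:5000]  # keep only the best-ranked matches
  )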

Member:

> But I have not de-duplicated the query because two contentnodes with the same content_id can have entirely different content inside, we don't know what node the user wants to import so probably we should not de-duplicate

The fact that this is true is a long-standing bug in Studio, most acutely for exercises, but also applicable to other resources: #1055

Member Author:

Woah, I thought it was a desired behaviour... 👁️ By the way, in what scenarios do we care whether two nodes have the exact same content inside? 🤔

Member:

The content_id is used to track progress on a specific resource in Kolibri, so it has quite an important role, and when the value is conflated across multiple resources it causes issues.

Member Author:

Oh, that's pretty important. Then we should fix that bug, I think: if the node gets modified, we should change its content_id.

@vkWeb vkWeb marked this pull request as draft July 24, 2022 07:35
@vkWeb vkWeb marked this pull request as ready for review August 14, 2022 13:39
@vkWeb (Member Author) commented Aug 14, 2022

@bjester @rtibbles this is ready for review.

The only thing that has been bugging me for the past few days is that running the set_tsvectors command can take a long time (several days, maybe?) on the develop or production DB. Until we update the tsvector field, our search will remain broken. Any thoughts on this? I've read on Stack Overflow that dropping the indexes, then updating the column, then adding the indexes back can speed up the update, because an UPDATE is a delete + insert and so causes heavy index churn.
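
The drop-update-recreate approach would look roughly like this with raw SQL through Django's connection (a sketch; the table, column, and index names are assumptions):

  from django.db import connection

  with connection.cursor() as cursor:
      # Drop the GIN index so the bulk UPDATE doesn't have to maintain it.
      cursor.execute("DROP INDEX IF EXISTS node_search_vector_gin_idx")
      # Backfill the stored tsvector column in one pass.
      cursor.execute(
          "UPDATE contentcuration_contentnode "
          "SET search_vector = to_tsvector('simple', "
          "coalesce(title, '') || ' ' || coalesce(description, ''))"
      )
      # Recreate the index once the data is in place.
      cursor.execute(
          "CREATE INDEX node_search_vector_gin_idx "
          "ON contentcuration_contentnode USING gin (search_vector)"
      )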

Once we have updated the tsvectors, we can implement the following based on query performance:

  • de-duplicate content_ids.
  • rank results.

@vkWeb (Member Author) commented Aug 14, 2022

A relative performance benchmark of unstable versus full-text search with just a GIN index. The current implementation adds an explicit column to store the tsvectors, so query performance will be much better still, almost 8–10x.

[benchmark screenshot]
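
The difference being measured is roughly this (a sketch; field names assumed): computing the vector per row at query time versus matching a stored, GIN-indexed column.

  from django.contrib.postgres.search import SearchQuery, SearchVector

  query = SearchQuery("math")

  # On the fly: the tsvector is built for every row at query time.
  on_the_fly = ContentNode.objects.annotate(
      vector=SearchVector("title", "description")
  ).filter(vector=query)

  # Stored column: the precomputed, GIN-indexed tsvector is matched directly.
  stored = ContentNode.objects.filter(search_vector=query)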

@vkWeb vkWeb marked this pull request as draft August 15, 2022 21:17
@vkWeb (Member Author) commented Aug 15, 2022

Converting to draft because Blaine and I discussed that I should do more research on creating a new table for storing tsvectors.

@rtibbles (Member) left a comment:

Left some questions on review. I also attempted to manually test, but got a permissions error when trying to migrate. Traceback below:

  Applying contentcuration.0141_contentnode_search_vector...Traceback (most recent call last):
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.InsufficientPrivilege: permission denied to create extension "pg_trgm"
HINT:  Must be superuser to create this extension.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/richard/github/studio/./contentcuration/manage.py", line 11, in <module>
    execute_from_command_line(sys.argv)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/base.py", line 89, in wrapped
    res = handle_func(*args, **kwargs)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/commands/migrate.py", line 244, in handle
    post_migrate_state = executor.migrate(
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/executor.py", line 117, in migrate
    state = self._migrate_all_forwards(state, plan, full_plan, fake=fake, fake_initial=fake_initial)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/executor.py", line 147, in _migrate_all_forwards
    state = self.apply_migration(state, migration, fake=fake, fake_initial=fake_initial)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/executor.py", line 227, in apply_migration
    state = migration.apply(state, schema_editor)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/migration.py", line 126, in apply
    operation.database_forwards(self.app_label, schema_editor, old_state, project_state)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/contrib/postgres/operations.py", line 25, in database_forwards
    schema_editor.execute(
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/base/schema.py", line 145, in execute
    cursor.execute(sql, params)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: permission denied to create extension "pg_trgm"
HINT:  Must be superuser to create this extension.
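
For anyone hitting this locally, the usual workaround (not discussed in this thread, and the credentials below are placeholders) is to create the extension once as a Postgres superuser so the migration can proceed:

  import psycopg2

  # Connect as a superuser; dbname/user/password here are placeholders.
  conn = psycopg2.connect(dbname="kolibri-studio", user="postgres", password="...")
  conn.autocommit = True
  with conn.cursor() as cursor:
      cursor.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
  conn.close()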

Resolved review threads:
  • contentcuration/contentcuration/models.py (3)
  • contentcuration/contentcuration/settings.py
  • contentcuration/contentcuration/viewsets/contentnode.py
  • contentcuration/search/tests/test_search.py
@rtibbles (Member) left a comment:

Some thoughts on how the tsvector updating at publish time could be made a bit simpler, and potentially reduce memory usage.

Resolved review threads:
  • contentcuration/contentcuration/utils/publish.py (2)
@vkWeb vkWeb requested review from bjester and rtibbles September 21, 2022 17:33
@vkWeb (Member Author) commented Sep 22, 2022

Converting this to a draft to pause reviews. I found a strange bug that sets tsvectors to NULL during updates on publish. I'll discuss this with sir @bjester on Slack and will reopen for final review very soon.

cc @rtibbles.

@vkWeb vkWeb marked this pull request as draft September 22, 2022 11:39
@vkWeb vkWeb marked this pull request as ready for review September 28, 2022 20:22
@vkWeb (Member Author) commented Sep 28, 2022

Richard sir and I ran a profiling test of the current implementation against his suggested method.

We found that the current implementation (first updating only the changed nodes, then bulk-creating new nodes in chunks) is around 25% faster and also more memory efficient. So we are safe to go ahead with the current implementation.

cc @bjester @rtibbles.

[profiling screenshot]
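
The update-then-create strategy described above might look roughly like this (a sketch; the changed flag, chunk size, and field names are assumptions):

  from django.contrib.postgres.search import SearchVector

  CHUNK_SIZE = 10000  # assumed

  # 1) Update tsvectors in place, touching only the changed nodes.
  ContentNode.objects.filter(changed=True).update(
      search_vector=SearchVector("title", "description")
  )

  # 2) Bulk-create brand-new rows in chunks to bound memory use.
  #    new_nodes: a list of unsaved ContentNode instances (assumed).
  ContentNode.objects.bulk_create(new_nodes, batch_size=CHUNK_SIZE)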

@vkWeb vkWeb requested a review from rtibbles September 28, 2022 20:48
@rtibbles (Member) left a comment:

Nothing blocking from me - quite happy to get this merged and iterate in unstable.

Resolved review thread: contentcuration/search/serializers.py
@bjester (Member) left a comment:

LGTM! Thanks for all of your hard work on this @vkWeb! 🎉

call_command(
    "set_contentnode_tsvectors",
    "--channel-id={}".format(channel_id),
    "--tree-id={}".format(channel["main_tree__tree_id"]),
    "--complete",
)
Member:

Any reason not to pass the command options as keyword args? It would be slightly cleaner. The only caveat is that it bypasses the argument parser, but for your purposes, I don't see that there would be a difference. https://docs.djangoproject.com/en/3.2/ref/django-admin/#django.core.management.call_command
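
For reference, the keyword-argument form being suggested would look something like this (the kwarg names are assumed to match the parser's dests):

  call_command(
      "set_contentnode_tsvectors",
      channel_id=channel_id,
      tree_id=channel["main_tree__tree_id"],
      complete=True,
  )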

Member Author:

The reason I chose this was to let the argument parser get invoked. It helped me manually test (and gave me confidence about) what would happen when this command is run from the command line.

@bjester (Member) commented Oct 7, 2022

One last thing, @vkWeb: could you add a follow-up issue regarding the content deduplication that was removed?

@bjester bjester merged commit b7470b7 into learningequality:unstable Oct 7, 2022
@vkWeb (Member Author) commented Oct 8, 2022

Thank you sir @bjester and @rtibbles for your constant support throughout ❤️ All these feedback-iteration loops made this PR what it is today! Seeing this merged gives me joy 😄

@vkWeb (Member Author) commented Oct 8, 2022

@bjester I've added a follow-up issue for contentnode deduplication: #3725.
