Import from other channels search optimized #3399

Merged · 32 commits merged into learningequality:unstable on Oct 7, 2022

Conversation

@vkWeb (Member) commented Jun 2, 2022

Summary

This PR introduces full-text search backed by GIN indexes, making the "import from other channels" and content library searches significantly faster.

The initial investigation was done with @ozer550.
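
At a high level, the approach is to store a precomputed tsvector on each node and index it with GIN. A minimal sketch of what that looks like in Django (the field, model, and index names here are illustrative assumptions, not necessarily the PR's exact code):

  from django.contrib.postgres.indexes import GinIndex
  from django.contrib.postgres.search import SearchVectorField
  from django.db import models

  class ContentNode(models.Model):
      title = models.CharField(max_length=400)
      description = models.TextField(blank=True)
      # Precomputed tsvector, refreshed on publish; the GIN index makes
      # full-text matches fast even on very large tables.
      search_vector = SearchVectorField(null=True)

      class Meta:
          indexes = [GinIndex(fields=["search_vector"], name="node_search_vector_gin_idx")]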

Reviewer guidance

Here are some manual tests that MUST be performed to verify everything works as expected:

  • After a channel is published, it should be searchable via the "import from other channels" search.
  • After a channel is published, we should be able to search on updated fields via the "import from other channels" search. E.g., if a node's updated title is "New Title", then searching for "new" should return that node.
  • The "import from other channels" search only returns published channels.
  • When searching the content library, searching for a channel by name, description, or UUID should return the channel.
  • When searching the content library, searching for nodes inside a channel should return the channel as well. E.g., if a channel has a node titled "Math exercise", then searching "math" should return that channel.
  • Administration search should not be affected and should work as it did before this PR, i.e. it should still be able to search unpublished channels.

References

Closes #3186
Closes #2934


Contributor's Checklist

PR process:

  • If this is an important user-facing change, the CHANGELOG label has been added to this PR or the related issue. Note: items with this label will be added to the CHANGELOG at a later time
  • If this includes an internal dependency change, a link to the diff is provided
  • The docs label has been added if this introduces a change that needs to be updated in the user docs
  • If any Python requirements have changed, the updated requirements.txt files are also included in this PR
  • Opportunities for using Google Analytics here are noted
  • Migrations are safe for a large db

Studio-specific:

  • All user-facing strings are translated properly
  • The notranslate class has been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)
  • All UI components are LTR and RTL compliant
  • Views are organized into pages, components, and layouts directories as described in the docs
  • Users' storage used is recalculated properly on any changes to main tree files
  • If this uses user data in new ways that need to be factored into our Privacy Policy, it has been noted

Testing:

  • Code is clean and well-commented
  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Any new interactions have been added to the QA Sheet
  • Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

This section is for reviewers to fill out.

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@vkWeb vkWeb requested review from rtibbles and bjester June 2, 2022 11:49
@vkWeb (Member Author) commented Jul 3, 2022

Update:

  • location_ids will not be queried.
  • The initial queryset will be reduced to just the nodes accessible to the user.
  • Sorting will only be done on the final result set.
  • The channel_id annotation will be done only on the nodes accessible to the user.

We need to test our changes against our gcloud develop SQL instance to verify the perf improvement. Once that's done I'll push commits here and it'll be ready to merge. (A rough sketch of the queryset changes above follows below.)
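
The reductions listed above might look roughly like this as a queryset (a sketch only; accessible_tree_ids, search_term, and the ordering are assumptions, not the PR's code):

  from django.contrib.postgres.search import SearchQuery
  from django.db.models import OuterRef, Subquery

  # Start from only the nodes the user can access, not the whole table.
  accessible_nodes = ContentNode.objects.filter(tree_id__in=accessible_tree_ids)

  # Filter by the search term; location_ids are never queried.
  matches = accessible_nodes.filter(search_vector=SearchQuery(search_term))

  # Annotate channel_id only on the accessible, matching nodes.
  matches = matches.annotate(
      channel_id=Subquery(
          Channel.objects.filter(main_tree__tree_id=OuterRef("tree_id")).values("pk")[:1]
      )
  )

  # Sort only the final result set.
  results = matches.order_by("title")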

  thumbnail_checksum=Subquery(thumbnails.values("checksum")[:1]),
  thumbnail_extension=Subquery(
      thumbnails.values("file_format__extension")[:1]
  ),
  content_tags=NotNullMapArrayAgg("tags__tag_name"),
- original_channel_name=original_channel_name,
+ original_channel_name=channel_name,
Member:

This seems like a change that introduces a regression. The query producing this annotation has changed.

Member Author:

This indeed introduces a regression, and the regression is intended. I have removed the logic for producing unique content nodes, so copied nodes (nodes with the same content_id but in different channels) will also be present in the result set. So the channel name should be the name of the channel the content node is in right now.

We do not need to tell the user the name of the channel it was copied from; we just need to let them know the channel it is in right now.

Does this make sense @bjester? What do you think?

Member:

This changes the meaning of original_channel_id. I think if we need to remove this field for performance reasons, then we should remove it, rather than update it to something inaccurate. We should check with @jtamiace on how best to handle this in the user interface if this is a strong performance concern.

Member Author:

Right now, we should keep this on hold until we evaluate the performance of the new full-text search. I am hoping our search will be efficient enough to allow us to run de-duplication of content_ids, in which case we will be able to keep the original_channel_name field intact.

Member:

@vkWeb To clarify, what do you mean by "we should keep this on hold"?

Member Author:

I have updated the queryset to annotate the original channel name.

But I have not de-duplicated the query, because two content nodes with the same content_id can have entirely different content inside; we don't know which node the user wants to import, so we probably should not de-duplicate. E.g., suppose an exercise was imported from a published channel and we completely changed the questions inside it but kept the metadata as is. Upon search, both nodes will show up, so how do we decide which node to show and which to discard when they have completely different content?

Instead, in a future PR we should rank the results appropriately and display only the top 5000 results, maybe? That would be more helpful from the user's perspective in my opinion, sir.

What are your thoughts, sir?
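
For reference, the ranking idea could use Postgres full-text ranking through Django; a minimal sketch, assuming the stored search_vector field (the limit and ordering are illustrative):

  from django.contrib.postgres.search import SearchQuery, SearchRank
  from django.db.models import F

  query = SearchQuery("math")
  ranked = (
      ContentNode.objects.filter(search_vector=query)
      .annotate(rank=SearchRank(F("search_vector"), query))
      .order_by("-rank")[:5000]  # keep only the best-ranked matches
  )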

Member:

> But I have not de-duplicated the query because two contentnodes with the same content_id can have entirely different content inside, we don't know what node the user wants to import so probably we should not de-duplicate

The fact that this is true is a long-standing bug in Studio, most acutely for exercises, but also applicable to other resources: #1055

Member Author:

Woah, I thought it was a desired behaviour... 👁️ By the way, in what scenarios do we care whether two nodes have the exact same content inside? 🤔

Member:

The content_id is used to track progress on a specific resource in Kolibri, so it has quite an important role, and when the value is conflated across multiple resources it causes issues.

Member Author:

Oh, that's pretty important. Then we should fix that bug, I think: if the node gets modified, we should change its content_id.

@vkWeb vkWeb marked this pull request as draft July 24, 2022 07:35
@vkWeb vkWeb marked this pull request as ready for review August 14, 2022 13:39
@vkWeb (Member Author) commented Aug 14, 2022

@bjester @rtibbles this is ready for review.

The only thing that has been bugging me for the past few days is that running the set_tsvectors command can take a long time (several days, maybe?) on the develop or production DB. Until we update the tsvector field, our search will remain broken. Any thoughts on this? I've read on Stack Overflow that dropping the indexes, then updating the column, then adding the indexes back can speed up the update, because an UPDATE is a delete + insert and so causes heavy index churn.
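
The drop-update-recreate approach would look roughly like this with raw SQL through Django's connection (a sketch; the table, column, and index names are assumptions):

  from django.db import connection

  with connection.cursor() as cursor:
      # Drop the GIN index so the bulk UPDATE doesn't have to maintain it.
      cursor.execute("DROP INDEX IF EXISTS node_search_vector_gin_idx")
      # Backfill the stored tsvector column in one pass.
      cursor.execute(
          "UPDATE contentcuration_contentnode "
          "SET search_vector = to_tsvector('simple', "
          "coalesce(title, '') || ' ' || coalesce(description, ''))"
      )
      # Recreate the index once the data is in place.
      cursor.execute(
          "CREATE INDEX node_search_vector_gin_idx "
          "ON contentcuration_contentnode USING gin (search_vector)"
      )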

Once we have updated the tsvectors, we can implement the following based on query performance:

  • de-duplicate content_ids.
  • rank results.

@vkWeb (Member Author) commented Aug 14, 2022

A relative performance benchmark of unstable versus full-text search with just a GIN index. The current implementation adds an explicit column to store the tsvectors, so query performance will be much better still, almost 8–10x.

[benchmark screenshot]
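
The difference being measured is roughly this (a sketch; field names assumed): computing the vector per row at query time versus matching a stored, GIN-indexed column.

  from django.contrib.postgres.search import SearchQuery, SearchVector

  query = SearchQuery("math")

  # On the fly: the tsvector is built for every row at query time.
  on_the_fly = ContentNode.objects.annotate(
      vector=SearchVector("title", "description")
  ).filter(vector=query)

  # Stored column: the precomputed, GIN-indexed tsvector is matched directly.
  stored = ContentNode.objects.filter(search_vector=query)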

@vkWeb vkWeb marked this pull request as draft August 15, 2022 21:17
@vkWeb (Member Author) commented Aug 15, 2022

Converting to draft because Blaine and I discussed that I should do more research on creating a new table for storing tsvectors.

@rtibbles (Member) left a comment:

Left some questions on review. I also attempted to manually test, but got a permissions error when trying to migrate. Traceback below:

  Applying contentcuration.0141_contentnode_search_vector...Traceback (most recent call last):
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.InsufficientPrivilege: permission denied to create extension "pg_trgm"
HINT:  Must be superuser to create this extension.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/richard/github/studio/./contentcuration/manage.py", line 11, in <module>
    execute_from_command_line(sys.argv)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/base.py", line 89, in wrapped
    res = handle_func(*args, **kwargs)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/commands/migrate.py", line 244, in handle
    post_migrate_state = executor.migrate(
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/executor.py", line 117, in migrate
    state = self._migrate_all_forwards(state, plan, full_plan, fake=fake, fake_initial=fake_initial)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/executor.py", line 147, in _migrate_all_forwards
    state = self.apply_migration(state, migration, fake=fake, fake_initial=fake_initial)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/executor.py", line 227, in apply_migration
    state = migration.apply(state, schema_editor)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/migration.py", line 126, in apply
    operation.database_forwards(self.app_label, schema_editor, old_state, project_state)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/contrib/postgres/operations.py", line 25, in database_forwards
    schema_editor.execute(
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/base/schema.py", line 145, in execute
    cursor.execute(sql, params)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: permission denied to create extension "pg_trgm"
HINT:  Must be superuser to create this extension.
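
For anyone hitting this locally, the usual workaround (not discussed in this thread, and the credentials below are placeholders) is to create the extension once as a Postgres superuser so the migration can proceed:

  import psycopg2

  # Connect as a superuser; dbname/user/password here are placeholders.
  conn = psycopg2.connect(dbname="kolibri-studio", user="postgres", password="...")
  conn.autocommit = True
  with conn.cursor() as cursor:
      cursor.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
  conn.close()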

Resolved review threads:
  • contentcuration/contentcuration/models.py (3)
  • contentcuration/contentcuration/settings.py
  • contentcuration/contentcuration/viewsets/contentnode.py
  • contentcuration/search/tests/test_search.py
@rtibbles (Member) left a comment:

Some thoughts on how the tsvector updating at publish time could be made a bit simpler, and potentially reduce memory usage.

Resolved review threads:
  • contentcuration/contentcuration/utils/publish.py (2)
@vkWeb vkWeb requested review from bjester and rtibbles September 21, 2022 17:33
@vkWeb (Member Author) commented Sep 22, 2022

Converting this to a draft to pause reviews. I found a strange bug that sets tsvectors to NULL during updates on publish. I'll discuss this with sir @bjester on Slack and will reopen for final review very soon.

cc @rtibbles.

@vkWeb vkWeb marked this pull request as draft September 22, 2022 11:39
@vkWeb vkWeb marked this pull request as ready for review September 28, 2022 20:22
@vkWeb (Member Author) commented Sep 28, 2022

Richard sir and I ran a profiling test of the current implementation against his suggested method.

We found that the current implementation (first updating only the changed nodes, then bulk-creating new nodes in chunks) is around 25% faster and also more memory efficient. So we are safe to go ahead with the current implementation.

cc @bjester @rtibbles.

[profiling screenshot]
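
The update-then-create strategy described above might look roughly like this (a sketch; the changed flag, chunk size, and field names are assumptions):

  from django.contrib.postgres.search import SearchVector

  CHUNK_SIZE = 10000  # assumed

  # 1) Update tsvectors in place, touching only the changed nodes.
  ContentNode.objects.filter(changed=True).update(
      search_vector=SearchVector("title", "description")
  )

  # 2) Bulk-create brand-new rows in chunks to bound memory use.
  #    new_nodes: a list of unsaved ContentNode instances (assumed).
  ContentNode.objects.bulk_create(new_nodes, batch_size=CHUNK_SIZE)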

@vkWeb vkWeb requested a review from rtibbles September 28, 2022 20:48
@rtibbles (Member) left a comment:

Nothing blocking from me - quite happy to get this merged and iterate in unstable.

Resolved review thread: contentcuration/search/serializers.py
@bjester (Member) left a comment:

LGTM! Thanks for all of your hard work on this @vkWeb! 🎉

call_command(
    "set_contentnode_tsvectors",
    "--channel-id={}".format(channel_id),
    "--tree-id={}".format(channel["main_tree__tree_id"]),
    "--complete",
)
Member:

Any reason not to pass the command options as keyword args? It would be slightly cleaner. The only caveat is that it bypasses the argument parser, but for your purposes, I don't see that there would be a difference. https://docs.djangoproject.com/en/3.2/ref/django-admin/#django.core.management.call_command
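
For reference, the keyword-argument form being suggested would look something like this (the kwarg names are assumed to match the parser's dests):

  call_command(
      "set_contentnode_tsvectors",
      channel_id=channel_id,
      tree_id=channel["main_tree__tree_id"],
      complete=True,
  )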

Member Author:

The reason I chose this was to let the argument parser get invoked. It helped me manually test (and gave me confidence about) what would happen when this command is run from the command line.

@bjester (Member) commented Oct 7, 2022

One last thing, @vkWeb: could you add a follow-up issue regarding the content deduplication that was removed?

@bjester bjester merged commit b7470b7 into learningequality:unstable Oct 7, 2022
@vkWeb (Member Author) commented Oct 8, 2022

Thank you sir @bjester and @rtibbles for your constant support throughout ❤️ All these feedback-iteration loops made this PR what it is today! Seeing this merged gives me joy 😄

@vkWeb (Member Author) commented Oct 8, 2022

@bjester I've added a follow-up issue for contentnode deduplication: #3725.
