Import from other channels search optimized #3399
Conversation
Update:
We need to test our changes with our gcloud develop SQL instance to verify the perf improvement. Once that's done I'll push commits here and it'll be ready to merge.
```diff
 thumbnail_checksum=Subquery(thumbnails.values("checksum")[:1]),
 thumbnail_extension=Subquery(
     thumbnails.values("file_format__extension")[:1]
 ),
 content_tags=NotNullMapArrayAgg("tags__tag_name"),
-original_channel_name=original_channel_name,
+original_channel_name=channel_name,
```
This seems like a change that introduces a regression. The query producing this annotation has changed.
This indeed introduces a regression, and the regression is intended. I have removed the logic for producing unique content nodes, so copied nodes (nodes with the same content_id but on different channels) will also be present in the result set. The channel name should therefore be the name of the channel the content node is in right now.
We do not need to tell the user the name of the channel it was copied from; we just need to let them know about the channel it is in right now.
Does this make sense @bjester? What do you think?
This changes the meaning of original_channel_id. I think if we need to remove this field for performance reasons, then we should remove it, rather than update it to something inaccurate. We should check with @jtamiace to see how best to handle this in the user interface if this is a strong performance concern.
Right now, we should keep this on hold until we evaluate the performance of the new full text search. I am hoping that our search will be efficient enough to allow us to run de-duplication of content_ids, and then we will be able to keep our original_channel_name field intact.
@vkWeb To clarify, what do you mean by "we should keep this on hold"?
I have updated the queryset to annotate the original channel name.
But I have not de-duplicated the query, because two content nodes with the same content_id can have entirely different content inside. We don't know which node the user wants to import, so we probably should not de-duplicate. For example, say an exercise was imported from a published channel and we completely changed the questions inside it but kept the metadata as is; upon search, both nodes will show up. How do we decide which node to show and which to discard, since both of them have completely different content?
Instead, in a future PR we should rank the results appropriately and maybe display only the top 5000 results (see the ranking sketch below). That would be more helpful from the user's perspective, in my opinion.
What are your thoughts, sir?
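A minimal sketch of that ranking idea, assuming a hypothetical ContentNodeFullTextSearch model with a keywords_tsvector SearchVectorField; the actual names in this PR may differ:

```python
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

query = SearchQuery("math")
top_results = (
    ContentNodeFullTextSearch.objects
    # Full text match against the stored tsvector (served by the GIN index).
    .filter(keywords_tsvector=query)
    # Rank matches by relevance and keep only the best 5000.
    .annotate(rank=SearchRank(F("keywords_tsvector"), query))
    .order_by("-rank")[:5000]
)
```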
> But I have not de-duplicated the query because two contentnodes with the same content_id can have entirely different content inside, we don't know what node the user wants to import so probably we should not de-duplicate

The fact that this is true is a long-standing bug in Studio, most acutely for exercises, but also applicable to other resources: #1055
Whoa, I thought it was desired behaviour... 👁️ By the way, in what scenarios do we care whether two nodes have the exact same content inside? 🤔
The content_id is used to track progress on a specific resource in Kolibri, so it has quite an important role, and when the value is conflated across multiple resources it causes issues.
Oh, that's pretty important. Then we should fix that bug, I think: if the node gets modified, we should change the content_id.
@bjester @rtibbles this is ready for review. The only thing that has been bugging me for the past few days is that calling …
Once we have updated the tsvector, we can implement the following based on query performance: …
Converting to draft because Blaine and I discussed that I should research more into creating a new table for storing tsvectors.
Left some questions on review. I also attempted to manually test, but got a permissions error when trying to migrate. Traceback below:
Applying contentcuration.0141_contentnode_search_vector...Traceback (most recent call last):
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
psycopg2.errors.InsufficientPrivilege: permission denied to create extension "pg_trgm"
HINT: Must be superuser to create this extension.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/richard/github/studio/./contentcuration/manage.py", line 11, in <module>
execute_from_command_line(sys.argv)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
utility.execute()
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/__init__.py", line 413, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/base.py", line 354, in run_from_argv
self.execute(*args, **cmd_options)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/base.py", line 398, in execute
output = self.handle(*args, **options)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/base.py", line 89, in wrapped
res = handle_func(*args, **kwargs)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/core/management/commands/migrate.py", line 244, in handle
post_migrate_state = executor.migrate(
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/executor.py", line 117, in migrate
state = self._migrate_all_forwards(state, plan, full_plan, fake=fake, fake_initial=fake_initial)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/executor.py", line 147, in _migrate_all_forwards
state = self.apply_migration(state, migration, fake=fake, fake_initial=fake_initial)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/executor.py", line 227, in apply_migration
state = migration.apply(state, schema_editor)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/migrations/migration.py", line 126, in apply
operation.database_forwards(self.app_label, schema_editor, old_state, project_state)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/contrib/postgres/operations.py", line 25, in database_forwards
schema_editor.execute(
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/base/schema.py", line 145, in execute
cursor.execute(sql, params)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/home/richard/.virtualenvs/studio/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: permission denied to create extension "pg_trgm"
HINT: Must be superuser to create this extension.
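For anyone hitting the same error locally: pg_trgm can only be installed by a superuser, so one workaround (an assumption about the local setup, not part of this PR) is to create the extension manually before migrating:

```sql
-- Run once against the Studio database as a Postgres superuser;
-- the migration then finds the extension already installed.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
```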
(Three resolved review threads on contentcuration/contentcuration/management/commands/set_tsvectors.py, marked outdated.)
Some thoughts on how the TSVector updating at publishing could be made a bit simpler, and potentially reduce memory usage.
Richard and I profiled the current implementation against his suggested method. We found that the current implementation (first updating only the changed nodes, then bulk creating new nodes in chunks) is around 25% faster and also more memory efficient, so we are safe to go ahead with the current implementation. A sketch of the pattern follows below.
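A rough sketch of the chunked bulk-create half of that pattern (the changed-node half is a single in-database UPDATE). All names here (ContentNodeFullTextSearch, the search_records relation, the vector annotation) are hypothetical stand-ins for the PR's actual code:

```python
from itertools import islice

from django.contrib.postgres.search import SearchVector

CHUNK_SIZE = 10000

def create_tsvector_records(tree_id):
    # Compute each new node's tsvector in-database via an annotation,
    # streaming (id, vector) pairs instead of loading full model instances.
    rows = (
        ContentNode.objects
        .filter(tree_id=tree_id, search_records__isnull=True)
        .annotate(vector=SearchVector("title", "description"))
        .values_list("id", "vector")
        .iterator()
    )

    # Insert search records one fixed-size chunk at a time, so memory
    # stays bounded regardless of how large the channel is.
    while True:
        chunk = [
            ContentNodeFullTextSearch(contentnode_id=pk, keywords_tsvector=vec)
            for pk, vec in islice(rows, CHUNK_SIZE)
        ]
        if not chunk:
            break
        ContentNodeFullTextSearch.objects.bulk_create(chunk)
```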
Nothing blocking from me - quite happy to get this merged and iterate in unstable.
LGTM! Thanks for all of your hard work on this @vkWeb! 🎉
```python
call_command("set_contentnode_tsvectors",
             "--channel-id={}".format(channel_id),
             "--tree-id={}".format(channel["main_tree__tree_id"]),
             "--complete")
```
Any reason not to pass the command options as keyword args? It would be slightly cleaner. The only caveat is that it bypasses the argument parser, but for your purposes, I don't see that there would be a difference. https://docs.djangoproject.com/en/3.2/ref/django-admin/#django.core.management.call_command
The reason I chose this was to let the argument parser get invoked. It helped me manually test (and gave me confidence about) what would happen when this command gets run from the command line.
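For reference, a sketch of the two call styles under discussion (channel_id and tree_id stand in for values from the surrounding code):

```python
from django.core.management import call_command

# String options go through the command's argument parser, exactly as if
# they had been typed on the command line:
call_command(
    "set_contentnode_tsvectors",
    "--channel-id={}".format(channel_id),
    "--tree-id={}".format(tree_id),
    "--complete",
)

# Keyword arguments skip the parser and are passed straight to the command;
# option names use their underscore "dest" form:
call_command(
    "set_contentnode_tsvectors",
    channel_id=channel_id,
    tree_id=tree_id,
    complete=True,
)
```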
One last thing, @vkWeb, could you add a follow-up issue regarding the content de-duplication that was removed?
Summary
This PR brings in full text search with GIN indexes to make "import from other channels" and "content library" searches super fast. Initially the investigation was done with @ozer550.
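As an illustration of the approach (hypothetical model and field names, not necessarily this PR's exact schema), a GIN-indexed tsvector column looks like this in Django:

```python
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models

class ContentNodeFullTextSearch(models.Model):
    # One row per content node; the tsvector holds the searchable keywords.
    keywords_tsvector = SearchVectorField(null=True)

    class Meta:
        # A GIN index keeps full text matches fast even on very large tables.
        indexes = [GinIndex(fields=["keywords_tsvector"])]
```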
Reviewer guidance
Here are some manual tests that MUST be performed to see if everything is working as expected:
- New nodes should show up in "import from other channel" search.
- Updated nodes should be reflected in "import from other channel" search. For example, if the updated title for a node is "New Title", then searching for "new" should output that node.
- The "import from other channels" search only searches published channels.
- In the "content library", searches for a channel by name, description, or UUID should return the channel.
- In the "content library", searches for nodes inside a channel should return the channel as well. For example, if a channel has a node with the title "Math exercise", then searching "math" should output that channel.
References
Closes #3186
Closes #2934
Contributor's Checklist
PR process:
- [ ] Has the CHANGELOG label been added to this PR? Note: items with this label will be added to the CHANGELOG at a later time
- [ ] Has the docs label been added if this introduces a change that needs to be updated in the user docs?
- [ ] If any Python requirements have changed, are the updated requirements.txt files also included in this PR?

Studio-specific:
- [ ] Has the notranslate class been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)?
- [ ] Are new UI components placed in the pages, components, and layouts directories as described in the docs?

Testing:

Reviewer's Checklist
This section is for reviewers to fill out.
- [ ] Are dependencies up to date (yarn and pip)?