Add ancestor information to summary and session log exports #12782
Conversation
@nucleogenesis @rtibbles
I think we can optimize this a bit more, as we are currently fetching ContentNode data three times, when we could have a helper function to cache it better between calls.
kolibri/core/logger/csv_export.py (outdated)
def map_object(item):
    mapped_item = output_mapper(item, labels=labels, output_mappings=mappings)
    node = ContentNode.objects.filter(content_id=item["content_id"]).first()
We could halve the number of queries here by getting and caching the ancestor information when we get the title too.
So perhaps we have a general helper function cache_content_data that is used by a renamed get_content_title and a new get_content_ancestors - each one calls the cache_content_data function, which queries for the content node, stores both the title and the ancestors in the cache, and returns them.
We could also fill this cache in the get_max_ancestor_depth calculation, so that it is already populated, as we've had to iterate over every content node already to calculate this.
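A minimal sketch of what that could look like, using the helper names proposed above; the module-level dict cache and the use of MPTT's get_ancestors() here are assumptions rather than the final implementation:

_content_data_cache = {}


def cache_content_data(content_id):
    # Look the node up once and remember both the title and the ancestors.
    if content_id not in _content_data_cache:
        node = ContentNode.objects.filter(content_id=content_id).first()
        _content_data_cache[content_id] = {
            "title": node.title if node else "",
            "ancestors": list(node.get_ancestors().values("id", "title")) if node else [],
        }
    return _content_data_cache[content_id]


def get_content_title(content_id):
    return cache_content_data(content_id)["title"]


def get_content_ancestors(content_id):
    return cache_content_data(content_id)["ancestors"]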
kolibri/core/logger/csv_export.py (outdated)
def get_max_ancestor_depth():
    max_depth = 0
    for node in ContentNode.objects.filter(
        content_id__in=ContentSummaryLog.objects.values_list("content_id", flat=True)
We should probably batch this query, and iterate over slices of 500 content_ids at a time.
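For illustration only, the batching could look roughly like this; BATCH_SIZE and the use of the MPTT level attribute to measure depth are assumptions, not the code under review:

BATCH_SIZE = 500


def get_max_ancestor_depth():
    max_depth = 0
    content_ids = list(
        ContentSummaryLog.objects.values_list("content_id", flat=True).distinct()
    )
    # Query in slices of BATCH_SIZE ids so the IN clause stays short.
    for start in range(0, len(content_ids), BATCH_SIZE):
        batch = content_ids[start:start + BATCH_SIZE]
        for node in ContentNode.objects.filter(content_id__in=batch):
            # node.level is the MPTT depth of the node; a stand-in for however
            # the real code measures ancestor depth.
            max_depth = max(max_depth, node.level)
    return max_depth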
Asking to learn: if we don't do the batch operation with 500 content_ids at a time, how many content_ids at a time would it take for this function to become memory inefficient?
I could be wrong, but my intuition here is that having a very long IN lookup will make the query perform poorly, as I think it is equivalent to doing an OR-ed equality check against each value, so the more values there are, the longer the lookup will take. It's possible the query planner optimizes this, so it's worth testing to see if it makes any difference; if you try that and see no impact, feel free to disregard!
Understood!
There wasn't much difference when I tested with a small dataset. I added iterator() anyway.
Force-pushed fac9514 to 9927b62
kolibri/core/logger/csv_export.py (outdated)
content_ids = ContentSummaryLog.objects.values_list("content_id", flat=True)
nodes = (
    ContentNode.objects.filter(content_id__in=content_ids)
    .only("content_id")
Any reason not to get the title and ancestors fields in this fetch and seed the cache with it?
That would bring us from N + 1 queries, down to N / BATCH_SIZE queries!
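A rough sketch of seeding such a cache during this fetch; the extra fields pulled into .only() and the get_ancestors() call reuse the hypothetical cache from the earlier sketch and are assumptions:

for node in ContentNode.objects.filter(content_id__in=content_ids).only(
    "content_id", "title", "lft", "rght", "tree_id", "level"
):
    # Seed the cache up front so map_object never issues a per-row query.
    # Note: get_ancestors() here is still one query per node; the reviewer's
    # query-count estimate assumes the ancestors arrive in the same fetch.
    _content_data_cache[node.content_id] = {
        "title": node.title,
        "ancestors": list(node.get_ancestors().values("id", "title")),
    }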
kolibri/core/logger/csv_export.py (outdated)
nodes = (
    ContentNode.objects.filter(content_id__in=content_ids)
    .only("content_id")
    .iterator(chunk_size=BATCH_SIZE)
I don't know if this will address the issue I was suggesting, which is that the performance problem might come from the length of the content_ids subquery - so maybe let's just remove the iterator for now, and see whether there are actually performance issues here in practice when we test.
Force-pushed 721a096 to 789d140
It is super optimized now (maybe)
Force-pushed 789d140 to 03113e0
This is looking good to me. In manual testing, I've spotted a pre-existing issue where we are looking up by the content_id but not also the channel_id, but I think we can fix that in a follow up issue.
The only other thought here is that it might be good for the topic headers to appear in the CSV between the Channel name and the Content id headers.
Interested in @radinamatic and @pcenov's thoughts here!
Oh, actually - I realized that we should exclude the highest level topic, as it is just the channel name, so 'topic level 1' should start at the second entry in the ancestors list, and max depth will be 1 less than the max of the ancestors length.
If you could insert the topic headers after the Channel name header and before the Content id header while you are updating this, that would be great, thanks!
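A small sketch of that adjustment, assuming the hypothetical get_content_ancestors() and get_max_ancestor_depth() helpers discussed above, with ancestors as a list of dicts keyed by title:

# Skip the first ancestor, which is just the channel itself, so that
# "Topic level 1" is the first real topic and the depth drops by one.
ancestors = get_content_ancestors(item["content_id"])[1:]
topic_headers = [
    "Topic level {}".format(level + 1)
    for level in range(get_max_ancestor_depth() - 1)
]
# These headers would sit after the "Channel name" header and before the
# "Content id" header in the CSV header row.
topic_values = {
    header: ancestors[index]["title"] if index < len(ancestors) else ""
    for index, header in enumerate(topic_headers)
}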
@rtibbles Updated!
Perfect! Thank you, @thesujai - this is wonderful.
Summary
Adds the ancestor information (ContentNode titles) for each piece of content to the log exports before writing the CSV.
References
Fixes #12691
Reviewer guidance
Testing checklist
PR process
Reviewer checklist