Add ancestor information to summary and session log exports #12782
@@ -4,7 +4,6 @@
 import math
 import os
 from collections import OrderedDict
-from functools import partial

 from dateutil import parser
 from django.core.cache import cache
@@ -103,7 +102,35 @@ def cache_content_title(obj):
     )
 )

-map_object = partial(output_mapper, labels=labels, output_mappings=mappings)
+
+def get_max_ancestor_depth():
+    max_depth = 0
+    for node in ContentNode.objects.filter(
+        content_id__in=ContentSummaryLog.objects.values_list("content_id", flat=True)
+    ):
+        max_depth = max(max_depth, len(node.ancestors))
+    return max_depth
+
+
+def add_ancestors_info(row, ancestors, max_depth):
+    row.update(
+        {
+            f"Topic level {level + 1}": ancestors[level]["title"]
+            if level < len(ancestors)
+            else ""
+            for level in range(max_depth)
+        }
+    )
+
+
+def map_object(item):
+    mapped_item = output_mapper(item, labels=labels, output_mappings=mappings)
+    node = ContentNode.objects.filter(content_id=item["content_id"]).first()
We could halve the number of queries here by getting and caching the ancestor information when we get the title too, so perhaps we could have a general helper function for that. We could also fill this cache in the get_max_ancestor_depth calculation, so that it is already populated, as we have had to iterate over every content node already to calculate this.
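A rough sketch of what that suggestion could look like; the helper name, cache key, timeout, and import paths are my own guesses and not part of this PR:

```python
from django.core.cache import cache

from kolibri.core.content.models import ContentNode  # assumed import path


def get_cached_ancestor_info(content_id):
    """Hypothetical helper: fetch title and ancestors once per content_id and
    memoize them in the Django cache, so map_object does not need its own
    ContentNode query for every row."""
    cache_key = "content_ancestor_info_{}".format(content_id)
    info = cache.get(cache_key)
    if info is None:
        node = ContentNode.objects.filter(content_id=content_id).first()
        info = {
            "title": node.title if node else "",
            "ancestors": node.ancestors if node else [],
        }
        # get_max_ancestor_depth could call this while it iterates, so the
        # cache is already warm by the time rows are mapped.
        cache.set(cache_key, info, 60 * 10)
    return info
```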
+    if node and node.ancestors:
+        add_ancestors_info(mapped_item, node.ancestors, get_max_ancestor_depth())
+    else:
+        add_ancestors_info(mapped_item, [], get_max_ancestor_depth())
+    return mapped_item
+
+
 classes_info = {
@@ -171,11 +198,18 @@ def csv_file_generator(
         queryset = queryset.filter(start_timestamp__lte=end)

     # Exclude completion timestamp for the sessionlog CSV
-    header_labels = tuple(
+    header_labels = list(
         label
         for label in labels.values()
         if log_type == "summary" or label != labels["completion_timestamp"]
     )
+    # len of topic headers should be equal to the max depth of the content node
+    topic_headers = [
+        (f"Topic level {i+1}", _(f"Topic level {i+1}"))
+        for i in range(get_max_ancestor_depth())
+    ]
+
+    header_labels += [label for _, label in topic_headers]

     csv_file = open_csv_for_writing(filepath)
We should probably batch this query, and iterate over slices of 500 content_ids at a time.
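A minimal sketch of that batching idea, assuming the existing ContentNode and ContentSummaryLog models; the function signature, batch size handling, and import paths are illustrative rather than the code that was eventually merged:

```python
from kolibri.core.content.models import ContentNode  # assumed import path
from kolibri.core.logger.models import ContentSummaryLog  # assumed import path


def get_max_ancestor_depth(batch_size=500):
    # Materialize the distinct content_ids once, then query ContentNode in
    # slices of batch_size so each IN (...) clause stays small.
    content_ids = list(
        ContentSummaryLog.objects.values_list("content_id", flat=True).distinct()
    )
    max_depth = 0
    for start in range(0, len(content_ids), batch_size):
        batch = content_ids[start:start + batch_size]
        for node in ContentNode.objects.filter(content_id__in=batch):
            max_depth = max(max_depth, len(node.ancestors))
    return max_depth
```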
Asking to learn: if we don't batch with 500 content_ids at a time, how many content_ids at a time would it take for this function to become memory-inefficient?
I could be wrong, but my intuition here is that a very long IN lookup will make the query perform poorly, as I think it is equivalent to an OR-ed equality check against each value, so the more values there are, the longer the lookup will take. It's possible that the query planner optimizes this, so it's worth testing whether it makes any difference; if you try that and see no impact, feel free to disregard this.
Understood! There wasn't much difference when I tested with a small dataset, but I added iterator() anyway.
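For readers unfamiliar with it, here is a sketch of how iterator() would fit into get_max_ancestor_depth; this is my reading of the change rather than the exact merged code:

```python
def get_max_ancestor_depth():
    max_depth = 0
    # iterator() streams rows from the database in chunks instead of loading
    # and caching the whole queryset in memory at once.
    nodes = ContentNode.objects.filter(
        content_id__in=ContentSummaryLog.objects.values_list("content_id", flat=True)
    ).iterator()
    for node in nodes:
        max_depth = max(max_depth, len(node.ancestors))
    return max_depth
```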