
Add ancestor information to summary and session log exports #12782

Merged · 6 commits · Nov 6, 2024
Changes from 4 commits
40 changes: 37 additions & 3 deletions kolibri/core/logger/csv_export.py
@@ -4,7 +4,6 @@
import math
import os
from collections import OrderedDict
from functools import partial

from dateutil import parser
from django.core.cache import cache
@@ -103,7 +102,35 @@ def cache_content_title(obj):
)
)

map_object = partial(output_mapper, labels=labels, output_mappings=mappings)

def get_max_ancestor_depth():
    max_depth = 0
    for node in ContentNode.objects.filter(
        content_id__in=ContentSummaryLog.objects.values_list("content_id", flat=True)
Member:

We should probably batch this query, and iterate over slices of 500 content_ids at a time.

Contributor Author:

Asking to learn: if we don't batch the operation into 500 content_ids at a time, roughly how many content_ids at a time would it take for this function to become memory-inefficient?

Member:

I could be wrong, but my intuition is that a very long IN lookup will make the query perform poorly, since I think it is equivalent to doing an OR-ed equality check against each value, so the more values there are, the longer the lookup will take. It's possible the query planner optimizes this, so it's worth testing to see if it makes any difference; if you try that and see no impact, feel free to disregard!

Contributor Author:

Understood!
There wasn't much difference when I tested with a small dataset, but I added iterator() anyway.

    ):
        max_depth = max(max_depth, len(node.ancestors))
    return max_depth
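
(For reference, a minimal sketch of the batching approach discussed in the thread above, assuming the same ContentNode and ContentSummaryLog models; the 500-item chunk size and the helper name are illustrative, not part of this diff.)

def get_max_ancestor_depth_batched(chunk_size=500):
    # Illustrative only: compute the max ancestor depth over bounded batches
    # of content_ids instead of one very long IN (...) clause.
    max_depth = 0
    content_ids = list(
        ContentSummaryLog.objects.values_list("content_id", flat=True).distinct()
    )
    for start in range(0, len(content_ids), chunk_size):
        chunk = content_ids[start : start + chunk_size]
        # iterator() streams rows instead of caching the whole queryset in memory.
        for node in ContentNode.objects.filter(content_id__in=chunk).iterator():
            max_depth = max(max_depth, len(node.ancestors))
    return max_depth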


def add_ancestors_info(row, ancestors, max_depth):
    row.update(
        {
            f"Topic level {level + 1}": ancestors[level]["title"]
            if level < len(ancestors)
            else ""
            for level in range(max_depth)
        }
    )
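
(A quick illustration of what add_ancestors_info produces; the row and ancestor values here are made up.)

row = {"content_id": "abc123"}
ancestors = [{"title": "Mathematics"}, {"title": "Algebra"}]
add_ancestors_info(row, ancestors, max_depth=3)
# row is now:
# {"content_id": "abc123", "Topic level 1": "Mathematics",
#  "Topic level 2": "Algebra", "Topic level 3": ""}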


def map_object(item):
    mapped_item = output_mapper(item, labels=labels, output_mappings=mappings)
    node = ContentNode.objects.filter(content_id=item["content_id"]).first()
Member:

We could halve the number of queries here by getting and caching the ancestor information when we get the title too.

So perhaps we have a general helper function cache_content_data that is used by a renamed get_content_title and a new get_content_ancestors - each one calls the cache_content_data function that queries for the content node and stores both the title and the ancestors in the cache and returns them.

We could also fill this cache during the get_max_ancestor_depth calculation, so that it is already populated, since we've had to iterate over every content node to calculate this anyway.

    if node and node.ancestors:
        add_ancestors_info(mapped_item, node.ancestors, get_max_ancestor_depth())
    else:
        add_ancestors_info(mapped_item, [], get_max_ancestor_depth())
    return mapped_item
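
(A rough sketch of the shared-cache helper suggested in the review comment above; it reuses the django.core.cache import already in this module, while the function names, cache key format, and timeout are illustrative, not part of this diff.)

def cache_content_data(content_id):
    # Illustrative only: fetch the node once and cache both title and ancestors,
    # so the title lookup and the ancestor lookup share a single query.
    cache_key = "content_data_{}".format(content_id)
    cached = cache.get(cache_key)
    if cached is None:
        node = ContentNode.objects.filter(content_id=content_id).first()
        cached = {
            "title": node.title if node else "",
            "ancestors": node.ancestors if node else [],
        }
        cache.set(cache_key, cached, 60 * 10)  # arbitrary 10-minute timeout
    return cached


def get_content_title(content_id):
    return cache_content_data(content_id)["title"]


def get_content_ancestors(content_id):
    return cache_content_data(content_id)["ancestors"]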


classes_info = {
@@ -171,11 +198,18 @@ def csv_file_generator(
queryset = queryset.filter(start_timestamp__lte=end)

    # Exclude completion timestamp for the sessionlog CSV
    header_labels = tuple(
    header_labels = list(
        label
        for label in labels.values()
        if log_type == "summary" or label != labels["completion_timestamp"]
    )
    # The number of topic headers should equal the maximum ancestor depth of the content nodes
    topic_headers = [
        (f"Topic level {i+1}", _(f"Topic level {i+1}"))
        for i in range(get_max_ancestor_depth())
    ]

    header_labels += [label for _, label in topic_headers]

    csv_file = open_csv_for_writing(filepath)
