
Add ancestor information to summary and session log exports #12782

Merged · 6 commits merged into learningequality:develop on Nov 6, 2024

Conversation

@thesujai (Contributor) commented Nov 4, 2024

Summary

Adds the ancestor information (ContentNode titles) for each piece of content to the log exports before writing to CSV.

References

Fixes #12691

Reviewer guidance


Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@github-actions bot added the "DEV: backend (Python, databases, networking, filesystem...)" label on Nov 4, 2024
@thesujai (Contributor, Author) commented Nov 4, 2024

@nucleogenesis @rtibbles
A few more things:

  1. Should we cache the max_ancestor_depth in get_max_ancestor_depth()?
  2. Are unit tests required for this PR?

@rtibbles (Member) left a comment

I think we can optimize this a bit more, as we are currently fetching contentnode data three times, when we could have a helper function to cache it better between runs.


def map_object(item):
    mapped_item = output_mapper(item, labels=labels, output_mappings=mappings)
    node = ContentNode.objects.filter(content_id=item["content_id"]).first()
@rtibbles (Member) commented:

We could halve the number of queries here by getting and caching the ancestor information when we get the title too.

So perhaps we have a general helper function cache_content_data that is used by a renamed get_content_title and a new get_content_ancestors. Each one calls the cache_content_data function, which queries for the content node, stores both the title and the ancestors in the cache, and returns them.

We could also fill this cache in the get_max_ancestor_depth calculation, so that it is already populated, as we've had to iterate over every content node already to calculate this. A rough sketch of this shape follows.
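A minimal sketch of the helper described above (the function names follow the comment; the module-level cache dict and the assumption that ContentNode exposes title and ancestors data are illustrative, not the merged implementation):

_content_cache = {}

def cache_content_data(content_id):
    # Query for the content node once per content_id, then remember both
    # the title and the ancestors so later lookups skip the database.
    if content_id not in _content_cache:
        node = ContentNode.objects.filter(content_id=content_id).first()
        _content_cache[content_id] = {
            "title": node.title if node else "",
            # Assumes ContentNode carries the ancestor data discussed here.
            "ancestors": node.ancestors if node else [],
        }
    return _content_cache[content_id]

def get_content_title(content_id):
    return cache_content_data(content_id)["title"]

def get_content_ancestors(content_id):
    return cache_content_data(content_id)["ancestors"]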

def get_max_ancestor_depth():
    max_depth = 0
    for node in ContentNode.objects.filter(
        content_id__in=ContentSummaryLog.objects.values_list("content_id", flat=True)
@rtibbles (Member) commented:

We should probably batch this query, and iterate over slices of 500 content_ids at a time.
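For instance, a batched lookup might be shaped like this (a sketch only; BATCH_SIZE and the helper name are illustrative assumptions, not the merged code):

BATCH_SIZE = 500

def iterate_nodes_in_batches(content_ids):
    # Slice the id list so each query's IN clause carries at most
    # BATCH_SIZE values.
    content_ids = list(content_ids)
    for start in range(0, len(content_ids), BATCH_SIZE):
        batch = content_ids[start:start + BATCH_SIZE]
        for node in ContentNode.objects.filter(content_id__in=batch):
            yield node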

@thesujai (Contributor, Author) replied:

Asking to learn: if we don't do a batched operation with 500 content_ids at a time, how many content_ids would it take for this function to become memory-inefficient?

@rtibbles (Member) replied:

I could be wrong, but my intuition here is that having a very long IN lookup will make the query perform poorly, as I think it is equivalent to doing an OR-ed equality check against each value, so the more values there are, the longer the lookup will take. It's possible that the query planner optimizes this, so it's worth testing to see if it makes any difference; if you try that and see no impact, disregard!

@thesujai (Contributor, Author) replied:

Understood! There wasn't much difference when I tested with a small dataset, but I added iterator() anyway.

@thesujai force-pushed the ancestor-info-in-logs branch from fac9514 to 9927b62 on November 4, 2024 at 20:35
content_ids = ContentSummaryLog.objects.values_list("content_id", flat=True)
nodes = (
    ContentNode.objects.filter(content_id__in=content_ids)
    .only("content_id")
@rtibbles (Member) commented:

Any reason not to get the title and ancestors fields in this fetch and seed the cache with them?

That would bring us from N + 1 queries down to N / BATCH_SIZE queries!
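Something along these lines (hypothetical, reusing the _content_cache dict from the earlier sketch, and assuming title and ancestors are fields that can be fetched on this queryset, as the comment suggests):

for node in ContentNode.objects.filter(content_id__in=content_ids).only(
    "content_id", "title", "ancestors"
):
    # Seed the cache during this single batched fetch, so the per-row
    # title/ancestor lookups never have to hit the database again.
    _content_cache[node.content_id] = {
        "title": node.title,
        "ancestors": node.ancestors,
    }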

nodes = (
    ContentNode.objects.filter(content_id__in=content_ids)
    .only("content_id")
    .iterator(chunk_size=BATCH_SIZE)
@rtibbles (Member) commented:

I don't know if this will address the issue I was suggesting, which is that the performance problem might come from the length of the content_ids subquery. So maybe let's just remove the iterator for now, and see whether there are actually performance issues here in practice when we test.

@thesujai force-pushed the ancestor-info-in-logs branch 2 times, most recently from 721a096 to 789d140 on November 5, 2024 at 13:59
@thesujai (Contributor, Author) commented Nov 5, 2024

It is super optimized now (maybe).

@thesujai force-pushed the ancestor-info-in-logs branch from 789d140 to 03113e0 on November 5, 2024 at 14:05
@rtibbles self-assigned this on Nov 5, 2024
@rtibbles (Member) left a comment

This is looking good to me. In manual testing, I've spotted a pre-existing issue where we are looking up by the content_id but not also by the channel_id, but I think we can fix that in a follow-up issue.

The only other thought here is that it might be good for the topic headers to appear in the CSV between the Channel name and the Content id headers.

Interested in @radinamatic and @pcenov's thoughts here!

@rtibbles (Member) left a comment

Oh, actually, I realized that we should exclude the highest-level topic, as it is just the channel name; so 'topic level 1' should start at the second entry in the ancestors list, and the max depth will be one less than the maximum ancestors length (see the sketch below).

If you could insert the topic headers after the Channel name header and before the Content id header while you are updating this, that would be great, thanks!
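Illustratively, the slicing could look like this (hypothetical; assumes get_content_ancestors returns the full ancestor list starting at the channel, and that each entry carries a title):

ancestors = get_content_ancestors(item["content_id"])
# Drop the first ancestor: it is just the channel name, which already has
# its own "Channel name" column in the export.
topic_ancestors = ancestors[1:]
for depth, ancestor in enumerate(topic_ancestors, start=1):
    mapped_item["Topic level {}".format(depth)] = ancestor["title"]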

@thesujai (Contributor, Author) commented Nov 6, 2024

@rtibbles Updated! If it can be optimized more, let me know.

@rtibbles (Member) left a comment

Perfect! Thank you, @thesujai - this is wonderful.

@rtibbles merged commit 9e119ca into learningequality:develop on Nov 6, 2024
34 checks passed
Labels: DEV: backend (Python, databases, networking, filesystem...)
Projects: None yet
Development: successfully merging this pull request may close issue #12691 (Add ancestor information to summary and session log exports)
2 participants