
Add ancestor information to summary and session log exports #12782

Merged · 6 commits merged into learningequality:develop on Nov 6, 2024

Conversation

@thesujai (Contributor) commented Nov 4, 2024

Summary

Adds the ancestor information (ContentNode titles) for each piece of content to the log exports before writing to CSV.

References

Fixes #12691

Reviewer guidance


Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@github-actions bot added the "DEV: backend (Python, databases, networking, filesystem...)" label on Nov 4, 2024
@thesujai (Contributor, Author) commented Nov 4, 2024

@nucleogenesis @rtibbles
A few more things:

  1. Should we cache the max_ancestor_depth in get_max_ancestor_depth()?
  2. Are unit tests required for this PR?

@rtibbles (Member) left a comment

I think we can optimize this a bit more, as we are currently fetching contentnode data three times, when we could have a helper function to cache it better between runs.


def map_object(item):
    mapped_item = output_mapper(item, labels=labels, output_mappings=mappings)
    node = ContentNode.objects.filter(content_id=item["content_id"]).first()
@rtibbles (Member) commented:

We could halve the number of queries here by getting and caching the ancestor information when we get the title too.

So perhaps we have a general helper function cache_content_data that is used by a renamed get_content_title and a new get_content_ancestors. Each one calls the cache_content_data function, which queries for the content node, stores both the title and the ancestors in the cache, and returns them.

We could also fill this cache in the get_max_ancestor_depth calculation, so that it is already populated, as we've had to iterate over every content node already to calculate this. A rough sketch of this shape follows.
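A minimal sketch of the helper described above (the function names follow the comment; the module-level cache dict and the assumption that ContentNode exposes title and ancestors data are illustrative, not the merged implementation):

_content_cache = {}

def cache_content_data(content_id):
    # Query for the content node once per content_id, then remember both
    # the title and the ancestors so later lookups skip the database.
    if content_id not in _content_cache:
        node = ContentNode.objects.filter(content_id=content_id).first()
        _content_cache[content_id] = {
            "title": node.title if node else "",
            # Assumes ContentNode carries the ancestor data discussed here.
            "ancestors": node.ancestors if node else [],
        }
    return _content_cache[content_id]

def get_content_title(content_id):
    return cache_content_data(content_id)["title"]

def get_content_ancestors(content_id):
    return cache_content_data(content_id)["ancestors"]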

def get_max_ancestor_depth():
    max_depth = 0
    for node in ContentNode.objects.filter(
        content_id__in=ContentSummaryLog.objects.values_list("content_id", flat=True)
@rtibbles (Member) commented:

We should probably batch this query, and iterate over slices of 500 content_ids at a time.
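For instance, a batched lookup might be shaped like this (a sketch only; BATCH_SIZE and the helper name are illustrative assumptions, not the merged code):

BATCH_SIZE = 500

def iterate_nodes_in_batches(content_ids):
    # Slice the id list so each query's IN clause carries at most
    # BATCH_SIZE values.
    content_ids = list(content_ids)
    for start in range(0, len(content_ids), BATCH_SIZE):
        batch = content_ids[start:start + BATCH_SIZE]
        for node in ContentNode.objects.filter(content_id__in=batch):
            yield node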

@thesujai (Contributor, Author) replied:

Asking to learn: if we don't do a batched operation with 500 content_ids at a time, how many content_ids would it take for this function to become memory-inefficient?

@rtibbles (Member) replied:

I could be wrong, but my intuition here is that having a very long IN lookup will make the query perform poorly, as I think it is equivalent to doing an OR-ed equality check against each value, so the more values there are, the longer the lookup will take. It's possible that the query planner optimizes this, so it's worth testing to see if it makes any difference; if you try that and see no impact, disregard!

@thesujai (Contributor, Author) replied:

Understood! There wasn't much difference when I tested with a small dataset, but I added iterator() anyway.

@thesujai force-pushed the ancestor-info-in-logs branch from fac9514 to 9927b62 on November 4, 2024 at 20:35
content_ids = ContentSummaryLog.objects.values_list("content_id", flat=True)
nodes = (
    ContentNode.objects.filter(content_id__in=content_ids)
    .only("content_id")
@rtibbles (Member) commented:

Any reason not to get the title and ancestors fields in this fetch and seed the cache with them?

That would bring us from N + 1 queries down to N / BATCH_SIZE queries!
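Something along these lines (hypothetical, reusing the _content_cache dict from the earlier sketch, and assuming title and ancestors are fields that can be fetched on this queryset, as the comment suggests):

for node in ContentNode.objects.filter(content_id__in=content_ids).only(
    "content_id", "title", "ancestors"
):
    # Seed the cache during this single batched fetch, so the per-row
    # title/ancestor lookups never have to hit the database again.
    _content_cache[node.content_id] = {
        "title": node.title,
        "ancestors": node.ancestors,
    }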

nodes = (
    ContentNode.objects.filter(content_id__in=content_ids)
    .only("content_id")
    .iterator(chunk_size=BATCH_SIZE)
@rtibbles (Member) commented:

I don't know if this will address the issue I was suggesting, which is that the performance problem might come from the length of the content_ids subquery. So maybe let's just remove the iterator for now, and see whether there are actually performance issues here in practice when we test.

@thesujai force-pushed the ancestor-info-in-logs branch 2 times, most recently from 721a096 to 789d140 on November 5, 2024 at 13:59
@thesujai (Contributor, Author) commented Nov 5, 2024

It is super optimized now (maybe).

@thesujai force-pushed the ancestor-info-in-logs branch from 789d140 to 03113e0 on November 5, 2024 at 14:05
@rtibbles self-assigned this on Nov 5, 2024
@rtibbles (Member) left a comment

This is looking good to me. In manual testing, I've spotted a pre-existing issue where we are looking up by the content_id but not also by the channel_id, but I think we can fix that in a follow-up issue.

The only other thought here is that it might be good for the topic headers to appear in the CSV between the Channel name and the Content id headers.

Interested in @radinamatic and @pcenov's thoughts here!

@rtibbles (Member) left a comment

Oh, actually, I realized that we should exclude the highest-level topic, as it is just the channel name; so 'topic level 1' should start at the second entry in the ancestors list, and the max depth will be one less than the maximum ancestors length (see the sketch below).

If you could insert the topic headers after the Channel name header and before the Content id header while you are updating this, that would be great, thanks!
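Illustratively, the slicing could look like this (hypothetical; assumes get_content_ancestors returns the full ancestor list starting at the channel, and that each entry carries a title):

ancestors = get_content_ancestors(item["content_id"])
# Drop the first ancestor: it is just the channel name, which already has
# its own "Channel name" column in the export.
topic_ancestors = ancestors[1:]
for depth, ancestor in enumerate(topic_ancestors, start=1):
    mapped_item["Topic level {}".format(depth)] = ancestor["title"]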

@thesujai (Contributor, Author) commented Nov 6, 2024

@rtibbles Updated! If it can be optimized more, let me know.

@rtibbles (Member) left a comment

Perfect! Thank you, @thesujai - this is wonderful.

@rtibbles merged commit 9e119ca into learningequality:develop on Nov 6, 2024
34 checks passed
Labels: DEV: backend (Python, databases, networking, filesystem...)
Projects: None yet
Development: successfully merging this pull request may close issue #12691 (Add ancestor information to summary and session log exports)
2 participants