Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Content recommendation bugs #4675

Merged
merged 4 commits into from
Nov 19, 2015

Conversation

MCGallaspy
Copy link
Contributor

commit message:

Handles three errors.

1. In get_exercise_prereqs, if an unknown exercise id is passed in, it
now contributes nothing to the prereqs instead of raising an error.

2. Avoid a divide-by-zero error in coachreports.models.

3. Finally, in get_explore_recommendations, use of `next` built-in could
result in an unhandled StopIteration exception. Refactor to use for..in block.

@rtibbles inspite of the fact that no errors are now raised, the content_recommender api endpoint takes a really long time to return (one run took me 3.7 minutes). I think we can do two things (perhaps for 0.16), let me know if you agree:

  1. Add a placeholder when awaiting the response.
  2. Separate the content recommendation endpoint into three separate endpoints.

Handles three errors.

1. In get_exercise_prereqs, if an unknown exercise id is passed in, it
now contributes nothing to the prereqs instead of raising an error.

2. Avoid a divide-by-zero error in coachreports.models.

3. Finally, in get_explore_recommendations, use of `next` built-in could
result in an unhandled StopIteration exception. Refactor to use for..in block.
@MCGallaspy MCGallaspy added this to the 0.15.1 milestone Nov 18, 2015
@MCGallaspy
Copy link
Contributor Author

Line profiler output for the endpoint:

Timer unit: 4.27654e-07 s

Total time: 180.254 s
File: kalite\main\api_views.py
Function: content_recommender at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           @api_handle_error_with_json
    23                                           @profile
    24                                           def content_recommender(request):
    25                                               """Populate response with recommendation(s)"""
    26                                           
    27         1           50     50.0      0.0      user_id = request.GET.get('user', None)
    28         1           20     20.0      0.0      user = request.session.get('facility_user')
    29                                           
    30         1            5      5.0      0.0      if not user:
    31                                                   if request.user.is_authenticated() and request.user.is_superuser:
    32                                                       user = get_object_or_404(FacilityUser, pk=user_id)
    33                                                   else:
    34                                                       return JsonResponseMessageError("You are not authorized to view these recommendations.", status=401)
    35                                           
    36         1           28     28.0      0.0      resume = request.GET.get('resume', None)
    37         1           24     24.0      0.0      next = request.GET.get('next', None)
    38         1           25     25.0      0.0      explore = request.GET.get('explore', None)
    39                                           
    40         1            5      5.0      0.0      def set_bool_flag(flag_name, rec_dict):
    41                                                   rec_dict[flag_name] = True
    42                                                   return rec_dict
    43                                           
    44                                               # retrieve resume recommendation(s) and set resume boolean flag
    45         2       255808 127904.0      0.1      resume_recommendations = [set_bool_flag("resume", rec) for rec in get_resume_recommendations(user, request)] if resume else []
    46                                           
    47                                               # retrieve next_steps recommendations, set next_steps boolean flag, and flatten results for api response
    48         5    421123206 84224641.2     99.9      next_recommendations = [set_bool_flag("next", rec) for rec in get_next_recommendations(user, request)] if next else []
    49                                           
    50                                               # retrieve explore recommendations, set explore boolean flag, and flatten results for api response
    51         5       111215  22243.0      0.0      explore_recommendations = [set_bool_flag("explore", rec) for rec in get_explore_recommendations(user, request)] if explore else []
    52                                           
    53         1         3525   3525.0      0.0      return JsonResponse(resume_recommendations + next_recommendations + explore_recommendations)

@rtibbles
Copy link
Member

Yeah, this needs some optimizations... while splitting it into three endpoints could work, I think this shows that the way we are doing the 'next' recommendations currently is not helpful.

Most of this is because of the way that it builds a mapping of 'cousin' exercises, which then gets stored in a memory cache. I think we could do this more efficiently using path "contains" selectors to query exercises that are in the same broader topic.

Could you open an issue for this for develop?

@rtibbles
Copy link
Member

I got a similar test failure to your behave test failure here on my PR on develop also.

@MCGallaspy
Copy link
Contributor Author

I profiled get_next_recommendations afterwards. Again it's dominated by one hotspot:

Function: get_next_recommendations at line 53

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    53                                           @profile
    54                                           def get_next_recommendations(user, request):
    55                                               """Get the recommendations for the Next section, and return them as a list.
    56                                           
    57                                               Logic:
    58                                               Next recommendations are currently comprised of 3 main subgroups: group recommendations,
    59                                               struggling exercises, and topic tree based data. Group recommendations consist of 
    60                                               finding the most common item tackled immediately after the most recent item, struggling
    61                                               is determined by the "struggling" model attribute, and topic tree data is based off
    62                                               the graphical distance between neighboring exercise/topic nodes. 
    63                                               Args:
    64                                               user -- The current user as a facility user model object.
    65                                           
    66                                               """
    67                                           
    68         1      1792433 1792433.0      0.3      exercise_parents_table = get_exercise_parents_lookup_table()
    69                                           
    70         1         5511   5511.0      0.0      most_recent = get_most_recent_exercises(user)
    71                                           
    72         1       114403 114403.0      0.0      complete_exercises = set(get_completed_exercises(user))
    73                                           
    74         1           10     10.0      0.0      def filter_complete(ex):
    75                                                   return ex not in complete_exercises
    76                                           
    77         1       103816 103816.0      0.0      if len(most_recent) > 0 and most_recent[0] in exercise_parents_table:
    78         1           20     20.0      0.0          current_subtopic = exercise_parents_table[most_recent[0]]['subtopic_id']
    79                                               else:
    80                                                   current_subtopic = None
    81                                           
    82                                               #logic for recommendations based off of the topic tree structure
    83         1            3      3.0      0.0      if current_subtopic:
    84         1      1713957 1713957.0      0.3          topic_tree_based_data = generate_recommendation_data()[current_subtopic]['related_subtopics'][:settings.TOPIC_RECOMMENDATION_DEPTH]
    85         1        64643  64643.0      0.0          topic_tree_based_data = get_exercises_from_topics(topic_tree_based_data)
    86                                               else:
    87                                                   topic_tree_based_data = []
    88                                               
    89                                               #for checking that only exercises that have not been accessed are returned
    90        16          929     58.1      0.0      topic_tree_based_data = [ex for ex in topic_tree_based_data if ex not in most_recent or filter_complete(ex)]
    91                                           
    92                                               #logic to generate recommendations based on exercises student is struggling with
    93         1       103541 103541.0      0.0      struggling = filter(filter_complete, get_exercise_prereqs(get_struggling_exercises(user)))
    94                                           
    95                                               #logic to get recommendations based on group patterns, if applicable
    96         1    514234562 514234562.0     99.2      group = filter(filter_complete, get_group_recommendations(user))
    97                                             
    98                                               #now append titles and other metadata to each exercise id
    99         1            9      9.0      0.0      final = [] # final data to return
   100         6           20      3.3      0.0      for exercise_id in (group[:2] + struggling[:2] + topic_tree_based_data[:1]):  #notice the concatenation
   101                                           
   102         5           19      3.8      0.0          if exercise_id in exercise_parents_table:
   103         4           15      3.8      0.0              subtopic_id = exercise_parents_table[exercise_id]['subtopic_id']
   104         4           98     24.5      0.0              exercise = get_exercise_data(request, exercise_id)
   105         4           46     11.5      0.0              exercise["topic"] = get_topic_data(request, subtopic_id)
   106         4           11      2.8      0.0              final.append(exercise)
   107                                           
   108                                           
   109                                               #final recommendations are a combination of struggling, group filtering, and topic_tree filtering
   110         1            2      2.0      0.0      return final

@MCGallaspy
Copy link
Contributor Author

I got a similar test failure to your behave test failure here on my PR on develop also.

I'll take a look at it.

@rtibbles
Copy link
Member

Oh, I am slightly surprised by the location of the hotspot, but this may actually be an easier fix.

Would you like me to look at fixing this on 0.15? Looking at the function, I think I can see what is going on, and where it could be optimized.

@rtibbles
Copy link
Member

@MCGallaspy if this is indeed the culprit:

        #If the user has recently engaged with exercises, use these exercises as the basis for filtering
        user_exercises = ExerciseLog.objects\
            .filter(user__in=user_list).order_by("-latest_activity_timestamp")\
            .extra(select={'null_complete': "completion_timestamp is null"},
                order_by=["-null_complete", "-completion_timestamp"])

        exercise_counts = collections.defaultdict(lambda :0)

        for user in user_list:
            user_logs = user_exercises.filter(user=user)
            for i, log in enumerate(user_logs[1:]):
                prev_log = user_logs[i]
                if log.exercise_id in recent_exercises:
                    exercise_counts[prev_log.exercise_id] += 1

        exercise_counts = [{"exercise_id": key, "count": value} for key, value in exercise_counts.iteritems()]

then I think it could be collapsed into a single query without having to loop?

exercise_counts = ExerciseLog.objects.filter(user__in=user_list).values("exercise_id").annotate(count=Count("exercise_id"))

is the next line, which could be modified to include the exercise_id filter as well:

exercise_counts = ExerciseLog.objects.filter(user__in=user_list, exercise_id__in=recent_exercises).values("exercise_id").annotate(count=Count("exercise_id"))

Or, it's possible that just querying the database like this takes forever.

@MCGallaspy
Copy link
Contributor Author

I was on a roll, so I did one last profile run:

Timer unit: 4.27654e-07 s

Total time: 171.358 s
File: kalite\topic_tools\content_recommendation.py
Function: get_group_recommendations at line 111

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   111                                           @profile
   112                                           def get_group_recommendations(user):
   113                                               """Returns a list of exercises immediately tackled by other individuals in the same group."""
   114                                           
   115         1         5287   5287.0      0.0      recent_exercises = get_most_recent_exercises(user)
   116                                               
   117         1         2469   2469.0      0.0      user_list = FacilityUser.objects.filter(group=user.group)
   118                                           
   119         1        98899  98899.0      0.0      if recent_exercises:
   120                                           
   121                                                   #If the user has recently engaged with exercises, use these exercises as the basis for filtering
   122         1           19     19.0      0.0          user_exercises = ExerciseLog.objects\
   123         1         5583   5583.0      0.0              .filter(user__in=user_list).order_by("-latest_activity_timestamp")\
   124         1            6      6.0      0.0              .extra(select={'null_complete': "completion_timestamp is null"},
   125         1         2627   2627.0      0.0                  order_by=["-null_complete", "-completion_timestamp"])
   126                                               
   127         1            9      9.0      0.0          exercise_counts = collections.defaultdict(lambda :0)
   128                                           
   129        22         7033    319.7      0.0          for user in user_list:
   130        21        98796   4704.6      0.0              user_logs = user_exercises.filter(user=user)
   131      2681      4221168   1574.5      1.1              for i, log in enumerate(user_logs[1:]):
   132      2660    396087243 148905.0     98.9                  prev_log = user_logs[i]
   133      2660       142120     53.4      0.0                  if log.exercise_id in recent_exercises:
   134      2427        20079      8.3      0.0                      exercise_counts[prev_log.exercise_id] += 1
   135                                           
   136       333         1168      3.5      0.0          exercise_counts = [{"exercise_id": key, "count": value} for key, value in exercise_counts.iteritems()]
   137                                           
   138                                               else:
   139                                                   #If not, only look at the group data
   140                                                   exercise_counts = ExerciseLog.objects.filter(user__in=user_list).values("exercise_id").annotate(count=Count("exercise_id"))
   141                                           
   142                                               #sort the results in order of highest counts to smallest
   143         1          537    537.0      0.0      sorted_counts = sorted(exercise_counts, key=lambda k:k['count'], reverse=False)
   144                                               
   145                                               #the final list of recommendations to return, WITHOUT counts
   146       333         1002      3.0      0.0      group_rec = [c['exercise_id'] for c in sorted_counts]
   147                                           
   148         1            3      3.0      0.0      return group_rec

Total time: 0.00688266 s
File: kalite\topic_tools\content_recommendation.py
Function: get_most_recent_exercises at line 292

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   292                                           @profile
   293                                           def get_most_recent_exercises(user):
   294                                               """Return a list of the most recent exercises (ids) accessed by the user."""
   295                                           
   296         3        16085   5361.7     99.9      exercises_by_user = ExerciseLog.objects.filter(user=user).order_by("-latest_activity_timestamp").values_list("exercise_id", flat=True)
   297                                            
   298         3            9      3.0      0.1      return exercises_by_user

@rtibbles
Copy link
Member

Nice, looks like the looping through each log is what is doing it, so hopefully the suggestion I made above should help?

@MCGallaspy
Copy link
Contributor Author

Yeah, I'll see if I can fiddle with the QuerySet.

@MCGallaspy
Copy link
Contributor Author

This is faster. Though I'll check also if evaluating the big queryset once is faster than evaluating several smaller querysets (as it is now).

@MCGallaspy
Copy link
Contributor Author

I'll check also if evaluating the big queryset once is faster than evaluating several smaller querysets

Tried a few variations, this is still the fastest.

@MCGallaspy
Copy link
Contributor Author

Here's the result for posterity, though perhaps as you say this could be reduced to one QuerySet with annotation:

Timer unit: 4.27654e-07 s

Total time: 4.00492 s
File: kalite\topic_tools\content_recommendation.py
Function: get_group_recommendations at line 112

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   112                                           @profile
   113                                           def get_group_recommendations(user):
   114                                               """Returns a list of exercises immediately tackled by other individuals in the same group."""
   115                                           
   116         1        23242  23242.0      0.2      recent_exercises = get_most_recent_exercises(user)
   117                                               
   118         1        26296  26296.0      0.3      user_list = FacilityUser.objects.filter(group=user.group)
   119                                           
   120         1       369336 369336.0      3.9      if recent_exercises:
   121                                           
   122                                                   #If the user has recently engaged with exercises, use these exercises as the basis for filtering
   123         1           64     64.0      0.0          user_exercises = ExerciseLog.objects\
   124         1        20651  20651.0      0.2              .filter(user__in=user_list).order_by("-latest_activity_timestamp")\
   125         1           21     21.0      0.0              .extra(select={'null_complete': "completion_timestamp is null"},
   126         1         9851   9851.0      0.1                     order_by=["-null_complete", "-completion_timestamp"])
   127                                               
   128         1           38     38.0      0.0          exercise_counts = collections.defaultdict(lambda: 0)
   129                                           
   130        22        26712   1214.2      0.3          for user in user_list:
   131        21      8680521 413358.1     92.7              logs = list(user_exercises.filter(user=user))
   132      2681        28819     10.7      0.3              for log, prev_log in itertools.izip(logs[1:], logs):
   133      2660       140679     52.9      1.5                  if log.exercise_id in recent_exercises:
   134      2427        28127     11.6      0.3                      exercise_counts[prev_log.exercise_id] += 1
   135                                           
   136       333         4691     14.1      0.1          exercise_counts = [{"exercise_id": key, "count": value} for key, value in exercise_counts.iteritems()]
   137                                           
   138                                               else:
   139                                                   #If not, only look at the group data
   140                                                   exercise_counts = ExerciseLog.objects.filter(user__in=user_list).values("exercise_id").annotate(count=Count("exercise_id"))
   141                                           
   142                                               #sort the results in order of highest counts to smallest
   143         1         1752   1752.0      0.0      sorted_counts = sorted(exercise_counts, key=lambda k:k['count'], reverse=False)
   144                                               
   145                                               #the final list of recommendations to return, WITHOUT counts
   146       333         4055     12.2      0.0      group_rec = [c['exercise_id'] for c in sorted_counts]
   147                                           
   148         1           11     11.0      0.0      return group_rec

@MCGallaspy
Copy link
Contributor Author

With the test failures, looks like the given elements have zero area -- my guess is that throws off the click action.

@MCGallaspy
Copy link
Contributor Author

This might be an exciting opportunity to write a test case for selenium!

The tests select and try to click a zero-area element -- it
doesn't do what's intended, notably load another page.
Instead, the intent is to click one of that element's
children with a non-zero area.
@rtibbles
Copy link
Member

This works for me.

rtibbles added a commit that referenced this pull request Nov 19, 2015
@rtibbles rtibbles merged commit ff7a099 into learningequality:0.15.x Nov 19, 2015
@rtibbles rtibbles removed the has PR label Nov 19, 2015
@MCGallaspy
Copy link
Contributor Author

Plan for merging into develop?

@rtibbles
Copy link
Member

Merged into develop already :)

@MCGallaspy
Copy link
Contributor Author

Fixed #4603.

@aronasorman aronasorman mentioned this pull request Dec 1, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants