This notebook implements a recommender system based on clustering users by their anime ratings.
-
Anime with unknown ratings have been removed from the data.
-
The data were cleaned using Chauvenet's criterion:
In the ratings table, the criterion flagged as outliers the users who had watched an unusually large number of anime; they don’t need a recommendation system, since they already understand the topic better than any algorithm.
In the anime table, the criterion flagged as outliers the anime with unusually high or unusually low ratings, and it also excluded the most popular anime.
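A minimal sketch of Chauvenet's criterion, assuming the values are roughly normally distributed: a value is rejected when the expected number of observations at least that far from the mean, out of `n` draws, is below 0.5. The function name `chauvenet_mask` and the sample data are hypothetical; the notebook's own cleaning code may differ.

```python
import math

import numpy as np


def chauvenet_mask(values):
    """Return a boolean mask that is True for values kept by Chauvenet's criterion."""
    x = np.asarray(values, dtype=float)
    n = x.size
    mean, std = x.mean(), x.std(ddof=1)
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return np.ones(n, dtype=bool)
    z = np.abs(x - mean) / std
    # Two-sided tail probability of a standard normal: P(|Z| >= z) = erfc(z / sqrt(2)).
    tail = np.array([math.erfc(v / math.sqrt(2)) for v in z])
    # Keep a value only if at least 0.5 such observations are expected among n draws.
    return n * tail >= 0.5


# Hypothetical example: number of anime watched by each of ten users.
watched_counts = [12, 15, 9, 20, 18, 14, 11, 16, 13, 400]
keep = chauvenet_mask(watched_counts)
print(keep)
```

The criterion can be applied repeatedly, but each pass changes the mean and standard deviation, so a single pass is the more conservative choice.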
After compiling the one-hot table, we have the following tables:
- Anime genres: the rows are anime, the columns are genres, and each cell is 0 or 1.
- Ratings: the rows are user entries, and the columns are the anime id and the rating.
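The one-hot genre table can be sketched as follows. This assumes that the anime table stores genres as a comma-separated string in a `genre` column indexed by `anime_id`; those column names and the toy rows are assumptions, not taken from the notebook.

```python
import pandas as pd

# Hypothetical anime table with comma-separated genres.
anime = pd.DataFrame(
    {
        "anime_id": [1, 2, 3],
        "genre": ["Action, Comedy", "Drama", "Action, Drama"],
    }
).set_index("anime_id")

# One column per genre, 1 if the anime has that genre and 0 otherwise.
genres_onehot = anime["genre"].str.get_dummies(sep=", ")
print(genres_onehot)
```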
We need one more table, in which the rows are users, the columns are anime genres, and each cell holds the average rating that the user gives to that genre.
The average genre rating is calculated as follows: all of a user's ratings are taken and grouped by genre, and for each genre the average rating that the user gives to anime with that genre is computed. All NaNs (genres for which the user has no ratings) are replaced with -1.
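One possible way to build the user-by-genre table with pandas is sketched below. The frame and column names (`ratings`, `genres_onehot`, `user_id`, `anime_id`, `rating`) and the toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical one-hot genre table (rows: anime, columns: genres).
genres_onehot = pd.DataFrame(
    {"Action": [1, 0, 1], "Comedy": [1, 0, 0], "Drama": [0, 1, 1]},
    index=pd.Index([1, 2, 3], name="anime_id"),
)
# Hypothetical ratings table (one row per user rating).
ratings = pd.DataFrame(
    {"user_id": [10, 10, 11], "anime_id": [1, 2, 3], "rating": [8, 6, 9]}
)

# Attach the genre flags to every rating row.
merged = ratings.merge(genres_onehot, left_on="anime_id", right_index=True)
genre_cols = list(genres_onehot.columns)

# Keep the rating where the anime has the genre and NaN elsewhere,
# so that the mean for a genre ignores anime without that genre.
masked = merged[genre_cols].where(merged[genre_cols] == 1).mul(merged["rating"], axis=0)

# Average per user and genre; genres the user never rated become -1.
user_genre = masked.groupby(merged["user_id"]).mean().fillna(-1)
print(user_genre)
```

In this toy data, user 11 has rated no Comedy anime, so that cell ends up as -1.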
For clustering, the features (scores) were scaled using `MinMaxScaler`.
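A short sketch of the scaling step with scikit-learn's `MinMaxScaler`, which maps each column to the range [0, 1] by default. The input array is a made-up user-by-genre table that includes the -1 placeholder.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical user-by-genre averages; -1 marks genres a user never rated.
user_genre = np.array(
    [
        [8.0, 8.0, 6.0],
        [9.0, -1.0, 9.0],
        [5.0, 3.0, -1.0],
    ]
)

scaler = MinMaxScaler()  # default feature_range=(0, 1)
scaled = scaler.fit_transform(user_genre)
print(scaled)
```

The fitted `scaler` should be kept so that a new user's ratings can later be transformed in the same way before predicting a cluster.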
It is logical that we want the points to lie close to the centers of their clusters. But here's the problem: the minimum of such a functional is reached when there are as many clusters as points (i.e., each point is a cluster of one element). To solve this problem (choosing the number of clusters), the following heuristic is often used: choose the number of clusters starting from which the functional $ J(C) $ no longer decreases quickly.
In the case of KMeans, $$ J(C) = \sum_{k=1}^K\sum_{i~\in~C_k} ||x_i - \mu_k||^2 \rightarrow \min\limits_C,$$ that is, the sum of the squares of the distances from the points to the centroids of the clusters to which they belong.
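The heuristic can be sketched with scikit-learn, whose `KMeans.inertia_` attribute reports the sum of squared distances from the samples to their closest cluster center. The data here are synthetic blobs from `make_blobs`, used only as a stand-in for the scaled user-by-genre features; the range of `k` is also an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled feature table.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # value of J(C) for this k

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against `k` and looking for the point where the curve flattens gives the candidate number of clusters.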
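First, the silhouette is determined separately for each object. Let $a$ be the average distance from the object to the other objects in its own cluster, and let $b$ be the average distance from the object to the objects in the nearest other cluster. The silhouette of the object is then $$ s = \frac{b - a}{\max(a, b)}. $$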
The silhouette of a sample is the average value of the silhouette of objects in a given sample. Thus, the silhouette shows how much the average distance to objects in its cluster differs from the average distance to objects in other clusters.
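As an illustration, the sample silhouette can be computed with scikit-learn's `silhouette_score` for several values of `k`. The data are again synthetic blobs standing in for the real features, and the range of `k` is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature table.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all objects

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Values lie between -1 and 1, and higher values indicate tighter, better separated clusters.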
After analyzing the graphs, the value for the number of clusters was chosen to be k=4.
Heatmap of cluster centers:
To recommend something to a user, we need their ratings for anime genres and their viewing history, so that anime they have already watched are not recommended.
Using the user's genre ratings, we predict which of the 4 clusters the user belongs to.
-
The most popular genres in that cluster are determined.
-
A hyperparameter `combo` is set, which controls how many of the top genres a particular anime must have.
-
Anime in which the number of top genres is greater than `combo` are selected from the data.
Finally, to produce the recommendation, the anime already watched by the user are removed from the resulting list, and the top 5 by rating are selected from what remains.
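The selection steps can be sketched as a single function. This is an illustrative sketch, not the notebook's code: the function name `recommend`, the parameter `n_top_genres`, and all toy data are assumptions. The cluster center is assumed to be the row of `KMeans.cluster_centers_` for the cluster predicted from the user's scaled genre ratings, wrapped in a Series indexed by genre.

```python
import pandas as pd


def recommend(cluster_center, genres_onehot, anime_ratings, watched_ids,
              n_top_genres=3, combo=1, n_recs=5):
    """Pick the top-rated unwatched anime that match the cluster's top genres."""
    # Most popular genres in the cluster (largest center coordinates).
    top_genres = cluster_center.nlargest(n_top_genres).index
    # Number of top genres each anime has.
    hits = genres_onehot[top_genres].sum(axis=1)
    # Keep anime with more than `combo` top genres.
    candidates = hits[hits > combo].index
    # Drop anime the user has already watched.
    candidates = candidates.difference(pd.Index(watched_ids))
    # Return the ids of the best-rated remaining anime.
    return anime_ratings.loc[candidates].nlargest(n_recs).index.tolist()


# Hypothetical data: six anime, four genres.
genres_onehot = pd.DataFrame(
    {
        "Action": [1, 1, 1, 0, 1, 0],
        "Comedy": [1, 1, 0, 1, 1, 0],
        "Drama": [0, 1, 0, 0, 0, 1],
        "Horror": [0, 0, 0, 0, 1, 0],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="anime_id"),
)
anime_ratings = pd.Series(
    [7.5, 8.2, 9.0, 6.0, 7.9, 8.8], index=genres_onehot.index, name="rating"
)
cluster_center = pd.Series({"Action": 0.9, "Comedy": 0.7, "Drama": 0.2, "Horror": 0.1})

recs = recommend(cluster_center, genres_onehot, anime_ratings,
                 watched_ids=[2], n_top_genres=2, combo=1)
print(recs)
```

With `n_top_genres=2` and `combo=1`, only anime that have both top genres qualify, so a larger `combo` gives fewer but more focused recommendations.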