Overview
This is a simple item-based collaborative filtering recommender system. It uses a subset of the MovieLens dataset provided by GroupLens.
Currently, the recommender generates a list of the top 10 recommendations for any user in the dataset. It does so in the following steps:
1. Read in the raw data tables: movies and ratings.
2. Merge the two tables on the movieId key and delete the timestamp field.
3. Generate a pivot table with the ratings as the values, the movie titles as the rows and the user IDs as the columns. Missing ratings are filled with 0.
4. Compute a correlation matrix that holds the correlation between every pair of movies.
5. Extract the movies that the selected user has already rated and sum their correlations with every other movie.
6. Sort the remaining movies from highest to lowest summed correlation and recommend the top 10.
Limitations:
The greatest limitation of this technique is the cold start problem: if a user has not rated any movies, no recommendation can be provided, because nothing is known about that user's preferences.
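One common workaround for the cold start problem, which is not part of this notebook, is to fall back to generally popular movies when a user has no ratings. The sketch below is a minimal illustration under that assumption; the function name `popular_fallback`, the `min_ratings` threshold and the toy data are all hypothetical.

```python
import pandas as pd

def popular_fallback(ratings, min_ratings=2, top_n=10):
    """Return the best-rated titles among those with at least min_ratings ratings."""
    # Mean rating and number of ratings per title
    stats = ratings.groupby('title')['rating'].agg(['mean', 'count'])
    # Ignore titles with too few ratings to trust their mean
    eligible = stats[stats['count'] >= min_ratings]
    # Highest mean first, ties broken by number of ratings
    ranked = eligible.sort_values(['mean', 'count'], ascending=False)
    return ranked.head(top_n).index.tolist()

# Hypothetical toy data: C has a high rating but only one vote
toy_ratings = pd.DataFrame({
    'title':  ['A', 'A', 'B', 'B', 'B', 'C'],
    'rating': [5.0, 4.0, 3.0, 3.0, 3.0, 5.0],
})
fallback_titles = popular_fallback(toy_ratings)
print(fallback_titles)
```

A fallback like this is not personalized; it only guarantees that a new user sees something until they have rated a few movies.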
Next Steps:
The ultimate goal is to fill in the missing ratings for all users by predicting the rating each user would most likely give to the movies they have not watched. This can be accomplished in several ways:
1. Create buckets of correlation values that map to ratings (C <= 0 maps to a rating of 0, 0 < C < 0.1 maps to a rating of 0.5, etc.)
2. Use machine learning algorithms to predict the missing ratings (kNN, Decision Trees/Random Forests, Clustering)
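The bucket idea in option 1 could be sketched as follows. Only the first two buckets come from the description above; the remaining edges (steps of 0.1 up to 0.9, each adding half a rating point) and the handling of values that fall exactly on an edge are assumptions made for illustration.

```python
import numpy as np

# Assumed bucket edges 0.0, 0.1, ..., 0.9; only the first two are given in the text
edges = np.linspace(0.0, 0.9, 10)

def correlation_to_rating(c):
    """Map correlation values to ratings in steps of 0.5 (0.0 up to 5.0)."""
    # right=True places a value equal to an edge in the lower bucket, so C <= 0 maps to 0
    bucket = np.digitize(c, edges, right=True)
    return bucket * 0.5

example = correlation_to_rating(np.array([-0.2, 0.05, 0.15, 0.95]))
print(example)
```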
For the next iteration of this system, I will use a machine learning algorithm to make the predictions. Although all of the algorithms mentioned above are worth trying, I will most likely start with kNN.
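As a rough idea of what a kNN prediction could look like, the sketch below predicts one missing rating as the similarity-weighted average of the user's ratings for the k most similar movies. This is one possible formulation, not the planned implementation: the choice of cosine similarity, the function name `predict_rating_knn` and the toy matrix are assumptions.

```python
import numpy as np

def predict_rating_knn(ratings, user, item, k=2):
    """ratings is a 2-D array (movies x users) in which 0 means 'not rated'."""
    # Movies this user has rated, excluding the target movie
    rated = np.flatnonzero(ratings[:, user] > 0)
    rated = rated[rated != item]
    if rated.size == 0:
        return None  # cold start: nothing to base a prediction on
    # Cosine similarity between the target movie and each rated movie
    target = ratings[item]
    norms = np.linalg.norm(ratings[rated], axis=1) * np.linalg.norm(target)
    sims = np.divide(ratings[rated] @ target, norms,
                     out=np.zeros(rated.size), where=norms > 0)
    # Keep the k most similar movies
    top = np.argsort(sims)[::-1][:k]
    weights = sims[top]
    if weights.sum() <= 0:
        return None
    # Similarity-weighted average of the user's ratings for those movies
    return float(weights @ ratings[rated[top], user] / weights.sum())

# Hypothetical toy data: 4 movies (rows) x 4 users (columns)
toy = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 1, 0, 5],
    [5, 0, 4, 1],
], dtype=float)
prediction = predict_rating_knn(toy, user=1, item=3, k=2)
print(prediction)
```

Here user 1 has not rated movie 3, and the two nearest movies are the ones that user rated 4 and 5, so the prediction falls between those two values.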
Measuring the accuracy of the predicted ratings is also crucial for establishing the credibility of the recommendation system. This can be done by using the full MovieLens dataset and calculating precision and recall.
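For a top-N recommender, precision and recall are usually computed on the first k recommendations against a held-out set of movies the user actually liked. A minimal sketch, assuming the recommendations are an ordered list of titles and the relevant movies are a set; the function name and the toy titles are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision and recall of the first k recommendations against a set of relevant titles."""
    top_k = recommended[:k]
    # Count recommended titles that the user actually liked
    hits = sum(1 for title in top_k if title in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 2 of the 4 recommendations are relevant, out of 3 relevant titles
precision, recall = precision_recall_at_k(['A', 'B', 'C', 'D'], {'B', 'D', 'E'}, k=4)
print(precision, recall)
```

What counts as "relevant" (for example, a held-out rating at or above some threshold) is a design decision that has to be fixed before the numbers mean anything.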
# Import libraries
import numpy as np
import pandas as pd
# Read movies data
movies_df = pd.read_csv('movies.csv')
# Display first few rows of movies dataframe
movies_df.head()
   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy
# Read ratings data
ratings_df = pd.read_csv('ratings.csv')
# Display first few rows of ratings dataframe
ratings_df.head()
   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205
# Join the movies and ratings tables
movies_ratings_df = pd.merge(movies_df, ratings_df, on = 'movieId')
# Remove timestamp column
del movies_ratings_df['timestamp']
# Display first few rows of merged movies and ratings dataframe
movies_ratings_df.head()
   movieId             title                                       genres  userId  rating
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       7     3.0
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       9     4.0
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      13     5.0
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      15     2.0
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      19     3.0
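Two details of the merge step are easy to miss: `pd.merge` performs an inner join by default, so movies without any ratings drop out, and it silently duplicates rows if `movieId` is not unique in the movies table. The sketch below shows the same step on hypothetical toy tables, with pandas' `validate` argument added as an optional safety check.

```python
import pandas as pd

# Hypothetical toy tables with the same column names as the real data
movies = pd.DataFrame({'movieId': [1, 2], 'title': ['A', 'B']})
ratings = pd.DataFrame({
    'userId': [7, 9, 7],
    'movieId': [1, 1, 2],
    'rating': [3.0, 4.0, 5.0],
    'timestamp': [100, 200, 300],
})

# validate='one_to_many' raises an error if movieId is duplicated in movies
merged = pd.merge(movies, ratings, on='movieId', validate='one_to_many')
# Drop the timestamp column without modifying the frame in place
merged = merged.drop(columns='timestamp')
print(merged)
```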
# Convert movieId, userId and rating to numeric
movies_ratings_df[['movieId','userId','rating']] = movies_ratings_df[['movieId','userId','rating']].apply(pd.to_numeric)
# Convert table to pivot table
movies_ratings_pivot = movies_ratings_df.pivot_table(values = 'rating', index = 'title', columns = 'userId')
# Replace NaNs with 0s
movies_ratings_pivot.fillna(0, inplace = True)
# Display first few rows and columns of movies and ratings pivot table
movies_ratings_pivot.head()
userId                                     1    2    3    4    5    6    7    8    9   10  ...  662  663  664  665  666  667  668  669  670  671
title
"Great Performances" Cats (1998)         0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
$9.99 (2008)                             0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
'Hellboy': The Seeds of Creation (2004)  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
'Neath the Arizona Skies (1934)          0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
'Round Midnight (1986)                   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
5 rows × 671 columns
# Extract and save movie titles
movie_titles = movies_ratings_pivot.index
# Define correlation matrix to get correlation between movies
corr_matrix = np.corrcoef(movies_ratings_pivot)
# Display correlation matrix
corr_matrix
array([[ 1. , -0.00289353, -0.00171802, ..., -0.00940622,
-0.00171802, -0.00171802],
[-0.00289353, 1. , -0.00251377, ..., 0.04586879,
-0.00251377, -0.00251377],
[-0.00171802, -0.00251377, 1. , ..., -0.00817171,
-0.00149254, -0.00149254],
...,
[-0.00940622, 0.04586879, -0.00817171, ..., 1. ,
-0.00817171, -0.00817171],
[-0.00171802, -0.00251377, -0.00149254, ..., -0.00817171,
1. , -0.00149254],
[-0.00171802, -0.00251377, -0.00149254, ..., -0.00817171,
-0.00149254, 1. ]])
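Because missing ratings were replaced with 0 before computing the correlations, an unrated movie is treated as if the user had given it the lowest possible score. An alternative worth comparing, not used in this notebook, is to keep the NaNs and let pandas compute each correlation only over the users who rated both movies. The sketch below contrasts the two on hypothetical toy data; the `min_periods=2` threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: user 3 has not rated movie C
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'userId': [1, 2, 3, 1, 2, 3, 1, 2],
    'rating': [5.0, 3.0, 1.0, 4.0, 3.0, 2.0, 1.0, 5.0],
})
pivot = toy.pivot_table(values='rating', index='title', columns='userId')

# Approach used in this notebook: fill missing ratings with 0, then correlate rows
zero_filled = np.corrcoef(pivot.fillna(0))

# Alternative: correlate columns of the transposed table, skipping missing pairs
pairwise = pivot.T.corr(min_periods=2)

print(zero_filled[0, 2], pairwise.loc['A', 'C'])
```

In this toy data the two approaches disagree on the sign of the correlation between A and C: zero-filling yields a positive value, while the pairwise version is negative because the two users who rated both movies disagreed about them.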
# Define function to provide top recommendations
def recommender(movie_names):
    # Initialize recommendation scores, one per movie in the correlation matrix
    recommendations = np.zeros(corr_matrix.shape[0])
    # Loop through the movie titles that the user has rated
    for movie in movie_names:
        # Add this movie's row of correlations to the running sum
        recommendations = recommendations + corr_matrix[movie_titles.get_loc(movie)]
    # Convert recommendations into dataframe
    recommendations_df = pd.DataFrame({
        'Title': movie_titles,
        'Recommendation': recommendations})
    # Remove movie titles that the user has already rated
    recommendations_df = recommendations_df[~(recommendations_df.Title.isin(movie_names))]
    # Sort from most correlated to least correlated
    recommendations_df = recommendations_df.sort_values(by=['Recommendation'], ascending = False)
    return recommendations_df
# Define user
user = 1
# Get list of movies user has rated
user_movies = movies_ratings_df[movies_ratings_df.userId == user].title.tolist()
print('Movies rated by user\n')
print(user_movies)
# Get list of recommendations
recommendations = recommender(user_movies)
# Print out top 10 recommendations
print('\nRecommendations for user\n')
print(recommendations.Title.head(10))
Movies rated by user
['Dangerous Minds (1995)', 'Dumbo (1941)', 'Sleepers (1996)', 'Escape from New York (1981)', 'Cinema Paradiso (Nuovo cinema Paradiso) (1989)', 'Deer Hunter, The (1978)', 'Ben-Hur (1959)', 'Gandhi (1982)', "Dracula (Bram Stoker's Dracula) (1992)", 'Cape Fear (1991)', 'Star Trek: The Motion Picture (1979)', 'Beavis and Butt-Head Do America (1996)', 'French Connection, The (1971)', 'Tron (1982)', 'Gods Must Be Crazy, The (1980)', 'Willow (1988)', 'Antz (1998)', 'Fly, The (1986)', 'Time Bandits (1981)', 'Blazing Saddles (1974)']
Recommendations for user
2831 Fisher King, The (1991)
8450 Unforgiven (1992)
4406 King Kong (1933)
4203 Jaws (1975)
8473 Untouchables, The (1987)
2700 Fantasia (1940)
6471 Raising Arizona (1987)
6211 Player, The (1992)
2776 Field of Dreams (1989)
6665 Risky Business (1983)
Name: Title, dtype: object
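The recommender above adds one row of the correlation matrix per rated title inside a Python loop. The same sum can be taken in a single indexing step, which may matter once users have rated many movies. A minimal sketch on hypothetical toy data follows; the function name `recommend_vectorized` is an assumption, and it relies on the titles being unique, as they are in a pivot table index.

```python
import numpy as np
import pandas as pd

def recommend_vectorized(rated_titles, titles, corr, top_n=10):
    """Score every movie by its summed correlation with the rated titles."""
    # Row positions of the rated titles; -1 marks titles missing from the matrix
    idx = titles.get_indexer(rated_titles)
    idx = idx[idx >= 0]
    # Sum the selected rows in one step instead of looping
    scores = pd.Series(corr[idx].sum(axis=0), index=titles)
    # Remove already-rated titles, then sort from highest to lowest score
    return scores.drop(titles[idx]).sort_values(ascending=False).head(top_n)

# Hypothetical 4-movie correlation matrix
toy_titles = pd.Index(['A', 'B', 'C', 'D'])
toy_corr = np.array([
    [ 1.0, 0.8, -0.2, 0.1],
    [ 0.8, 1.0,  0.0, 0.3],
    [-0.2, 0.0,  1.0, 0.5],
    [ 0.1, 0.3,  0.5, 1.0],
])
top = recommend_vectorized(['A', 'B'], toy_titles, toy_corr)
print(top)
```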