Collaborative Filtering Based Recommendation System

Overview

This is a basic collaborative filtering recommender system. It uses a subset of the MovieLens dataset provided by GroupLens.
Currently, the recommender generates a list of the top 10 recommendations for any user in the dataset. It does so in the following steps:
1. Read in the raw data tables: movies and ratings
2. Merge the two tables on the movieId key and delete the timestamp field
3. Generate a pivot table with the ratings as the values, the movie titles as the rows and the user IDs as the columns
4. Generate a correlation matrix that holds the correlation between every pair of movies
5. Extract the movies that the selected user has already rated and sum their correlations with every other movie
6. Sort the remaining movies from highest to lowest summed correlation and recommend the top 10
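
Steps 3 to 6 above can be sketched on a tiny made-up dataset. The ratings, titles ('A', 'B', 'C') and variable names below are illustrative assumptions, not MovieLens data; the calls mirror the ones used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings in long form (one row per user/movie pair)
ratings = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2, 3, 4, 4, 4],
    'title':  ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'rating': [5.0, 4.0, 1.0, 4.0, 5.0, 5.0, 1.0, 2.0, 4.0],
})

# Step 3: titles as rows, users as columns, unrated cells filled with 0
pivot = ratings.pivot_table(values='rating', index='title', columns='userId').fillna(0)

# Step 4: np.corrcoef treats each row (movie) as one variable
corr = np.corrcoef(pivot.to_numpy())
titles = pivot.index

# Step 5: sum the correlation rows of the movies the user has rated
seen = ['A']
scores = pd.Series(corr[titles.get_indexer(seen)].sum(axis=0), index=titles)

# Step 6: drop the rated movies and sort by summed correlation
top = scores.drop(seen).sort_values(ascending=False)
print(top)
```

In this toy data, 'B' is rated much like 'A' by the same users, so it ranks above 'C'.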

Limitations:
The greatest limitation of this technique is the cold start problem: if a user has not rated any movies, the system knows nothing about the user's preferences and cannot provide a recommendation.
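
One common workaround, which this notebook does not implement, is to fall back to generally popular movies when a user has no ratings. The sketch below is only an illustration under that assumption: the function name `popular_fallback`, the `min_ratings` threshold and the toy data are all hypothetical.

```python
import pandas as pd

def popular_fallback(ratings, n=10, min_ratings=2):
    """Return up to n titles ranked by mean rating, ignoring rarely rated titles."""
    stats = ratings.groupby('title')['rating'].agg(['mean', 'count'])
    # Drop titles with too few ratings for the mean to be meaningful
    stats = stats[stats['count'] >= min_ratings]
    ranked = stats.sort_values(['mean', 'count'], ascending=False)
    return ranked.head(n).index.tolist()

# Hypothetical toy ratings: 'B' has the highest mean but only one rating
toy = pd.DataFrame({
    'userId': [1, 2, 3, 1, 2, 3],
    'title':  ['A', 'A', 'B', 'C', 'C', 'C'],
    'rating': [5.0, 4.0, 5.0, 3.0, 2.0, 4.0],
})
fallback = popular_fallback(toy, n=2)
print(fallback)
```

A fallback like this is not personalised, so it only covers the case until the user has rated a few movies.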

Next Steps:
The ultimate goal is to fill in the missing ratings for all users by predicting the rating each user would most likely give to the movies they have not rated. This can be accomplished in several ways:
1. Create buckets of correlation values that represent ratings (C <= 0 represents a rating of 0, 0 < C < 0.1 represents a rating of 0.5, etc.)
2. Use machine learning algorithms to predict the missing ratings (kNN, Decision Trees/Random Forests, Clustering)
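
The bucketing idea in option 1 could look roughly like the sketch below. Only the first two buckets come from the text above; extending them in equal 0.1-wide steps up to a rating of 5.0 is an assumption made here for illustration, as are the names `edges` and `correlation_to_rating`.

```python
import numpy as np

# Assumed bucket edges: 0.0, 0.1, ..., 0.9 (only the first two buckets are from the text)
edges = np.arange(10) / 10

def correlation_to_rating(c):
    """Map correlation values to ratings in steps of 0.5 (0.0 up to 5.0)."""
    # right=True puts c <= 0 in bucket 0 and 0 < c <= 0.1 in bucket 1, and so on
    return np.digitize(c, edges, right=True) * 0.5

example = correlation_to_rating(np.array([-0.2, 0.0, 0.05, 0.35, 0.95]))
print(example)
```

With these assumed edges, a correlation of exactly 0.1 falls in the 0.5 bucket; the text leaves that boundary case open.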

For the next iteration of this system, I will use a machine learning algorithm to make the predictions. Although all of the algorithms mentioned above are worth trying, I would most likely start with kNN.
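
As a rough idea of what a kNN-style prediction might look like, the sketch below predicts one missing rating as the similarity-weighted average of the user's ratings on the k most similar movies. It is a minimal NumPy illustration, not the planned implementation: the function `knn_predict`, the toy `ratings` and `sim` arrays, and the choice to clip negative similarities to zero are all assumptions.

```python
import numpy as np

def knn_predict(ratings, sim, user, item, k=2):
    """Predict ratings[item, user] from the user's k most similar rated items.

    ratings: items x users array where 0 means unrated.
    sim: item x item similarity matrix (for example, a correlation matrix).
    """
    # Items this user has rated, excluding the target item
    rated = np.flatnonzero(ratings[:, user] > 0)
    rated = rated[rated != item]
    if rated.size == 0:
        return float('nan')  # cold start: nothing to base a prediction on
    # Keep the k rated items most similar to the target item
    order = np.argsort(sim[item, rated])[::-1][:k]
    neighbours = rated[order]
    # Negative similarities are clipped to zero so they carry no weight
    weights = np.clip(sim[item, neighbours], 0, None)
    if weights.sum() == 0:
        return float(ratings[neighbours, user].mean())
    return float(np.dot(weights, ratings[neighbours, user]) / weights.sum())

# Hypothetical toy data: 3 movies x 3 users, user 2 has not rated movie 0
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 4.0],
    [1.0, 2.0, 1.0],
])
sim = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
])
pred = knn_predict(ratings, sim, user=2, item=0, k=2)
print(round(pred, 2))
```

Here the prediction is pulled toward the user's rating of movie 1, since that movie is far more similar to movie 0 than movie 2 is.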

Measuring the accuracy of the predicted ratings is also crucial for establishing the credibility of the recommendation system. This can be done by using the full MovieLens dataset and calculating precision and recall.
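
For a top-N recommender, precision and recall are often computed on the first k recommendations against a held-out set of movies the user actually liked. The sketch below assumes that setup; the function name `precision_recall_at_k` and the example lists are hypothetical, and how "relevant" is defined (for instance, held-out ratings above some threshold) is left open.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision and recall of the first k recommendations.

    recommended: ordered list of titles, best first.
    relevant: collection of titles the user is known to like (held out from training).
    """
    top_k = list(recommended)[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: one of the top 3 recommendations is relevant
p, r = precision_recall_at_k(['A', 'B', 'C', 'D'], {'B', 'D', 'E'}, k=3)
print(p, r)
```

Precision is the share of recommended movies that were relevant; recall is the share of relevant movies that were recommended.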

# Import libraries
import numpy as np
import pandas as pd  
# Read movies data
movies_df = pd.read_csv('movies.csv')

# Display first few rows of movies dataframe
movies_df.head()
   movieId  title                               genres
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                      Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)             Comedy|Romance
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)  Comedy
# Read ratings data
ratings_df = pd.read_csv('ratings.csv')

# Display first few rows of ratings dataframe
ratings_df.head()
   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205
# Join the movies and ratings tables
movies_ratings_df = pd.merge(movies_df, ratings_df, on = 'movieId')

# Remove timestamp column
del movies_ratings_df['timestamp']

# Display first few rows of merged movies and ratings dataframe
movies_ratings_df.head()
   movieId  title             genres                                       userId  rating
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       7     3.0
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       9     4.0
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      13     5.0
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      15     2.0
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      19     3.0
# Convert movieId, userId and rating to numeric
movies_ratings_df[['movieId','userId','rating']] = movies_ratings_df[['movieId','userId','rating']].apply(pd.to_numeric)

# Convert table to pivot table
movies_ratings_pivot = movies_ratings_df.pivot_table(values = 'rating', index = 'title', columns = 'userId')  

# Replace NaNs with 0s
movies_ratings_pivot.fillna(0, inplace = True)

# Display first few rows and columns of movies and ratings pivot table
movies_ratings_pivot.head()
userId                                       1    2    3    4    5    6    7    8    9   10  ...  662  663  664  665  666  667  668  669  670  671
title
"Great Performances" Cats (1998)           0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
$9.99 (2008)                               0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
'Hellboy': The Seeds of Creation (2004)    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
'Neath the Arizona Skies (1934)            0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
'Round Midnight (1986)                     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

5 rows × 671 columns

# Extract and save movie titles
movie_titles = movies_ratings_pivot.index

# Define correlation matrix to get correlation between movies
corr_matrix = np.corrcoef(movies_ratings_pivot)

# Display correlation matrix
corr_matrix
array([[ 1.        , -0.00289353, -0.00171802, ..., -0.00940622,
        -0.00171802, -0.00171802],
       [-0.00289353,  1.        , -0.00251377, ...,  0.04586879,
        -0.00251377, -0.00251377],
       [-0.00171802, -0.00251377,  1.        , ..., -0.00817171,
        -0.00149254, -0.00149254],
       ..., 
       [-0.00940622,  0.04586879, -0.00817171, ...,  1.        ,
        -0.00817171, -0.00817171],
       [-0.00171802, -0.00251377, -0.00149254, ..., -0.00817171,
         1.        , -0.00149254],
       [-0.00171802, -0.00251377, -0.00149254, ..., -0.00817171,
        -0.00149254,  1.        ]])
# Define function to provide top recommendations
def recommender(movie_names):
    """Rank movies by their summed correlation with the movies in movie_names."""
    # Initialize one score per movie in the correlation matrix
    recommendations = np.zeros(corr_matrix.shape[0])
    # Loop through the movie titles that the user has rated
    for movie in movie_names:
        # Add this movie's row of correlations to the running scores
        recommendations = recommendations + corr_matrix[movie_titles.get_loc(movie)]
    # Convert recommendations into dataframe
    recommendations_df = pd.DataFrame({
            'Title': movie_titles,
            'Recommendation': recommendations})
    # Remove movie titles that the user has already rated
    recommendations_df = recommendations_df[~(recommendations_df.Title.isin(movie_names))]
    # Sort from most correlated to least correlated
    recommendations_df = recommendations_df.sort_values(by=['Recommendation'], ascending = False)
    return recommendations_df
# Define user
user = 1

# Get list of movies user has rated
user_movies = movies_ratings_df[movies_ratings_df.userId == user].title.tolist()  
print('Movies rated by user\n')
print(user_movies)
# Get list of recommendations
recommendations = recommender(user_movies)

# Print out top 10 recommendations
print('\nRecommendations for user\n')
print(recommendations.Title.head(10))
Movies rated by user

['Dangerous Minds (1995)', 'Dumbo (1941)', 'Sleepers (1996)', 'Escape from New York (1981)', 'Cinema Paradiso (Nuovo cinema Paradiso) (1989)', 'Deer Hunter, The (1978)', 'Ben-Hur (1959)', 'Gandhi (1982)', "Dracula (Bram Stoker's Dracula) (1992)", 'Cape Fear (1991)', 'Star Trek: The Motion Picture (1979)', 'Beavis and Butt-Head Do America (1996)', 'French Connection, The (1971)', 'Tron (1982)', 'Gods Must Be Crazy, The (1980)', 'Willow (1988)', 'Antz (1998)', 'Fly, The (1986)', 'Time Bandits (1981)', 'Blazing Saddles (1974)']

Recommendations for user

2831     Fisher King, The (1991)
8450           Unforgiven (1992)
4406            King Kong (1933)
4203                 Jaws (1975)
8473    Untouchables, The (1987)
2700             Fantasia (1940)
6471      Raising Arizona (1987)
6211          Player, The (1992)
2776      Field of Dreams (1989)
6665       Risky Business (1983)
Name: Title, dtype: object
