Playing with the Netflix dataset

15 Feb 2018

I downloaded the Netflix dataset from Kaggle. It is a list of movie ratings entered by different customers, which Netflix provided for the Netflix Prize competition. The goal of the competition was to predict missing movie ratings.

So far, I have been playing with the data to get it into a convenient shape. My approach is:

Create a movieID column by copying custID and removing every entry that does not end with ‘:’ (in the raw files, each block of ratings is preceded by a line holding only the movie ID followed by ‘:’), then extend movieID down the column with ‘ffill’.
Create a sparse matrix with the movie ratings for each customer.
Create a SparseDataFrame from the sparse matrix.
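The three steps above can be sketched roughly as follows. This is only an illustration on a tiny in-memory sample, not my actual notebook: the column names, the sample rows and the use of `pd.factorize` to build row and column indices are assumptions made for the example. Also note that `SparseDataFrame` was removed in pandas 1.0, so the sketch uses `pd.DataFrame.sparse.from_spmatrix`, which plays the same role in recent versions.

```python
import io

import pandas as pd
from scipy import sparse

# Tiny stand-in for one of the Kaggle files (assumed layout): a "movieID:"
# header line, followed by "custID,rating,date" rows for that movie.
raw = io.StringIO(
    "1:\n"
    "1488844,3,2005-09-06\n"
    "822109,5,2005-05-13\n"
    "2:\n"
    "1488844,4,2005-10-01\n"
    "2059652,2,2005-09-05\n"
)
df = pd.read_csv(
    raw, header=None, names=["custID", "rating", "date"], dtype={"custID": str}
)

# Step 1: copy custID into movieID, keep only the entries ending with ':',
# then forward-fill so every rating row carries its movie ID.
is_header = df["custID"].str.endswith(":")
df["movieID"] = df["custID"].where(is_header).ffill().str.rstrip(":").astype(int)
df = df[~is_header].copy()  # the header rows themselves carry no rating
df["custID"] = df["custID"].astype(int)

# Step 2: sparse matrix with one row per customer and one column per movie.
# Missing ratings are simply not stored (they read back as 0).
cust_codes, cust_ids = pd.factorize(df["custID"], sort=True)
movie_codes, movie_ids = pd.factorize(df["movieID"], sort=True)
ratings = sparse.coo_matrix(
    (df["rating"].to_numpy(), (cust_codes, movie_codes)),
    shape=(len(cust_ids), len(movie_ids)),
).tocsr()

# Step 3: wrap the sparse matrix in a DataFrame with sparse columns.
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    ratings, index=cust_ids, columns=movie_ids
)
```

Here `ratings` has one row per distinct customer and one column per distinct movie, and `sparse_df` exposes the same data with customer IDs as the index and movie IDs as the column labels.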

An interesting question is how to reconcile ratings from different customers. One approach is to normalize the ratings for each customer by subtracting that customer’s mean rating and dividing by the standard deviation of that customer’s ratings. Also, as I don’t want to deal with pathological cases, I am going to remove all customers with a single rating.
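As a minimal sketch of that normalization, assuming the ratings sit in a long-format table with one row per (customer, movie) pair: the data below is made up, the column names are hypothetical, and the standard deviation is pandas’ default sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Made-up long-format ratings: one row per (customer, movie) pair.
long_df = pd.DataFrame(
    {
        "custID": [1, 1, 1, 2, 2, 3],
        "movieID": [10, 11, 12, 10, 12, 11],
        "rating": [5.0, 3.0, 4.0, 2.0, 4.0, 5.0],
    }
)

# Remove customers with a single rating: their standard deviation is undefined.
n_ratings = long_df.groupby("custID")["rating"].transform("count")
long_df = long_df[n_ratings > 1].copy()

# Subtract each customer's mean rating and divide by that customer's
# standard deviation (sample standard deviation, ddof=1).
by_cust = long_df.groupby("custID")["rating"]
long_df["normalized"] = (
    long_df["rating"] - by_cust.transform("mean")
) / by_cust.transform("std")
```

A customer who gives every movie the same rating has a standard deviation of zero, so the division does not produce a usable value; that is another pathological case that would need its own handling.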

Unfortunately, the SparseDataFrame object appears to be extremely slow for any sort of operation (subtract, mean, …). I ended up doing all of these operations before the conversion to a sparse dataframe.

[ pandas netflix ML ]
