Monday, July 27, 2009

If you hated this, you'll love that

Most internet shoppers are familiar with recommendations of the form "users who liked X also liked Y". Most Netflix Prize competitors are familiar with the k-nearest-neighbors algorithm as the classic and most typical implementation of such a system. One computes the correlations between all pairs of movies and then predicts a user's reaction to a movie based on the user's available ratings of movies most correlated with that movie.
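The pairwise correlation step can be sketched roughly as follows. This is a minimal illustration, not the exact Netflix Prize pipeline: it assumes each movie's ratings are stored as a dict mapping user id to rating (a hypothetical layout), and computes the Pearson correlation over users who rated both movies.

```python
import numpy as np

def movie_correlation(ratings_a, ratings_b):
    """Pearson correlation between two movies, computed over the
    users who rated both. ratings_a and ratings_b map user id -> rating
    (a hypothetical data layout chosen for this sketch)."""
    common = ratings_a.keys() & ratings_b.keys()
    if len(common) < 2:
        return 0.0  # too few co-raters to estimate a correlation
    common = sorted(common)
    a = np.array([ratings_a[u] for u in common], dtype=float)
    b = np.array([ratings_b[u] for u in common], dtype=float)
    if a.std() == 0.0 or b.std() == 0.0:
        return 0.0  # a constant rating vector has no defined correlation
    return float(np.corrcoef(a, b)[0, 1])
```

In practice one would also shrink correlations estimated from few co-raters toward zero, but that refinement is omitted here.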

It was interesting but ultimately not that surprising to me to find that there were also some strong negative correlations between many pairs of movies. Yehuda Koren's recently developed neighborhood algorithm infers predictive relationships between all possible pairs of movies and therefore takes into account negatively correlated pairs as well as positively correlated pairs, but there's very little emphasis on the negative correlations in the presentation of the method. I doubt I'm the only person to have observed the strength of the negative correlations, but I haven't seen them discussed much, so I thought I'd mention a few of my findings (I'll refer to the correlation level as "rho").

For instance, Titanic is positively correlated with Ghost (rho = 0.245) and Pearl Harbor (rho = 0.238) but negatively correlated with Fight Club (rho = -0.190) and Lost in Translation (rho = -0.189). Harry Potter and the Sorcerer's Stone is positively correlated with Star Wars: The Phantom Menace (rho = 0.152) but negatively correlated with Taxi Driver (rho = -0.142) and Pulp Fiction (rho = -0.138). Saving Private Ryan is positively correlated with Braveheart (rho = 0.169) and Platoon (rho = 0.168) but negatively correlated with Sex and the City, Season II (rho = -0.135), The Rocky Horror Picture Show (rho = -0.135), and Dirty Dancing (rho = -0.1347).

I implemented the "negative" version of a nearest neighbor algorithm (a "furthest opposites" algorithm, if you will) which relied only on negative correlations. It achieved an RMSE of 0.9562. That score is fairly competitive with the 0.9513 that Netflix's own algorithm, Cinematch, had achieved prior to the start of the competition, and I suspect I could have improved it further if I had done more than minimal tuning. I would have loved to see Netflix run a recommendation system which generated predictions with reasoning like "since you hated Armageddon and Lethal Weapon 3, you'll probably love Being John Malkovich".
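One plausible way to build such a predictor, sketched below under my own assumptions rather than as the exact method described above: for a target movie, take the user's rated movies with the most strongly negative correlations and flip their deviations from the per-movie mean. The data layouts (`user_ratings`, `movie_means`, `correlations`) are hypothetical names chosen for this sketch.

```python
def predict_from_negatives(user_ratings, movie_means, target, correlations, k=20):
    """Predict one user's rating of `target` using only movies the
    user rated that are negatively correlated with `target`.

    user_ratings: {movie: this user's rating}
    movie_means:  {movie: global mean rating of that movie}
    correlations: {(m1, m2): rho}, looked up in either key order
    """
    neighbors = []
    for m, r in user_ratings.items():
        rho = correlations.get((target, m), correlations.get((m, target), 0.0))
        if rho < 0.0:
            neighbors.append((rho, m, r))
    # keep the k most strongly negative neighbors
    neighbors.sort(key=lambda t: t[0])
    neighbors = neighbors[:k]
    if not neighbors:
        return movie_means[target]  # fall back to the movie's mean
    # a negative rho turns an above-average neighbor rating into a
    # below-average prediction for the target, and vice versa
    num = sum(rho * (r - movie_means[m]) for rho, m, r in neighbors)
    den = sum(-rho for rho, m, _ in neighbors)
    return movie_means[target] + num / den
```

For example, if a user rated a perfectly anti-correlated movie (rho = -1) two points above its mean, this sketch predicts the target two points below its own mean.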

About Me

In 1998, I completed a PhD in Computation and Neural Systems at Caltech, where I focused on machine learning. Since finishing my PhD, I have worked for about a year at NASA Ames Research Center and then done four-year stints in the internet startup world and as a quant for a hedge fund in Chicago. At the moment, though, I am a man of leisure aside from my Netflix Prize labors.

I can be contacted at joe _ sill at yahoo dot com.

Saturday, July 25, 2009

My Personal Opinion

I just want to say that regardless of how things turn out this weekend, I think the members of BellKor's Pragmatic Chaos are probably the best collaborative filtering researchers in the world. In particular, Yehuda Koren's contributions to the field over the past few years have been enormous. I also consider Pragmatic Theory's rise to the top spot while working full-time in an unrelated area to be a jaw-droppingly impressive feat. Of course, I have no particular standing to make such pronouncements, but that's what I think.

Nothing that has happened or will happen over the next 24 hours is going to change that assessment.

Wednesday, February 18, 2009


Howdy. I am Joe Sill, the sole member of team "Expensive Lunch".

As I described in a post in the Netflix Prize forum, I have implemented a version of Pyflix which provides easy access to the dates on which user ratings were made. Anyone who's interested in using this version should send me an email via the Netflix Prize forum email facility.