Adventures from Kathmandu to San Francisco by Sandeep Giri: How to win $1 Million from Netflix?

Fellow blogger Michael Fassnacht noted rightly that Netflix has caused quite a stir for marketing data geeks with their recent $1 million prize offer for "substantially improving" their existing Cinematch algorithm to make more accurate predictions of "how much someone is going to love a movie based on their movie preferences".

Call it "crowdsourcing", or harnessing "group smart" -- the approach is intriguing, and one of a kind. Being a curious soul myself, I decided to register a team from our company to check this out (who knows what may happen? we can be smart sometimes with enough luck :-)

A few interesting facts:

The contest is actually slated to go another 5 years until 2011, the bar being raised each year to improve over last year's winner
A fine but important distinction: the algorithm needs to predict how someone will rate a movie, NOT what movie someone will rent
At first glance, the data provided by Netflix seems pretty "skimpy" in terms of richness. Basically you get:

List of movies
List of ratings assigned for each movie by an extensive list of Netflix members
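To make the "predict the rating, not the rental" idea concrete, here is a minimal sketch of a naive baseline built from nothing but a ratings list. It is not Cinematch and not anything Netflix has published. The sample data, the names `fit_baseline`, `predict`, and `rmse`, and the 1-5 scale clamp are all assumptions made up for illustration. The idea is simply: start from the overall average rating, shift it by how a movie tends to be rated, then shift it again by how generous a member tends to be.

```python
from collections import defaultdict
from math import sqrt

# Made-up sample data (NOT the contest data): (member_id, movie_id, rating 1-5).
ratings = [
    ("ann", "m1", 5), ("ann", "m2", 3),
    ("bob", "m1", 4), ("bob", "m3", 2),
    ("cat", "m2", 4), ("cat", "m3", 1),
]

def fit_baseline(rows):
    """Return (global_mean, movie_offsets, member_offsets) from rating rows."""
    global_mean = sum(r for _, _, r in rows) / len(rows)

    # How far each movie's ratings sit above or below the global mean.
    by_movie = defaultdict(list)
    for _, movie, r in rows:
        by_movie[movie].append(r - global_mean)
    movie_off = {m: sum(v) / len(v) for m, v in by_movie.items()}

    # How far each member rates above or below what the movie offset predicts.
    by_member = defaultdict(list)
    for member, movie, r in rows:
        by_member[member].append(r - global_mean - movie_off[movie])
    member_off = {u: sum(v) / len(v) for u, v in by_member.items()}

    return global_mean, movie_off, member_off

def predict(model, member, movie):
    """Predict a rating; unseen members or movies fall back to a zero offset."""
    global_mean, movie_off, member_off = model
    guess = global_mean + movie_off.get(movie, 0.0) + member_off.get(member, 0.0)
    return min(5.0, max(1.0, guess))  # clamp to the assumed 1-5 scale

def rmse(model, rows):
    """Root mean squared error of the predictions against known ratings."""
    squared = [(predict(model, u, m) - r) ** 2 for u, m, r in rows]
    return sqrt(sum(squared) / len(squared))

model = fit_baseline(ratings)
print(predict(model, "ann", "m3"))  # a rating ann has not given yet
print(rmse(model, ratings))         # error on the same rows used for fitting
```

Note that the error here is measured on the same rows the model was fitted to, so it flatters the model; a fairer check would hold some ratings back and score against those.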

My first reaction was that having extra information on the movies themselves might help. There's a bunch of stuff available from IMDB. However, apparently there are license restrictions, and Netflix doesn't really consider extra data to be valuable in improving their algorithm (see the discussion thread).

The "enjoy the journey, not the destination" mantra may be apt for this contest. As you can see on the Netflix discussion forum, this process has invited all sorts of interesting conversation on the validity of approaches, whether Netflix has provided enough data, why one should even bother, etc. -- a dream peer review IMHO, albeit a bit too noisy. So, Netflix should be getting a lot more than their money's worth via this process -- not just by getting better algorithms and the PR buzz, but also by leveraging an almost open-source-type process to involve the external community in their internal R&D.

At the moment, I agree with Michael's assessment that trying to solve this with ratings data alone might not be the best way to go. There seem to be so many other interesting dimensions that should influence someone's movie rating: movie characteristics like the cast, director, etc., reviews from critics, local media reviews, and geo/demographic information about the Netflix member, among others. None of these are being considered in the current algorithm. I can understand Netflix's hesitancy to interface with 3rd party resources, but perhaps they should, first, make all the datapoints within Netflix's movie database available for this contest and, second, encourage contestants to add their own qualitative datapoints. If the goal is to approach this as a pure data mining problem, then increasing the depth of data should help.

I'll keep you all posted on how far we get with this. Being a small company, we will do this in the copious amount of spare time left over after the existing client work that pays the bills. Still, it should be a lot of fun.

Monday, October 09, 2006
