Metis Data Science - Week 3 (9/15 - 9/19)

Note: As of this post I am actually already in the 10th week of the Metis Data Science program. I’ve been writing down notes each week, but I haven’t put them anywhere until now. These posts will reflect my thoughts and opinions as I encountered new challenges each week.

This week culminated in a presentation of Project Luther. We started the week off with more in-depth discussion of linear regression and it’s evaluation metrics (over/underfitting, cross-validation, adjusted R², AIC, BIC). We reviewed model learning curves as a demonstration of how increasing model features affected the error and as a method of model selection.

The rest of the week’s time was devoted to working on linear regresion challenges and finishing our projects in time for presentation day. We were required to submit an MVP (minimum viable product) as a check-in for progress, though I ended up presenting something completely different. I actually spent a majority of the final day working on making the charts easy to read and pleasing to the eye. Everything was done in matplotlib, which has quite a horrendous default color scheme. This lead me to disover prettyplotlib which is essentially a library that overrides a lot of the matplotlib defaults with colorbrewer.

If you’ve been reading to this point you’re probably curious about what my actual project was. I’ve decided it diserved it’s own separate post.

Thoughts:

Despite working with a well-used data set (movies tend to be pretty popular for this kind of thing) there were still a lot of interesting insights presented.
Scraping data from html can be quite a hassle.
A lot of people tried to predict movie success using revenue as the predicted metric. Some ideas included: predicting movie success from wikipedia page views of actors, predicting movie success in international markets based on country of release, and predicting the success of movie sequels based on prequel performance.

Written on November 4, 2014