Metis Data Science - Week 5 (9/29 - 10/3)


Note: As of this post I am actually already in the 10th week of the Metis Data Science program. I’ve been writing down notes each week, but I haven’t put them anywhere until now. These posts will reflect my thoughts and opinions as I encountered new challenges each week.


This week brought us deeper into the world of classification models. We started with an introduction to metrics for evaluating classifier performance. Precision, recall, sensitivity, specificity, receiver operating characteristic (ROC) curves, and the confusion matrix were all explored using scikit-learn. Precision (the fraction of positive predictions that are actually correct) and recall (the fraction of actual positives that the model correctly identifies) are metrics that measure different types of error. I’m learning that achieving high precision AND recall with a single model doesn’t always happen, and depending on the situation it’s sometimes better to just focus on one instead of trying to optimize for both.
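
Here’s a minimal sketch of those metrics in scikit-learn. The data is synthetic (generated with make_classification) just so the example runs on its own; the workflow is the same as what we did in class.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_score, recall_score,
                             confusion_matrix, roc_curve, roc_auc_score)

# Synthetic binary classification data, split into train and test sets
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # scores for the positive class

print("precision:", precision_score(y_test, y_pred))  # correct positives / predicted positives
print("recall:   ", recall_score(y_test, y_pred))     # correct positives / actual positives
print(confusion_matrix(y_test, y_pred))                # rows = true class, columns = predicted class

fpr, tpr, thresholds = roc_curve(y_test, y_prob)       # points along the ROC curve
print("AUC:", roc_auc_score(y_test, y_prob))
```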

We then moved on to other classification algorithms: decision trees, support vector machines (SVMs), and naive Bayes. Decision tree classifiers split the data on the feature that gives the largest information gain (the biggest reduction in entropy), and are generally fast and easy to understand, with the downside of being not very accurate and prone to overfitting. One way to rein in overfitting is to specify a maximum depth, but a better way to deal with a poor fit is to use ensemble methods like a random forest (bagging) classifier. SVMs classify data by finding the boundary with the maximum margin between groups of data, and are generally a very good algorithm. However, I like how they emphasized that the true value of data science comes from interpretation of the classification results. Simply fitting better isn’t the end goal; it’s how your questions are phrased and answered. Sometimes a simple logistic regression is the better model to use because the results are more easily understood.
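
A quick sketch of how these classifiers line up against each other in scikit-learn, again on synthetic data. The max_depth setting on the decision tree is the simple guard against overfitting mentioned above, and the random forest is the bagging-style ensemble alternative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

models = {
    "decision tree (max_depth=4)": DecisionTreeClassifier(max_depth=4),
    "random forest (bagging)": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(),
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(),
}

# 5-fold cross-validated accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```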

The rest of the time this week was spent on making further progress on Project McNulty. We split up into our project teams and worked on producing an agenda document that laid out the team goals. Our company name, the data set we chose to work on, and our justifications for each of our individual topics had to be included. By the end of the week we were all rushing to produce our MVP for the project, and I was still trying to obtain relevant data to work with. I settled on using data from the BRFSS prevalence and trends data, which had convenient tables grouped by state. I quickly built a classifier model trained on the UCI data to predict the heart disease rate on the aggregate BRFSS state data, but the results aren’t very interesting so far.
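
Roughly, the MVP workflow looked like the sketch below (not my exact code): fit a classifier on the individual-level UCI heart disease records, then apply it to the state-level BRFSS aggregates. The file names and column names here are hypothetical placeholders; mapping features between the two datasets is the part that still needs real work.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical local copies of the two datasets
uci = pd.read_csv("uci_heart_disease.csv")
brfss = pd.read_csv("brfss_state_aggregates.csv")

# Assumed set of features shared (or derivable) across both tables
features = ["age", "sex", "chol", "trestbps"]

# Train on individual UCI records, where "has_disease" is a placeholder target column
model = LogisticRegression().fit(uci[features], uci["has_disease"])

# Score each state's aggregate feature values with the fitted model
brfss["predicted_rate"] = model.predict_proba(brfss[features])[:, 1]
print(brfss[["state", "predicted_rate"]])
```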

Written on November 5, 2014