Metis Data Science - Week 4 (9/22 - 9/26)


Note: As of this post I am actually already in the 10th week of the Metis Data Science program. I’ve been writing down notes each week, but I haven’t put them anywhere until now. These posts will reflect my thoughts and opinions as I encountered new challenges each week.


A new week, a new project. We began with a primer on supervised classification algorithms (logistic regression, K-nearest neighbors) and coding challenges. The topic then moved into the cloud, where we set up an Ubuntu server deployed on a DigitalOcean droplet. We had our first introduction to databases by setting up a MySQL server, followed by a short primer on SQL syntax and some more challenges using SQL queries on a practice set of men’s and women’s tennis results from the US, Australian, and French Opens. To tie all of this into Python, we were introduced to the MySQLdb package, which facilitates SQL queries from within a Python environment.
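To give a flavor of that workflow, here’s a minimal sketch of running a query through MySQLdb. The connection details, table, and column names below are placeholders, not the actual schema we used:

```python
import MySQLdb

# Connect to the MySQL server running on the droplet
# (host, credentials, and database name are placeholders).
db = MySQLdb.connect(host="localhost", user="metis",
                     passwd="password", db="tennis")
cursor = db.cursor()

# Hypothetical table/column names -- the real schema depends on
# how the tennis results were loaded.
cursor.execute("""
    SELECT winner, loser, round
    FROM us_open_men
    WHERE year = %s
""", (2013,))

for row in cursor.fetchall():
    print(row)

cursor.close()
db.close()
```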

The latter half of the week was devoted to our formal introduction to the new project, McNulty. This was to be a team project: the goal was to create a scenario in which we acted as an internal data science team for a company. The class was given the choice of three data sets, and my group decided to work on the UC Irvine heart-disease data set. The scenario we created was that we were the internal data science team of a health insurance company. It didn’t seem like the most creative idea at the outset, but we felt there was a lot of room for the team to explore different ideas with the same data set. My idea was to use the UC Irvine data from 1980 to train a classifier, then apply that classifier to more recent survey data to assess the overall health of different regions of the USA. The idea was to help the insurance company identify high-risk markets that it could avoid or develop different pricing structures for.
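As a rough sketch of the kind of classifier I had in mind (assuming the heart-disease data has already been cleaned into a CSV with a binary target column; the file, column, and parameter names here are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical file and column names -- assumes the UCI heart-disease
# data has been cleaned into a CSV with a binary 'disease' target.
df = pd.read_csv("heart_disease.csv")
X = df.drop("disease", axis=1)
y = df["disease"]

# Hold out 30% of the data to estimate out-of-sample accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out set
```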

Thoughts this week:

  • The structure of our new project is very different from the previous one. We’re working in groups: pulling and pushing to a group GitHub repo, pulling and cleaning data, and inserting it into the team database server we set up (a sketch of that load step follows this list). We have one data set to work with as a group, but each of us still has an individual project, so we’re given freedom to choose our own research direction.
  • During our brainstorming sessions we threw around a lot of surprisingly creative scenario ideas: a hospital or non-profit interested in identifying high-risk populations to target for education outreach, a fitness center interested in marketing its fitness services to “unhealthy” people, and a government agency interested in observing population health to establish guidelines for policy-making.
  • We had our first industry guest this week, from Rush Street Gaming. They’re a regional casino owner that operates across various cities in the Midwest. What I found most striking is that they’re just starting up a data science team focused on streamlining and optimizing casino operations. I always assumed casinos were huge, complicated machines, highly optimized using customer data to produce as much profit as possible, but the surprising news was that they… weren’t. They weren’t even recording customer data. Like a lot of older industries, much of what they do is apparently built on best practices carried down through experience, and customer retention is handled through frequent-player card programs. The different parts of the casino (like the kitchen versus the casino floor bar) are managed entirely separately from each other. It makes a lot of sense for casinos to be interested in optimizing operations using data science; it just surprises me that they weren’t already doing it.
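For the database load step mentioned above, here’s a minimal sketch of how cleaned data might be bulk-inserted into the team MySQL server with MySQLdb. The file name, host, credentials, and table name are all placeholders:

```python
import MySQLdb
import pandas as pd

# Hypothetical cleaning-and-loading step: read the raw UCI file
# (file name assumed), drop incomplete rows, and bulk-insert the
# result into the shared team database.
df = pd.read_csv("processed.cleveland.data", header=None, na_values="?")
df = df.dropna()

db = MySQLdb.connect(host="team-droplet", user="metis",
                     passwd="password", db="mcnulty")
cursor = db.cursor()

# Build one "%s" placeholder per column and insert all rows at once.
placeholders = ", ".join(["%s"] * len(df.columns))
rows = [tuple(map(float, row)) for row in df.values]
cursor.executemany("INSERT INTO heart_disease VALUES (%s)" % placeholders,
                   rows)

db.commit()
db.close()
```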
Written on November 5, 2014