Metis Data Science - Week 6 (10/6 - 10/10)

Note: As of this post I am actually already in the 10th week of the Metis Data Science program. I’ve been writing down notes each week, but I haven’t put them anywhere until now. These posts will reflect my thoughts and opinions as I encountered new challenges each week.

This week we’re introduced to data visualization, with emphasis on D3.js. I was actually pretty excited for this because I’ve heard a lot about D3, but haven’t had a chance to work with it until now. We spent the better part of two days learning the basics of D3. The first day started with the basics: introductions to D3 syntax, javascript method chaining, SVG shapes. We were walked through a simple example of constructing a bar chart. While powerful, D3 can be quite aggravating to work with since nearly everything needs to be explicitly defined, such as the position of each individual bar. Helpful built-in functions alleviate some of the tedium, but as a data visualization package it’s noticeably different in structure from anything else I’ve ever worked with. We even ran into problems during our example when the bars weren’t positioning correctly.

Our second day featured a guest instructor from Kaplan: Paul Buffa, director of business analytics & streategy at Kaplan Test Prep. He’s actually been visiting us about once a week, helping us with project work and general questions. He demonstrated a few of his favorite visualization examples that showcased what is possible with D3: a chart that dynamically transitioned between different chart types using the same set of data, an interactive word tree, etc. It looked really impressive, but I’m not sure I’ll have time to implement any of these features given the limited time left for our current project. Paul also showed a number of helpful example sites:

And some additional helper packages:

The rest of our time was spent trying to finish project Mcnulty for our group presentations later in the week. I bounced between a number of ideas, but due to time constraints I settled on a simple logistic regression model that predicted the likelihood of heart disease for people in each state, based on a few features I could find aggregate statistics for. Part of the reason was I really wanted to include a D3 visualization as part of my project, and I had a lot of difficulty figuring out how to get it to work. We each presented our Mcnulty projects within our group members (to simulate the scenario of an “internal group presentation” at a company). I wished we could have presented our work to the entire class, but I can understand the reasoning behind having it this way. You can view my project here.

Disjointed thoughts:

  • D3 looks amazing, but it’s notoriously difficult to work with. That’s something I finally experienced first-hand this week when it took me nearly an entire day just to build a simple horizontal bar chart.
  • We had record FOUR guest speakers this week. @erikhinton of the New York Times gave a presentation on his experience working there as a web developer. This was a shared guest speaker with the Ruby on Rails and UX/UI programs at Metis.
  • We also had the senior VP of MarketBridge, Peter Guy (@ThePGuy) talk to us about the business side of analytics from the perspective of a marketing firm.
  • Cameron Wellock of nRelate talked about his work in optimizing their recommndation algorithm. Their business is producing the “related links” part of websites like buzzfeed and slate. On the surface it seems like a pretty trivial probem to just match website content, but there’s a lot of data cleaning and adjustments to be made.
  • Finally, Dan Valente (@dpvalente) from Chartbeat gave a very interesting case study on using Latent Dirichlet Allocation (LDA), a type of topic modeling. In LDA, topics are defined mathematically as a probability distribution around some vocabulary. Chartbeat provides real-time analytics for web traffic, the problem they were having is categorizing web pages by topic or content. He shared an iPython Notebook showing the LDA in action where he extracted topics from a gigantic corpus of some million+ web pages. It worked well enough, but now there is problem of productionizing this process.
Written on November 6, 2014