Metis Data Science - Week 7 (10/13 - 10/17)


Note: As of this post I am actually already in the 10th week of the Metis Data Science program. I’ve been writing down notes each week, but I haven’t put them anywhere until now. These posts will reflect my thoughts and opinions as I encountered new challenges each week.


We are now officially halfway complete with the bootcamp. With Mcnulty behind, it’s time to move forward with new topics. We started with a nice visual introdution to webapps (link). We then worked through an example in building a simple app that receives data in json format and returns prediction results from a logistic regression classifier, which is essentially an api. On the database front, we set up a mongodb server and sped through quick examples of mongodb queries and syntax. Our last topic of the week was NLP (natural language processing). We were introduced to some of the core concepts of text analysis with nltk and textblob. While there are more complicated models, the basics are all centered around tokenizing documents of text that can be then clustered for similarity, classified for positive/negative sentiment, or grouped into topics. Linking all of these concepts together involved an exercise in interacting with the twitter api to pull down tweets and insert them to mongodb where we could query and perform some text analysis on.

This week also introduced us to our next project: Fletcher. This project will be working with unsupervised clustering algorithms on text data and storing them in a noSQL database. The latter half of this week was spent running through our brainstorming and design sessions for our projects. In contrast to our previous project, this one is once again an individual one. I’m leaning towards sticking with working with the twitter api and doing some clustering on tweets to see if there is some common topics among them. I’m also interested in seeing what twitter has to say about the developing ebola epidemic in West Africa.


Random Thoughts:

  • This are really getting busy now. Each of the new topics came with challenges for us to work on in addition to the twitter api exercise and the fletcher project scoping.
  • I had quite the time trying to parse the mongodb documentation to make sense of how to set up authentication in the database server for remote access and figuring out the differences between a regular and a capped collection.
  • Twitter has a streaming api (link) that seems perfect for capturing tweets in real-time as they arrive. The RESTful api doesn’t seem well suited for gathering tweets from the past, as they only seem to allow you to query for tweets up to a week ago.
  • We had just one guest speaker this week, Ben Bleikamp from Github. He was another guest speaker for all three of the Metis bootcamps, and you can read more about it on their official blog.
Written on November 9, 2014