machine learning

TensorFlow's DC Introduction

The Hello, TensorFlow! post introducing the basic workings of the TensorFlow deep learning framework, up now in O'Reilly's Data, AI, and Learning sections, is a product of the local data community.

Aaron Schumacher, one of the Data Science DC organizers and an employee of Arlington-based Deep Learning Analytics, wrote the article with the support of many local reviewers, including feedback from members of the DC Machine Learning Journal Club.

Aaron will be giving a talk on the material of Hello, TensorFlow! on Wednesday June 29 as part of the Deep Dive into TensorFlow meetup to be hosted at Sapient in Arlington. It should be a great opportunity to explore and discuss this new and exciting tool!

Supervised Machine Learning with R Workshop on April 30th

Supervised Machine Learning with R Workshop on April 30th

Data Community DC and District Data Labs are hosting a Supervised Machine Learning with R workshop on Saturday April 30th. Come out and learn about R's capabilities for regression and classification, how to perform inference with these models, and how to use out-of-sample evaluation methods for your models!

Deep Learning Inspires Deep Thinking

This is a guest post by Mary Galvin, founder and managing principal at AIC. Mary provides technical consulting services to clients including LexisNexis’ HPCC Systems team. The HPCC is an open source, massive parallel-processing computing platform that solves Big Data problems. 

Data Science DC hosted a packed house at the Artisphere on Monday evening, thanks to the efforts of organizers Harlan Harris, Sean Gonzalez, and several others who helped plan and coordinate the event. Michael Burke, Jr, Arlington County Business Development Manager, provided opening remarks and emphasized Arlington’s commitment to serving local innovators and entrepreneurs. Michael subsequently introduced Sanju Bansal, a former MicroStrategy founder and executive who presently serves as the CEO of an emerging, Arlington-based start-up, Hunch Analytics. Sanju energized the audience by providing concrete examples of data science’s applicability to business; this no better illustrated than by the $930 million acquisition of Climate Corps. roughly 6 months ago.

Michael, Sanju, and the rest of the Data Science DC team helped set the stage for a phenomenal presentation put on by John Kaufhold, Managing Partner and Data Scientist at Deep Learning Analytics. John started his presentation by asking the audience for a show of hands on two items: 1) whether anyone was familiar with deep learning, and 2) of those who said yes to #1, whether they could explain what deep learning meant to a fellow data scientist. Of the roughly 240 attendees present, the majority of hands that answered favorably to question #1 dropped significantly upon John’s prompting of question #2.

I’ll be the first to admit that I was unable to raise my hand for either of John’s introductory questions. The fact I was at least a bit knowledgeable in the broader machine learning topic helped to somewhat put my mind at ease, thanks to prior experiences working with statistical machine translation, entity extraction, and entity resolution engines. That said, I still entered John’s talk fully prepared to brace myself for the ‘deep’ learning curve that lay ahead. Although I’m still trying to decompress from everything that was covered – it being less than a week since the event took place – I’d summarize key takeaways from the densely-packed, intellectually stimulating, 70+ minute session that ensued as follows:

  1. Machine learning’s dirty work: labelling and feature engineering. John introduced his topic by using examples from image and speech recognition to illustrate two mandatory (and often less-than-desirable) undertakings in machine learning: labelling and feature engineering. In the case specific to image recognition, say you wanted to determine whether a photo, ‘x’, contained an image of a cat, ‘y’ (i.e., p(y|x)). This would typically involve taking a sizable database of images and manually labelling which subset of those images were cats. The human-labeled images would then serve as a body of knowledge upon which features representative of those cats would be generated, as required by the feature engineering step in the machine learning process. John emphasized the laborious, expensive, and mundane nature of feature engineering, using his own experiences in medical imaging to prove his point.

    Above said, various machine learning algorithms could use the fruits of the labelling and feature engineering labors to discern a cat within any photo – not just those cats previously observed by the system. Although there’s no getting around machine learning’s dirty work to achieve these results, the emergence of deep learning has helped to lesson it.

  2. Machine Learning’s ‘Deep’ Bench. I entered John’s presentation knowing a handful of machine learning algorithms but left realizing my knowledge had barely scratched the surface. Cornell University’s machine learning benchmarking tests can serve as a good reference point for determining which algorithm to use, provided the results are taken into account with the wider, ‘No Free Lunch Theorem’ consideration that even the ‘best’ algorithm has the potential to perform poorly on a subclass of problems.

    Provided machine learning’s ‘deep’ bench, the neural network might have been easy to overlook just 10 years ago. Not only did it place 10th in Cornell’s 2004 benchmarking test, but John enlightened us to its fair share of limitations: inability to learn p(x), inefficiencies with layers greater than 3, overfitting, and relatively slow performance.

  3. The Restricted Boltzmann Machine’s (RBM’s) revival of the neural network. The year 2006 witnessed a breakthrough in machine learning, thanks to the efforts of an academic triumvirate consisting of Geoff Hinton, Yann LeCun, and Yoshua Bengio. I’m not going to even pretend like I understand the details, but will just say that their application of the Restricted Boltzmann Machine (RBM) to neural networks has played a major role in eradicating the neural network’s limitations outlined in #2 above. Take, for example, ‘inability to learn p(x)’. Going back to the cat example in #1, what this essentially states is that before the triumvirate’s discovery, the neural net was incapable of using an existing set of cat images to draw a new image of a cat. Figuratively speaking, not only can neural nets now draw cats, but they can do so with impressive time metrics thanks to the emergence of the GPU. Stanford, for example, was able to process 14 terabytes of images in just 3 hours through overlaying deep learning algorithms on top of a GPU-centric computer architecture. What’s even better? The fact that many implementations of the deep learning algorithm are openly available under the BSD licensing agreement.

  4. Deep learning’s astonishing results. Deep learning has experienced an explosive amount of success in a relatively small amount of time. Not only have several international image recognition contests been recently won by those who used deep learning, but technology powerhouses such as Google, Facebook, and Netflix are investing heavily in the algorithm’s adoption. For example, deep learning triumvirate member Geoff Hinton was hired by Google in 2013 to help the company make sense of their massive amounts of data and to optimize existing products that use machine learning techniques. Fellow deep learning triumvirate member Yann LeCun was hired by Facebook, also in 2013, to help integrate deep learning technologies into the company’s IT systems.

As for all the hype surrounding deep learning, John concluded his presentation by suggesting ‘cautious optimism in results, without reckless assertions about the future’. Although it would be careless to claim that deep learning has cured disease, for example, one thing most certainly is for sure: deep learning has inspired deep thinking throughout the DC metropolitan area.

As to where deep learning has left our furry feline friends, the attached YouTube video will further explain….

(created by an anonymous audience member following the presentation)

You can see John Kaufhold's slides from this event here.

Ensemble Learning Reading List

Tuesday's Data Science DC Meetup features GMU graduate student Jay Hyer's introduction of Ensemble Learning, a core set of Machine Learning techniques. Here are Jay's suggestions for readings and resources related to the topic. Attend the Meetup, and follow Jay on Twitter at @aDataHead! Also note that all images contain Amazon Affiliate links and will result in DC2 getting a small percentage of the proceeds should you purchase the book. Thanks for the support!

L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. Classification and Regression Trees. Chapman and Hall.CRC, Boca Raton, FL, 1984.

This book does not cover ensemble methods, but is the book that introduced classification and regression trees (CART), which is the basis of Random Forests. Classification trees are also the basis of the AdaBoost algorithm. CART methods are an important tool for a data scientist to have in their skill set.

L. Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.

This is the article that started it all.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd ed. Springer, New York, NY, 2009.

This book is light on application and heavy on theory. Nevertheless, chapters 10, 15 & 16 give very thorough coverage to boosting, Random Forests and ensemble learning, respectively. A free PDF version of the book is available on Tibshirani’s website.

G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning: with Apllications in R, Springer, New York, NY, 2013.

As the name and co-authors imply, this is an introductory version of the previous book in this list. Chapter 8 covers, bagging, Random Forests and boosting.

Y. Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting Journal of Computer and System Sciences, 55(1): 119-139, 1997.

This is the article that introduced the AdaBoost algorithm.

G. Seni, and J. Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool Publishers, USA, 2010.

This is a good book with great illustrations and graphs. There is also a lot of R code too!

Z.H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 2012.

This is an excellent book the covers ensemble learning from A-Z and is well suited for anyone from an eager beginner to a critical expert.

2013 September DSMD Event: Teaching Data Science to the Masses

Stats-vs-data-science For Data Science MD's Septmeber meetup, we were very fortunate to have the very talented and very passionate Dr. Jeff Leek speak about his experiences teaching Data Science through the online learning platform Coursera. This was also a unique event for DSMD itself because it was the first meetup that only featured one speaker. Having one speaker speak for a whole hour can be a disaster if the speaker is unable to keep the attention of those in the audience. However, Dr. Leek is a very dynamic and engaging speaker and had no problem keeping the attention of everyone in the room, including a couple of middle school students.

For those of you who are not familiar with Dr. Leek, he is a biostatistician at Johns Hopkins University as well as a instructor in the JHU biostatistics program. His biostatistics work typically entails analyzing human genome sequenced data to provide insights to doctors and patients in the form of raw data and advanced visualizations. However, when he is not revolutionizing the medical world or teaching the great biostatisticians of tomorrow at JHU, you may look for him teaching his course on Coursera, or providing new content to his blog, Simply Statistics.

Now, on to the talk. Johns Hopkins and specifically Dr. Leek got involved in teaching a Coursera course because they have constantly been looking at ways to improve learning for their students. They had been "flipping the classroom" by taking lectures and posting them to YouTube so that students could review the lecture material before class and then use the classroom time to dig deeper into specific topics. Because online videos are such a vital component of Massive Open Online Classes (MOOCs), it is no surprise that they took the next leap.

But just in case you think that JHU and Dr. Leek are new to this whole data science "thing," check out their Kaggle team's results for the Heritage Health Prize.


Even though their team fell a few places when run on the private data, they still had a very impressive showing considering there were 1358 teams that entered and over 20,000 entries. But what exactly does data science mean to Dr. Leek? Check out his expanded components of data science chart, that differs from similar charts of other data scientists by showing the root disciplines of each component too.


But what does the course look like?


He covers topics such as type of analyses, how to organize a data analysis, data munging as well as others like:



One of the interesting things to note though is that he also shows examples of poor data analysis attempts. There is a core problem with the statistics example from above (pointed out by high school students). Below is an example of another:


And this course, in addition to two other courses, Computing for Data Analysis and Mathematical Biostatistics Bootcamp taught by other JHU faculty, have had a very positive response.


But how do you teach that many people effectively? That is where the power of Coursera comes in; JHU could have chosen other providers like edX or Udacity but decided to go with Coursera. The videos make it easy to convey knowledge and message boards provide a mechanism to ask questions. Dr. Leek even had students answering questions for other students so that all he had to do was validate the response. But he also pointed out that his class' message board was just like all other message boards and followed 1/98/1 rule where 1% of people respond in a mean way and are unhelpful, 1% of people are very nice and very helpful and the other 98% don't care and don't really respond.

One of the most unique aspects of Coursera is that it helps to scale to tens of thousands of students by using peer/student grading. Each person grades 4 different assignments so that everyone is submitting one answer and grading 4 others. The final score for each student is the median of the four scores from the other students. The rubric used in Dr. Leek's class is below.


The result of this grading policy, based on Dr. Leek's analysis is that good students received good grades, poor students received poor grades and middle students' grades fluctuated a fair amount. So it seems like the policy works mostly, but there is still room for improvement.

But why does Johns Hopkins and Dr. Leek even support this model of learning? They do have full time jobs that involve teaching after all. Well, besides being huge supporters of open source technology and open learning, they also see many other reasons for supporting this paradigm.


Check out the video for the many other reasons why JHU further supports this paradigm. And, while you are at it, see if you can figure out if the x and y axes are related in some way. This was our data science/statistics problem for the evening. The answer can also be found in the video.


We also got a sneak peek at a new tool/component that integrates directly into R - swirl. Look for a meetup or blog post about this tool in the future.


Our next meetup is on October 9th at in Baltimore beginning at 6:30PM. We will have Don Miner speak about using Hadoop for Data Science. If you can make it, come out and join us.

Weekly Round-Up: Data Analysis Tools, M2M, Machine Learning, and Naming Babies

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data analysis tools to naming babies. In this week's round-up:

  • Data Analysis Tools Target Non-experts
  • How M2M Data Will Dominate the Big Data Era
  • What Hackers Should Know About Machine Learning
  • Knowledge Engineering Applied to Baby Names

Data Analysis Tools Target Non-experts

Our first piece this week is an O'Reilly Strata article about some of the data analysis tools that are coming to market and are aimed at providing business users with the analytics they need to make decisions. The article highlights several tools from a variety of companies and categorizes them into three different categories according to what they help you do. The article also includes links to all the companies' websites so that, if you're anything like me, you can check out every single one of them.

How M2M Data Will Dominate the Big Data Era

The Internet of Things is getting a lot of attention these days, partly due to the amount of data that gets produced when one connected device communicates with another connected device. This is known as Machine-to-Machine data (M2M), and this Smart Data Collective article describes where a lot of this data may come from and how much data can potentially be generated.

What Hackers Should Know About Machine Learning

Our third piece is a Fast Company interview with Drew Conway, the author of the must-own book Machine Learning for Hackers. In the interview Drew answers questions about why developers should learn machine learning, the biggest knowledge gaps they need to overcome, and the differences between a machine learning project and a development project. (Editor's Note, the image to the left links to Amazon where if you buy the book we get a small cut of the proceeds. Buy enough books through this link, and we retire to an island.)

Knowledge Engineering Applied to Baby Names

Our final piece this week is a blog post about a company called Nameling is in the midst of holding a contest to improve the algorithms behind their baby name recommendation engine. Coming up with a good name for your baby is very important to parents, as the consequences of choosing a bad one almost certainly result in ridicule and tears. It should be interesting to see the results of the contest as well as what kinds of names the recommendation engine spits out.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: Machine Learning, DIY Data Scientists, Games, and Helping Couples Conceive

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from machine learning to helping couple's conceive. In this week's round-up:

  • Jeff Hawkins: Where Open Source and Machine Learning Meet Big Data
  • The Rise Of The DIY Data Scientist
  • Why Games Matter to Artificial Intelligence
  • Three Questions for Max Levchin About His New Startup

Jeff Hawkins: Where Open Source and Machine Learning meet Big Data

Our first piece this week is an InfoWorld article about Jeff Hawkins, the machine learning work that him and his company have been doing, and the open source project they've recently released on Github. The project's name is the Numenta Platform for Intelligent Computing (NuPIC) and it's goal is to allow others to be able to embed machine intelligence into their own systems. The article has a short interview with Jeff and a link to the Github page where the project resides.

The Rise Of The DIY Data Scientist

This is an interesting Fast Company article about how Kaggle competition winners tend to be self-taught. The author of the article interview's Kaggle's chief scientist Jeremy Howard about this phenomenon and other interesting findings derived from Kaggle's competitions about data scientists. Some of the questions inquire about where the winners are from, how they learned data science, and what machine learning algorithms they use.

Why Games Matter to Artificial Intelligence

This blog post on the IBM Research blog is an interview with Dr. Gerald Tesauro about the significance of games in the Artificial Intelligence field. Dr. Tesauro was the IBM research scientists who taught Watson how to play Jeopardy. In the interview, he explains how games tend to be an ideal training ground for machines because they tend to simplify real life. He goes on to answer questions about how that prepares the machines for transitioning to other real-world problems, what he's currently working on, what Watson is doing these days, and where else machine learning can be used.

Three Questions for Max Levchin About His New Startup

Our final piece this week is an MIT Technology Review article about PayPal co-founder Max Levchin's new startup called Glow. A lot of people are having children later in life these days and one downside of this is that many couples have trouble trying to conceive. Levchin has developed an iPhone app that uses data to help couples identify the optimal time for conception. In this brief interview, Levchin talks about what they are doing, why, and the degree of accuracy they hope to achieve.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

The State of Recommender Technology

Reblogged with permission from Cobrain. socialnetwork_graph

So let’s start with the big idea that is the reason that we are all here: recommendation engines. If you are reading this, you have probably already overcome the mental hurdle of the massive design and implementation challenge that recommendation engines represent, otherwise I can’t imagine why you would have signed up! Or perhaps you don’t know what a massive design and implementation challenge recommendation engines represent. Either way, you’re in the right place- this post is an introduction to the state of the technology of recommendation systems.

Well sort of– here is a working state of the technology: Academia has created a series of novel machine learning and predictive algorithms that would allow scarily accurate trend analysis, recommendations, and predictions given the right, unbiased supervised training sets of sufficient magnitude. Commercial applications in very specific domains have leveraged these insights and extremely large data sets to create interesting results in the release phase of applications but have found that over time the quality of these predictions decreases rapidly. Companies with even larger data sets that have tackled other algorithmic challenges involving supervised training sets (Google) have avoided current recommender systems because of their domain specificity, and have yet to find a generic enough application.

To sum up:

Recommendation Engines are really really hard, and you need a whole heckuva lot of data to make them work.

Now go build one.

Don’t despair though! If it wasn’t hard, everyone would be doing it! We’re here precisely because we want to leverage existing techniques on interesting and novel data sets, but also to continue to push forward the state of the technology. In the process we will probably learn a lot and hopefully also provide a meaningful experience for our users. But before we get into that, let’s talk more generically about the current generation of recommender systems.

Who Does it Well?

The current big boys in the recommendation space are AmazonNetflixHunch (now owned by eBay), Pandora, and Goodreads. I strongly encourage you to understand how these guys operate and what they do to create domain specific recommendations. For example, the domain of Goodreads, Netflix, and Pandora is books, movies, and music respectively. Recommending inside a particular domain allows you to leverage external knowledge resources that either solve scarcity issues or allow ontological reasoning that can add a more accurate layer on top of the pure graph analyses that typically happen with recommenders.

Amazon and Hunch seem to be more generic, but in fact they also have domain qualification. Amazon has the data set of all SKU-level transactions through it’s massive eCommerce site. Even so, Amazon has spent 10 years and a lot of money perfecting how to rank various member behaviors. Because it is Amazon-specific, Amazon can leverage Amazon-only trends and purchasing behaviors, and they are still working on perfecting it. Hunch doesn’t have an item-specific domain, but rather a system-specific domain, using social and taste-making graphs to propose recommendations inside the context of social networks.

Speaking of Amazon’s decade long effort to create a decent recommender with tons of data, I hope you’ve heard of the Netflix Prize. Netflix was so desperate for a better algorithm for recommendations that they instituted an X-Prize like contest for a unique algorithm for recommending movies in particular. In fact, the test methodology for the Netflix Prize has become a standard for movie recommendations, and since 2009 (when the prize was awarded) other algorithm sets have actually achieved better results, most notably, Filmaster.

Given what these companies have tried to do, we can more generically speak of the state of the technology as follows: An “adequate” recommender system comprises of the following items:

  1. An unbiased, non-scarce data set of sufficient size
  2. A suite of machine learning and predictive algorithms that traverse that data set
  3. Knowledge resources to apply transformations on the results of those algorithms

Pandora is a great example of this. They have created an intensive project at detailing a “music genome” or an ontological breakdown of a sample of music. The genome itself is the knowledge resource. The analysis of the genomics of a piece of music aggregated across a large number of pieces is the unbiased non-scarce data set of sufficient size. Finally the suite recommendation algorithms that Pandora applies to these two sets then generates ranked recommendations that are interesting.

Types of Recommenders

Without getting into a formal description of recommenders, I do want to list a few of the common types of recommendation systems that exist within domain specific contexts. To do this, I need to describe the two basic classes of algorithms that power these systems:

  1. Collaborative Filtering: recommendations based on shared behavior with other people or things. E.g. if you and I bought a widget, and I also bought a sprocket, it is likely that you would also like a sprocket.
  2. Expert Adaptive or Generative Systems: recommendations based on shared traits of people or things or rules about how things interact with each other in a non-behavior way. E.g. if you play football and live in Michigan, this particular pair of cleats is great in the snow.

In the world of recommenders, we are trying to create a semantic relationship between people and things, therefore we can discuss person-centric and item-centric approaches in each of these classes of algorithms; and that gives us four main types of recommenders!

  1. Personalized Recommendations- A person-centric, expert adaptive model based on the person’s previous behavior or traits.
  2. Social/Collaborative Recommendations- A person-centric collaborative filtering model based on the past behavior of people similar to you, either because of shared traits or shared behavior. Note that the clustering of similar people can fall into either algorithm set, but the recommendations come from collaborative filtering.
  3. Ontological Reasoned Recommendations- An item-centric expert adaptive system that uses rules and knowledge mined with machine learning approaches to determine an inter-item relational model.
  4. Basket Recommendations- An item-centric collaborative filtering algorithm that uses inter-item relationships like “purchased together” to create recommendations.

Keep in mind, however, that these types of recommenders and classes are very loose and there is a lot of overlap!


Now that large scale search has been dramatically improved and artificial intelligence knowledge bases are being constructed with a reasonable degree of accuracy, it is generally considered that the next step in true AI will be effective trend and prediction analysis. Methodologies to deal with Big Data have evolved to make this possible, and many large companies are rushing towards predictive systems with a wide range of success. Recent approaches have revealed that near-time, large data, domain-specific efforts yield interesting results, if not truly predictive.

The overwhelming challenge is not just in engineering architectures that traverse graphs extremely well (see the picture at the top of this post), but also in finding a unique combination of data, algorithms, and knowledge that will give our applications a chance to provide truly scary, inspiring results to our users. Even though this might be a challenge, there are four very promising approaches that we can leverage within our own categories.

Stay tuned for more on this topic soon!

Weekly Round-Up: Computer Vision, Machine Learning, Benchmarking, and R Packages

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from computer vision to popular R packages. In this week's round-up:

  • Google Explains How AI Photo Search Works
  • Matter Over Mind in Machine Learning
  • Principles of ML Benchmarking
  • A List of R Packages, By Popularity

Google Explains How AI Photo Search Works

This is an interesting blog post about how Google recently enhanced their image search functionality using computer vision and machine learning algorithms. The post describes in layman's terms how the algorithms work and how they are able to classify pictures. It also includes a link to Google's research blog, where they made the original announcement.

Matter Over Mind in Machine Learning

This is a post on the BigML blog which talks about the work of Dr. Kiri Wagstaff from NASA's Jet Propulsion Laboratory. The post highlights a specific paper of hers where she argues that instead of aiming for incremental abstract improvements in machine learning processes, we should be focused on attaining results that translate into a measurable impact for society at large. More detail is provided about what that means, the author plays a little devil's advocate, and the post also includes a link to Wagstaff's paper for those that would like to read more about this.

Principles of ML Benchmarking

This is a post on the blog about how to benchmark machine learning algorithms. The post is structured as a thought exercise where the author starts by thinking about the purpose of benchmarking, why we should do it, and what our goals should be. From that point, he is able to formulate a set of guidelines for benchmarking that are very logical. The post lists each of the guiding principles along with some steps that can be taken to make sure you are abiding by them.

A List of R Packages, By Popularity

Our last article this week is a post on the Revolution Analyitcs blog that lists the top R packages in order of popularity. Some of the most popular packages include plyr, digest, ggplot2, and colorspace. Check out the list, see where your favorite packages rank, and potentially discover some useful packages you didn't know about!

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups