
Hadoop for Data Science: A Data Science MD Recap

On October 9th, Data Science MD welcomed Dr. Donald Miner as its speaker to talk about doing data science work and how the Hadoop framework can help. To start the presentation, Don was very clear about one thing: Hadoop is bad at a lot of things. It is not meant to be a panacea for every problem a data scientist will face.

With that in mind, Don spoke about the benefits that Hadoop offers data scientists. Hadoop is a great tool for data exploration. It can easily handle filtering, sampling and anti-filtering (summarization) tasks. When speaking about these concepts, Don described the benefits of each and included some anecdotes that helped to show real-world value. He also spoke about data cleanliness in a very Baz Luhrmann "Wear Sunscreen" sort of way, offering that as his biggest piece of advice.
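To make the three exploration tasks concrete, here is a minimal local sketch in plain Python. It is not Hadoop code and not taken from the talk: the records, field layout, and function names are all hypothetical, and the grouping step only imitates the map-then-reduce shape that a Hadoop job would distribute across a cluster.

```python
import random
from collections import defaultdict

# Hypothetical log records (user, action, amount); invented for illustration.
RECORDS = [
    ("alice", "purchase", 30.0),
    ("bob", "view", 0.0),
    ("alice", "view", 0.0),
    ("carol", "purchase", 12.5),
    ("bob", "purchase", 7.5),
]


def filter_records(records, action):
    """Filtering: keep only the records whose action matches."""
    return [r for r in records if r[1] == action]


def sample_records(records, fraction, seed=0):
    """Sampling: keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


def summarize(records):
    """Summarization: group by user (map) and total the amounts (reduce)."""
    totals = defaultdict(float)
    for user, _action, amount in records:
        totals[user] += amount
    return dict(totals)


if __name__ == "__main__":
    purchases = filter_records(RECORDS, "purchase")
    print(summarize(purchases))
    print(sample_records(RECORDS, 0.5))
```

Each function touches one record at a time and keeps no cross-record state beyond a per-key total, which is the property that lets this kind of work spread across many machines.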

Don then transitioned to the more traditional data science problems of classification (including NLP) and recommender systems.

The talk was very well received by DSMD members. If you missed it, check out the video.

Our next event will be November 20th, 2013 at Loyola University Maryland Graduate Center, starting at 6:30 PM. We will be digging deeper into the daily lives of three data scientists. We hope you will join us!

The State of Recommender Technology

Reblogged with permission from Cobrain.

[Image: social network graph]

So let’s start with the big idea that brings us all here: recommendation engines. If you are reading this, you have probably already overcome the mental hurdle of the massive design and implementation challenge that recommendation engines represent; otherwise I can’t imagine why you would have signed up! Or perhaps you don’t know what a massive design and implementation challenge recommendation engines represent. Either way, you’re in the right place: this post is an introduction to the state of the technology of recommendation systems.

Well, sort of. Here is a working state of the technology: Academia has created a series of novel machine learning and predictive algorithms that would allow scarily accurate trend analysis, recommendations, and predictions given the right, unbiased supervised training sets of sufficient magnitude. Commercial applications in very specific domains have leveraged these insights and extremely large data sets to create interesting results in the release phase of applications, but have found that the quality of these predictions decreases rapidly over time. Companies with even larger data sets that have tackled other algorithmic challenges involving supervised training sets (Google) have avoided current recommender systems because of their domain specificity, and have yet to find a generic enough application.

To sum up:

Recommendation engines are really, really hard, and you need a whole heckuva lot of data to make them work.

Now go build one.

Don’t despair though! If it wasn’t hard, everyone would be doing it! We’re here precisely because we want to leverage existing techniques on interesting and novel data sets, but also to continue to push forward the state of the technology. In the process we will probably learn a lot and hopefully also provide a meaningful experience for our users. But before we get into that, let’s talk more generically about the current generation of recommender systems.

Who Does it Well?

The current big boys in the recommendation space are Amazon, Netflix, Hunch (now owned by eBay), Pandora, and Goodreads. I strongly encourage you to understand how these guys operate and what they do to create domain-specific recommendations. For example, the domains of Goodreads, Netflix, and Pandora are books, movies, and music respectively. Recommending inside a particular domain allows you to leverage external knowledge resources that either solve scarcity issues or allow ontological reasoning that can add a more accurate layer on top of the pure graph analyses that typically happen with recommenders.

Amazon and Hunch seem to be more generic, but in fact they also have domain qualification. Amazon has the data set of all SKU-level transactions through its massive eCommerce site. Even so, Amazon has spent 10 years and a lot of money perfecting how to rank various member behaviors. Because the data is Amazon-specific, Amazon can leverage Amazon-only trends and purchasing behaviors, and they are still working on perfecting it. Hunch doesn’t have an item-specific domain, but rather a system-specific domain, using social and taste-making graphs to propose recommendations inside the context of social networks.

Speaking of Amazon’s decade-long effort to create a decent recommender with tons of data, I hope you’ve heard of the Netflix Prize. Netflix was so desperate for a better recommendation algorithm that they instituted an X Prize-like contest for a unique algorithm for recommending movies in particular. In fact, the test methodology for the Netflix Prize has become a standard for movie recommendations, and since 2009 (when the prize was awarded) other algorithm sets have actually achieved better results, most notably Filmaster.
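For context on that test methodology: Netflix Prize entries were scored by root mean squared error (RMSE) between predicted and actual ratings on a held-out set. The sketch below computes that metric. The function name and the sample ratings are hypothetical, and this is only an illustration of the metric, not the contest's scoring harness.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical star ratings: what a model guessed vs. what users gave.
    guessed = [3.5, 4.0, 2.0, 5.0]
    given = [4.0, 4.0, 1.0, 4.5]
    print(rmse(guessed, given))
```

Lower is better, and because errors are squared before averaging, a few badly wrong predictions hurt more than many slightly wrong ones.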

Given what these companies have tried to do, we can speak more generically of the state of the technology as follows: An “adequate” recommender system comprises the following items:

  1. An unbiased, non-scarce data set of sufficient size
  2. A suite of machine learning and predictive algorithms that traverse that data set
  3. Knowledge resources to apply transformations on the results of those algorithms

Pandora is a great example of this. They have undertaken an intensive project to detail a “music genome,” or an ontological breakdown of a sample of music. The genome itself is the knowledge resource. The analysis of the genomics of a piece of music, aggregated across a large number of pieces, is the unbiased non-scarce data set of sufficient size. Finally, the suite of recommendation algorithms that Pandora applies to these two sets generates ranked recommendations that are interesting.
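One way to picture how the three pieces fit together is a toy similarity ranking over trait vectors. This is emphatically not Pandora's algorithm, which the post does not describe. The trait names, song names, scores, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math

# Hypothetical "genome": each song scored on three invented traits
# (tempo, acoustic-ness, vocal prominence), each between 0 and 1.
GENOME = {
    "song_a": [0.9, 0.1, 0.8],
    "song_b": [0.8, 0.2, 0.7],
    "song_c": [0.1, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def most_similar(seed, genome):
    """Rank every other song by trait similarity to the seed song."""
    others = [name for name in genome if name != seed]
    return sorted(others, key=lambda name: -cosine(genome[seed], genome[name]))


if __name__ == "__main__":
    print(most_similar("song_a", GENOME))
```

Here the trait schema plays the role of the knowledge resource, the table of scored songs is the data set, and the ranking function stands in for the algorithm suite.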

Types of Recommenders

Without getting into a formal description of recommenders, I do want to list a few of the common types of recommendation systems that exist within domain-specific contexts. To do this, I need to describe the two basic classes of algorithms that power these systems:

  1. Collaborative Filtering: recommendations based on shared behavior with other people or things. E.g. if you and I bought a widget, and I also bought a sprocket, it is likely that you would also like a sprocket.
  2. Expert Adaptive or Generative Systems: recommendations based on shared traits of people or things, or on rules about how things interact with each other in a non-behavioral way. E.g. if you play football and live in Michigan, this particular pair of cleats is great in the snow.
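The contrast between the two classes can be sketched in a few lines of Python. Everything here is hypothetical: the purchase data echoes the widget and sprocket example, Jaccard overlap is just one of many possible similarity measures, and the single hard-coded rule stands in for what would be a much larger expert knowledge base.

```python
from collections import defaultdict

# Hypothetical purchase histories, echoing the widget/sprocket example.
PURCHASES = {
    "me": {"widget", "sprocket"},
    "you": {"widget"},
    "other": {"gizmo"},
}


def jaccard(a, b):
    """Overlap between two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def collaborative(user, purchases):
    """Collaborative filtering: suggest what behaviorally similar users bought."""
    own = purchases[user]
    scores = defaultdict(float)
    for other, items in purchases.items():
        if other == user:
            continue
        similarity = jaccard(own, items)
        if similarity == 0.0:
            continue  # no shared behavior, so no evidence to borrow
        for item in items - own:
            scores[item] += similarity
    return sorted(scores, key=lambda item: (-scores[item], item))


def rule_based(profile):
    """Expert-style rule: traits, not behavior, drive the suggestion."""
    if profile.get("sport") == "football" and profile.get("state") == "Michigan":
        return ["snow cleats"]
    return []


if __name__ == "__main__":
    print(collaborative("you", PURCHASES))
    print(rule_based({"sport": "football", "state": "Michigan"}))
```

Note that the collaborative function never looks at what an item is, only at who bought it, while the rule never looks at anyone's purchase history.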

In the world of recommenders, we are trying to create a semantic relationship between people and things. We can therefore discuss person-centric and item-centric approaches in each of these classes of algorithms, which gives us four main types of recommenders!

  1. Personalized Recommendations: A person-centric, expert adaptive model based on the person’s previous behavior or traits.
  2. Social/Collaborative Recommendations: A person-centric collaborative filtering model based on the past behavior of people similar to you, either because of shared traits or shared behavior. Note that the clustering of similar people can fall into either algorithm set, but the recommendations come from collaborative filtering.
  3. Ontological Reasoned Recommendations: An item-centric expert adaptive system that uses rules and knowledge mined with machine learning approaches to determine an inter-item relational model.
  4. Basket Recommendations: An item-centric collaborative filtering algorithm that uses inter-item relationships like “purchased together” to create recommendations.

Keep in mind, however, that these types of recommenders and classes are very loose and there is a lot of overlap!
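As one illustration of the item-centric side, the “purchased together” relationship behind basket recommendations can be approximated by counting how often pairs of items share a basket. The baskets and function names below are hypothetical, and raw co-occurrence counts are a deliberately naive scoring choice; a real system would likely normalize for item popularity.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
BASKETS = [
    {"widget", "sprocket"},
    {"widget", "sprocket", "gizmo"},
    {"widget", "gizmo"},
    {"sprocket"},
]


def co_occurrence(baskets):
    """Count, for every ordered item pair, how many baskets contain both."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def purchased_together(item, baskets, top_n=3):
    """Items most often bought with `item`; ties broken alphabetically."""
    counts = co_occurrence(baskets)
    related = {b: n for (a, b), n in counts.items() if a == item}
    return sorted(related, key=lambda other: (-related[other], other))[:top_n]


if __name__ == "__main__":
    print(purchased_together("sprocket", BASKETS))
```

Nothing about the shopper appears in this computation, which is what makes it item-centric: the same suggestions come back for anyone looking at the same item.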


Now that large-scale search has been dramatically improved and artificial intelligence knowledge bases are being constructed with a reasonable degree of accuracy, it is generally considered that the next step in true AI will be effective trend and prediction analysis. Methodologies to deal with Big Data have evolved to make this possible, and many large companies are rushing towards predictive systems with widely varying degrees of success. Recent approaches have revealed that near-time, large data, domain-specific efforts yield interesting, if not truly predictive, results.

The overwhelming challenge is not just in engineering architectures that traverse graphs extremely well (see the picture at the top of this post), but also in finding a unique combination of data, algorithms, and knowledge that will give our applications a chance to provide truly scary, inspiring results to our users. Even though this might be a challenge, there are four very promising approaches that we can leverage within our own categories.

Stay tuned for more on this topic soon!