machine learning

TensorFlow's DC Introduction

Hello, TensorFlow!, a post introducing the basic workings of the TensorFlow deep learning framework and now up in O'Reilly's Data, AI, and Learning sections, is a product of the local data community.

Aaron Schumacher, one of the Data Science DC organizers and an employee of Arlington-based Deep Learning Analytics, wrote the article with the support of many local reviewers, including feedback from members of the DC Machine Learning Journal Club.

Aaron will be giving a talk on the material of Hello, TensorFlow! on Wednesday June 29 as part of the Deep Dive into TensorFlow meetup to be hosted at Sapient in Arlington. It should be a great opportunity to explore and discuss this new and exciting tool!

Supervised Machine Learning with R Workshop on April 30th

Data Community DC and District Data Labs are hosting a Supervised Machine Learning with R workshop on Saturday April 30th. Come out and learn about R's capabilities for regression and classification, how to perform inference with these models, and how to use out-of-sample evaluation methods for your models!

Deep Learning Inspires Deep Thinking

This is a guest post by Mary Galvin, founder and managing principal at AIC. Mary provides technical consulting services to clients including LexisNexis’ HPCC Systems team. The HPCC is an open source, massively parallel processing computing platform that solves Big Data problems.

Data Science DC hosted a packed house at the Artisphere on Monday evening, thanks to the efforts of organizers Harlan Harris, Sean Gonzalez, and several others who helped plan and coordinate the event. Michael Burke, Jr., Arlington County Business Development Manager, provided opening remarks and emphasized Arlington’s commitment to serving local innovators and entrepreneurs. Michael subsequently introduced Sanju Bansal, a former MicroStrategy founder and executive who presently serves as the CEO of an emerging, Arlington-based start-up, Hunch Analytics. Sanju energized the audience by providing concrete examples of data science’s applicability to business; this was no better illustrated than by the $930 million acquisition of Climate Corp. roughly six months ago.

Michael, Sanju, and the rest of the Data Science DC team helped set the stage for a phenomenal presentation put on by John Kaufhold, Managing Partner and Data Scientist at Deep Learning Analytics. John started his presentation by asking the audience for a show of hands on two items: 1) whether anyone was familiar with deep learning, and 2) of those who said yes to #1, whether they could explain what deep learning meant to a fellow data scientist. Of the roughly 240 attendees present, most of the hands raised for question #1 dropped when John posed question #2.

I’ll be the first to admit that I was unable to raise my hand for either of John’s introductory questions. The fact that I was at least somewhat knowledgeable about the broader machine learning topic helped put my mind at ease, thanks to prior experience working with statistical machine translation, entity extraction, and entity resolution engines. That said, I still entered John’s talk fully prepared to brace myself for the ‘deep’ learning curve that lay ahead. Although I’m still trying to decompress from everything that was covered – it being less than a week since the event took place – I’d summarize the key takeaways from the densely packed, intellectually stimulating, 70+ minute session as follows:

  1. Machine learning’s dirty work: labelling and feature engineering. John introduced his topic by using examples from image and speech recognition to illustrate two mandatory (and often less-than-desirable) undertakings in machine learning: labelling and feature engineering. In the case of image recognition, say you wanted to determine whether a photo, ‘x’, contained an image of a cat, ‘y’ (i.e., p(y|x)). This would typically involve taking a sizable database of images and manually labelling which subset of those images were cats. The human-labeled images would then serve as a body of knowledge upon which features representative of those cats would be generated, as required by the feature engineering step in the machine learning process. John emphasized the laborious, expensive, and mundane nature of feature engineering, using his own experiences in medical imaging to prove his point.

    That said, various machine learning algorithms could use the fruits of the labelling and feature engineering labors to discern a cat within any photo – not just those cats previously observed by the system. Although there’s no getting around machine learning’s dirty work to achieve these results, the emergence of deep learning has helped to lessen it.

  2. Machine Learning’s ‘Deep’ Bench. I entered John’s presentation knowing a handful of machine learning algorithms but left realizing my knowledge had barely scratched the surface. Cornell University’s machine learning benchmarking tests can serve as a good reference point for determining which algorithm to use, provided the results are taken into account with the wider, ‘No Free Lunch Theorem’ consideration that even the ‘best’ algorithm has the potential to perform poorly on a subclass of problems.

    Given machine learning’s ‘deep’ bench, the neural network might have been easy to overlook just 10 years ago. Not only did it place 10th in Cornell’s 2004 benchmarking test, but John enlightened us to its fair share of limitations: inability to learn p(x), inefficiencies with more than 3 layers, overfitting, and relatively slow performance.

  3. The Restricted Boltzmann Machine’s (RBM’s) revival of the neural network. The year 2006 witnessed a breakthrough in machine learning, thanks to the efforts of an academic triumvirate consisting of Geoff Hinton, Yann LeCun, and Yoshua Bengio. I’m not even going to pretend I understand the details, but will just say that their application of the Restricted Boltzmann Machine (RBM) to neural networks has played a major role in eradicating the neural network’s limitations outlined in #2 above. Take, for example, ‘inability to learn p(x)’. Going back to the cat example in #1, what this essentially states is that before the triumvirate’s discovery, the neural net was incapable of using an existing set of cat images to draw a new image of a cat. Figuratively speaking, not only can neural nets now draw cats, but they can do so with impressive time metrics thanks to the emergence of the GPU. Stanford, for example, was able to process 14 terabytes of images in just 3 hours through overlaying deep learning algorithms on top of a GPU-centric computer architecture. What’s even better? The fact that many implementations of the deep learning algorithm are openly available under the BSD licensing agreement.

  4. Deep learning’s astonishing results. Deep learning has experienced an explosive amount of success in a relatively small amount of time. Not only have several international image recognition contests been recently won by those who used deep learning, but technology powerhouses such as Google, Facebook, and Netflix are investing heavily in the algorithm’s adoption. For example, deep learning triumvirate member Geoff Hinton was hired by Google in 2013 to help the company make sense of their massive amounts of data and to optimize existing products that use machine learning techniques. Fellow deep learning triumvirate member Yann LeCun was hired by Facebook, also in 2013, to help integrate deep learning technologies into the company’s IT systems.

As for all the hype surrounding deep learning, John concluded his presentation by suggesting ‘cautious optimism in results, without reckless assertions about the future’. Although it would be careless to claim that deep learning has cured disease, for example, one thing most certainly is for sure: deep learning has inspired deep thinking throughout the DC metropolitan area.

As to where deep learning has left our furry feline friends, the attached YouTube video will further explain….

(created by an anonymous audience member following the presentation)

You can see John Kaufhold's slides from this event here.

Ensemble Learning Reading List

Tuesday's Data Science DC Meetup features GMU graduate student Jay Hyer's introduction to Ensemble Learning, a core set of Machine Learning techniques. Here are Jay's suggestions for readings and resources related to the topic. Attend the Meetup, and follow Jay on Twitter at @aDataHead! Also note that all images contain Amazon Affiliate links and will result in DC2 getting a small percentage of the proceeds should you purchase the book. Thanks for the support!

L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, Boca Raton, FL, 1984.

This book does not cover ensemble methods, but is the book that introduced classification and regression trees (CART), which is the basis of Random Forests. Classification trees are also the basis of the AdaBoost algorithm. CART methods are an important tool for a data scientist to have in their skill set.

L. Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.

This is the article that started it all.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd ed. Springer, New York, NY, 2009.

This book is light on application and heavy on theory. Nevertheless, chapters 10, 15 & 16 give very thorough coverage to boosting, Random Forests and ensemble learning, respectively. A free PDF version of the book is available on Tibshirani’s website.

G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, New York, NY, 2013.

As the name and co-authors imply, this is an introductory version of the previous book in this list. Chapter 8 covers bagging, Random Forests, and boosting.

Y. Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.

This is the article that introduced the AdaBoost algorithm.

G. Seni, and J. Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool Publishers, USA, 2010.

This is a good book with great illustrations and graphs. There is a lot of R code too!

Z.H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 2012.

This is an excellent book that covers ensemble learning from A to Z and is well suited for anyone from an eager beginner to a critical expert.

2013 September DSMD Event: Teaching Data Science to the Masses

Stats-vs-data-science

For Data Science MD's September meetup, we were fortunate to have the talented and passionate Dr. Jeff Leek speak about his experiences teaching data science through the online learning platform Coursera. This was also a unique event for DSMD itself because it was the first meetup to feature only one speaker. Having one speaker talk for a whole hour can be a disaster if the speaker is unable to keep the attention of the audience. However, Dr. Leek is a dynamic and engaging speaker and had no problem keeping the attention of everyone in the room, including a couple of middle school students.

For those of you who are not familiar with Dr. Leek, he is a biostatistician at Johns Hopkins University as well as an instructor in the JHU biostatistics program. His biostatistics work typically entails analyzing sequenced human genome data to provide insights to doctors and patients in the form of raw data and advanced visualizations. However, when he is not revolutionizing the medical world or teaching the great biostatisticians of tomorrow at JHU, you can find him teaching his course on Coursera or adding new content to his blog, Simply Statistics.

Now, on to the talk. Johns Hopkins and specifically Dr. Leek got involved in teaching a Coursera course because they have constantly been looking for ways to improve learning for their students. They had been "flipping the classroom" by taking lectures and posting them to YouTube so that students could review the lecture material before class and then use the classroom time to dig deeper into specific topics. Because online videos are such a vital component of Massive Open Online Courses (MOOCs), it is no surprise that they took the next leap.

But just in case you think that JHU and Dr. Leek are new to this whole data science "thing," check out their Kaggle team's results for the Heritage Health Prize.

jhu-kaggle-team

Even though their team fell a few places when run on the private data, they still had a very impressive showing considering there were 1,358 teams that entered and over 20,000 entries. But what exactly does data science mean to Dr. Leek? Check out his expanded components of data science chart, which differs from similar charts by other data scientists by also showing the root disciplines of each component.

expanded-fusion-of-data-science

But what does the course look like?

course-setup

He covers topics such as types of analyses, how to organize a data analysis, and data munging, as well as others like:

concepts-machine-learning

statistics-concept

One of the interesting things to note though is that he also shows examples of poor data analysis attempts. There is a core problem with the statistics example from above (pointed out by high school students). Below is an example of another:

concepts-confounding

And this course, in addition to two other courses taught by other JHU faculty, Computing for Data Analysis and Mathematical Biostatistics Bootcamp, has had a very positive response.

jhu-course-enrollment

But how do you teach that many people effectively? That is where the power of Coursera comes in; JHU could have chosen other providers like edX or Udacity but decided to go with Coursera. The videos make it easy to convey knowledge, and message boards provide a mechanism to ask questions. Dr. Leek even had students answering questions for other students so that all he had to do was validate the responses. But he also pointed out that his class' message board was just like all other message boards and followed the 1/98/1 rule: 1% of people respond in a mean way and are unhelpful, 1% of people are very nice and very helpful, and the other 98% don't care and don't really respond.

One unique aspect of Coursera is that it helps scale to tens of thousands of students by using peer/student grading. Each person grades 4 different assignments, so everyone submits one answer and grades 4 others. The final score for each student is the median of the four scores from the other students. The rubric used in Dr. Leek’s class is below.

grading-rubric
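As a rough sketch of that scoring rule (with made-up peer scores, using only the Python standard library):

from statistics import median

# Hypothetical peer scores: each submission is graded by four other students.
peer_scores = {
    "student_a": [9, 10, 8, 9],
    "student_b": [6, 7, 3, 7],
}

# The final grade for each student is the median of the four peer scores,
# which blunts the effect of a single overly harsh or generous grader.
final_grades = {student: median(scores) for student, scores in peer_scores.items()}
print(final_grades)  # {'student_a': 9.0, 'student_b': 6.5}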

The result of this grading policy, based on Dr. Leek’s analysis, is that good students received good grades, poor students received poor grades, and middle students' grades fluctuated a fair amount. So the policy mostly works, but there is still room for improvement.

But why do Johns Hopkins and Dr. Leek even support this model of learning? They do have full time jobs that involve teaching, after all. Well, besides being huge supporters of open source technology and open learning, they also see many other reasons for supporting this paradigm.

partial-motivation

Check out the video for the many other reasons why JHU further supports this paradigm. And, while you are at it, see if you can figure out if the x and y axes are related in some way. This was our data science/statistics problem for the evening. The answer can also be found in the video.

stat-challenge

http://www.youtube.com/playlist?list=PLgqwinaq-u-NNCAp3nGLf93dh44TVJ0vZ

We also got a sneak peek at a new tool/component that integrates directly into R - swirl. Look for a meetup or blog post about this tool in the future.

swirl

Our next meetup is on October 9th at Advertising.com in Baltimore beginning at 6:30PM. We will have Don Miner speak about using Hadoop for Data Science. If you can make it, come out and join us.

Weekly Round-Up: Data Analysis Tools, M2M, Machine Learning, and Naming Babies

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data analysis tools to naming babies. In this week's round-up:

  • Data Analysis Tools Target Non-experts
  • How M2M Data Will Dominate the Big Data Era
  • What Hackers Should Know About Machine Learning
  • Knowledge Engineering Applied to Baby Names

Data Analysis Tools Target Non-experts

Our first piece this week is an O'Reilly Strata article about some of the data analysis tools that are coming to market and are aimed at providing business users with the analytics they need to make decisions. The article highlights several tools from a variety of companies and categorizes them into three different categories according to what they help you do. The article also includes links to all the companies' websites so that, if you're anything like me, you can check out every single one of them.

How M2M Data Will Dominate the Big Data Era

The Internet of Things is getting a lot of attention these days, partly due to the amount of data that gets produced when one connected device communicates with another connected device. This is known as Machine-to-Machine data (M2M), and this Smart Data Collective article describes where a lot of this data may come from and how much data can potentially be generated.

What Hackers Should Know About Machine Learning

Our third piece is a Fast Company interview with Drew Conway, the author of the must-own book Machine Learning for Hackers. In the interview Drew answers questions about why developers should learn machine learning, the biggest knowledge gaps they need to overcome, and the differences between a machine learning project and a development project. (Editor's Note, the image to the left links to Amazon where if you buy the book we get a small cut of the proceeds. Buy enough books through this link, and we retire to an island.)

Knowledge Engineering Applied to Baby Names

Our final piece this week is a blog post about a company called Nameling, which is in the midst of holding a contest to improve the algorithms behind its baby name recommendation engine. Coming up with a good name for your baby is very important to parents, as the consequences of choosing a bad one almost certainly result in ridicule and tears. It should be interesting to see the results of the contest as well as what kinds of names the recommendation engine spits out.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: Machine Learning, DIY Data Scientists, Games, and Helping Couples Conceive

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from machine learning to helping couples conceive. In this week's round-up:

  • Jeff Hawkins: Where Open Source and Machine Learning Meet Big Data
  • The Rise Of The DIY Data Scientist
  • Why Games Matter to Artificial Intelligence
  • Three Questions for Max Levchin About His New Startup

Jeff Hawkins: Where Open Source and Machine Learning meet Big Data

Our first piece this week is an InfoWorld article about Jeff Hawkins, the machine learning work that he and his company have been doing, and the open source project they've recently released on GitHub. The project's name is the Numenta Platform for Intelligent Computing (NuPIC), and its goal is to allow others to embed machine intelligence into their own systems. The article has a short interview with Jeff and a link to the GitHub page where the project resides.

The Rise Of The DIY Data Scientist

This is an interesting Fast Company article about how Kaggle competition winners tend to be self-taught. The author of the article interviews Kaggle's chief scientist Jeremy Howard about this phenomenon and other interesting findings about data scientists derived from Kaggle's competitions. Some of the questions inquire about where the winners are from, how they learned data science, and what machine learning algorithms they use.

Why Games Matter to Artificial Intelligence

This blog post on the IBM Research blog is an interview with Dr. Gerald Tesauro about the significance of games in the Artificial Intelligence field. Dr. Tesauro was the IBM research scientist who taught Watson how to play Jeopardy. In the interview, he explains how games tend to be an ideal training ground for machines because they tend to simplify real life. He goes on to answer questions about how that prepares machines for transitioning to other real-world problems, what he's currently working on, what Watson is doing these days, and where else machine learning can be used.

Three Questions for Max Levchin About His New Startup

Our final piece this week is an MIT Technology Review article about PayPal co-founder Max Levchin's new startup called Glow. A lot of people are having children later in life these days and one downside of this is that many couples have trouble trying to conceive. Levchin has developed an iPhone app that uses data to help couples identify the optimal time for conception. In this brief interview, Levchin talks about what they are doing, why, and the degree of accuracy they hope to achieve.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

The State of Recommender Technology

Reblogged with permission from Cobrain.

socialnetwork_graph

So let’s start with the big idea that is the reason that we are all here: recommendation engines. If you are reading this, you have probably already overcome the mental hurdle of the massive design and implementation challenge that recommendation engines represent; otherwise I can’t imagine why you would have signed up! Or perhaps you don’t know what a massive design and implementation challenge recommendation engines represent. Either way, you’re in the right place -- this post is an introduction to the state of the technology of recommendation systems.

Well, sort of -- here is a working summary of the state of the technology: Academia has created a series of novel machine learning and predictive algorithms that would allow scarily accurate trend analysis, recommendations, and predictions given the right, unbiased supervised training sets of sufficient magnitude. Commercial applications in very specific domains have leveraged these insights and extremely large data sets to create interesting results in the release phase of applications, but have found that over time the quality of these predictions decreases rapidly. Companies with even larger data sets that have tackled other algorithmic challenges involving supervised training sets (Google) have avoided current recommender systems because of their domain specificity, and have yet to find a generic enough application.

To sum up:

Recommendation Engines are really really hard, and you need a whole heckuva lot of data to make them work.

Now go build one.

Don’t despair though! If it wasn’t hard, everyone would be doing it! We’re here precisely because we want to leverage existing techniques on interesting and novel data sets, but also to continue to push forward the state of the technology. In the process we will probably learn a lot and hopefully also provide a meaningful experience for our users. But before we get into that, let’s talk more generically about the current generation of recommender systems.

Who Does it Well?

The current big boys in the recommendation space are Amazon, Netflix, Hunch (now owned by eBay), Pandora, and Goodreads. I strongly encourage you to understand how these guys operate and what they do to create domain-specific recommendations. For example, the domains of Goodreads, Netflix, and Pandora are books, movies, and music respectively. Recommending inside a particular domain allows you to leverage external knowledge resources that either solve scarcity issues or allow ontological reasoning that can add a more accurate layer on top of the pure graph analyses that typically happen with recommenders.

Amazon and Hunch seem to be more generic, but in fact they also have domain qualification. Amazon has the data set of all SKU-level transactions through its massive eCommerce site. Even so, Amazon has spent 10 years and a lot of money perfecting how to rank various member behaviors. Because it is Amazon-specific, Amazon can leverage Amazon-only trends and purchasing behaviors, and they are still working on perfecting it. Hunch doesn’t have an item-specific domain, but rather a system-specific domain, using social and taste-making graphs to propose recommendations inside the context of social networks.

Speaking of Amazon’s decade-long effort to create a decent recommender with tons of data, I hope you’ve heard of the Netflix Prize. Netflix was so desperate for a better recommendation algorithm that they instituted an X-Prize-like contest for a unique algorithm for recommending movies in particular. In fact, the test methodology for the Netflix Prize has become a standard for movie recommendations, and since 2009 (when the prize was awarded) other algorithm sets have actually achieved better results, most notably Filmaster.

Given what these companies have tried to do, we can more generically speak of the state of the technology as follows: an “adequate” recommender system comprises the following items:

  1. An unbiased, non-scarce data set of sufficient size
  2. A suite of machine learning and predictive algorithms that traverse that data set
  3. Knowledge resources to apply transformations on the results of those algorithms

Pandora is a great example of this. They have created an intensive project to detail a “music genome,” an ontological breakdown of a piece of music. The genome itself is the knowledge resource. The analysis of the genomics of a piece of music, aggregated across a large number of pieces, is the unbiased, non-scarce data set of sufficient size. Finally, the suite of recommendation algorithms that Pandora applies to these two sets generates ranked recommendations that are interesting.

Types of Recommenders

Without getting into a formal description of recommenders, I do want to list a few of the common types of recommendation systems that exist within domain specific contexts. To do this, I need to describe the two basic classes of algorithms that power these systems:

  1. Collaborative Filtering: recommendations based on shared behavior with other people or things. E.g. if you and I bought a widget, and I also bought a sprocket, it is likely that you would also like a sprocket.
  2. Expert Adaptive or Generative Systems: recommendations based on shared traits of people or things or rules about how things interact with each other in a non-behavior way. E.g. if you play football and live in Michigan, this particular pair of cleats is great in the snow.

In the world of recommenders, we are trying to create a semantic relationship between people and things, therefore we can discuss person-centric and item-centric approaches in each of these classes of algorithms; and that gives us four main types of recommenders!

  1. Personalized Recommendations - a person-centric, expert adaptive model based on the person’s previous behavior or traits.
  2. Social/Collaborative Recommendations - a person-centric collaborative filtering model based on the past behavior of people similar to you, either because of shared traits or shared behavior. Note that the clustering of similar people can fall into either algorithm set, but the recommendations come from collaborative filtering.
  3. Ontological Reasoned Recommendations - an item-centric expert adaptive system that uses rules and knowledge mined with machine learning approaches to determine an inter-item relational model.
  4. Basket Recommendations - an item-centric collaborative filtering algorithm that uses inter-item relationships like “purchased together” to create recommendations.

Keep in mind, however, that these types of recommenders and classes are very loose and there is a lot of overlap!
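To make the collaborative filtering class above concrete, here is a minimal user-based sketch in Python. The ratings are invented for illustration, and a production system would need far more data and a far better similarity and weighting scheme than this:

from math import sqrt

# Toy ratings: user -> {item: rating}. Real systems need orders of magnitude more data.
ratings = {
    "alice": {"widget": 5, "sprocket": 4, "gear": 1},
    "bob":   {"widget": 5, "sprocket": 5},
    "carol": {"widget": 1, "gear": 5},
}

def cosine_similarity(u, v):
    # Cosine similarity over the items two users have both rated.
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    # Similarity-weighted average of other users' ratings for the item.
    scores = [(cosine_similarity(user, other), r[item])
              for other, r in ratings.items()
              if other != user and item in r]
    total = sum(sim for sim, _ in scores)
    return sum(sim * rating for sim, rating in scores) / total if total else None

print(predict("bob", "gear"))  # roughly 3.0 with these toy numbers

Even this toy shows the scarcity problem mentioned above: with only one co-rated item the similarity estimate is close to meaningless, which is exactly why an unbiased, non-scarce data set of sufficient size comes first on the list.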

Conclusion

Now that large scale search has been dramatically improved and artificial intelligence knowledge bases are being constructed with a reasonable degree of accuracy, it is generally considered that the next step in true AI will be effective trend and prediction analysis. Methodologies to deal with Big Data have evolved to make this possible, and many large companies are rushing towards predictive systems with a wide range of success. Recent approaches have revealed that near-time, large data, domain-specific efforts yield interesting results, if not truly predictive.

The overwhelming challenge is not just in engineering architectures that traverse graphs extremely well (see the picture at the top of this post), but also in finding a unique combination of data, algorithms, and knowledge that will give our applications a chance to provide truly scary, inspiring results to our users. Even though this might be a challenge, there are four very promising approaches that we can leverage within our own categories.

Stay tuned for more on this topic soon!

Weekly Round-Up: Computer Vision, Machine Learning, Benchmarking, and R Packages

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from computer vision to popular R packages. In this week's round-up:

  • Google Explains How AI Photo Search Works
  • Matter Over Mind in Machine Learning
  • Principles of ML Benchmarking
  • A List of R Packages, By Popularity

Google Explains How AI Photo Search Works

This is an interesting blog post about how Google recently enhanced their image search functionality using computer vision and machine learning algorithms. The post describes in layman's terms how the algorithms work and how they are able to classify pictures. It also includes a link to Google's research blog, where they made the original announcement.

Matter Over Mind in Machine Learning

This is a post on the BigML blog which talks about the work of Dr. Kiri Wagstaff from NASA's Jet Propulsion Laboratory. The post highlights a specific paper of hers where she argues that instead of aiming for incremental abstract improvements in machine learning processes, we should be focused on attaining results that translate into a measurable impact for society at large. More detail is provided about what that means, the author plays a little devil's advocate, and the post also includes a link to Wagstaff's paper for those that would like to read more about this.

Principles of ML Benchmarking

This is a post on the Wise.io blog about how to benchmark machine learning algorithms. The post is structured as a thought exercise where the author starts by thinking about the purpose of benchmarking, why we should do it, and what our goals should be. From that point, he is able to formulate a set of guidelines for benchmarking that are very logical. The post lists each of the guiding principles along with some steps that can be taken to make sure you are abiding by them.

A List of R Packages, By Popularity

Our last article this week is a post on the Revolution Analytics blog that lists the top R packages in order of popularity. Some of the most popular packages include plyr, digest, ggplot2, and colorspace. Check out the list, see where your favorite packages rank, and potentially discover some useful packages you didn't know about!

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

PyAutoDiff: automatic differentiation for NumPy

We are excited to have a guest post discussing a new tool that is freely available for the Python community. Welcome, Jeremiah Lowin, the Chief Scientist of the Lowin Data Company, to the growing pool of Data Community DC bloggers. We are very excited to announce an early release of PyAutoDiff, a library that allows automatic differentiation in NumPy, among other useful features. A quickstart guide is available here.

Autodiff can compute gradients (or derivatives) with a simple decorator:

from autodiff import gradient

def f(x):
    return x ** 2

@gradient
def g(x):
    return x ** 2

print f(10.0) # 100.0
print g(10.0) # 20.0

More broadly, autodiff leverages Theano's powerful symbolic engine to compile NumPy functions, allowing features like mathematical optimization, GPU acceleration, and of course automatic differentiation. Autodiff is compatible with any NumPy operation that has a Theano equivalent and fully supports multidimensional arrays. It also gracefully handles many Python constructs (though users should be very careful with control flow tools like if/else and loops!).

In addition to the @gradient decorator, users can apply @function to compile functions without altering their return values. Compiled functions can automatically take advantage of Theano's optimizations and available GPUs, though users should note that GPU computations are only supported for float32 dtypes. Other decorators, classes, and high-level functions are available; see the docs for more information.
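For illustration, here is a minimal sketch of that decorator; the import is assumed to mirror the gradient example above, so check the docs and quickstart guide for the exact interface:

from autodiff import function  # assumed import, mirroring `from autodiff import gradient`

# @function compiles the wrapped NumPy code with Theano but, unlike @gradient,
# leaves the return value unchanged.
@function
def square(x):
    return x ** 2

print square(10.0) # 100.0, computed by the compiled function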

It is also possible for autodiff to trace NumPy objects through multiple functions. It can then compile symbolic representations of all of the traced operations (or their gradients) -- even with respect to objects that were purely local to the scope of the function(s).

import numpy as np
from autodiff import Symbolic, tag

# -- a vanilla function
def f1(x):
    return x + 2

# -- a function referencing a global variable
y = np.random.random(10)
def f2(x):
    return x * y

# -- a function with a local variable
def f3(x):
    z = tag(np.ones(10), 'local_var')
    return (x + z) ** 2

# -- create a general symbolic tracer
x = np.random.random(10)
tracer = Symbolic()

# -- trace the three functions
out1 = tracer.trace(f1, x)
out2 = tracer.trace(f2, out1)
out3 = tracer.trace(f3, out2)

# -- compile a function representing f(x, y, z) = out3
new_fn = tracer.compile_function(inputs=[x, y, 'local_var'], 
                                 outputs=out3)

assert np.allclose(new_fn(x, y, np.ones(10)), f3(f2(f1(x))))

One of the original motivations for autodiff was working with SVMs that were defined purely in NumPy. The following example (also available at autodiff/examples/svm.py) fits an SVM to random data, using autodiff to compute parameter gradients for SciPy's L-BFGS-B solver:

import numpy as np
from autodiff.optimize import fmin_l_bfgs_b

rng = np.random.RandomState(1)

# -- create some fake data
x = rng.rand(10, 5)
y = 2 * (rng.rand(10) > 0.5) - 1
l2_regularization = 1e-4

# -- define the loss function
def loss_fn(weights, bias):
    margin = y * (np.dot(x, weights) + bias)
    loss = np.maximum(0, 1 - margin) ** 2
    l2_cost = 0.5 * l2_regularization * np.dot(weights, weights)
    loss = np.mean(loss) + l2_cost
    return loss

# -- call optimizer
w_0, b_0 = np.zeros(5), np.zeros(())
w, b = fmin_l_bfgs_b(loss_fn, init_args=(w_0, b_0))

final_loss = loss_fn(w, b)

assert np.allclose(final_loss, 0.7229)

Some members of the scientific community will recall that James Bergstra began the PyAutoDiff project a year ago in an attempt to unify NumPy's imperative style with Theano's functional syntax. James successfully demonstrated the project's utility, and this version builds out and on top of that foundation. Standing on the shoulders of giants, indeed!

Please note that autodiff remains under active development and features may change. The library has been performing well in internal testing, but we're sure that users will find new and interesting ways to break it. Please file any bugs you may find!

Why You Should Not Build a Recommendation Engine

One does not simply build an MVP with a recommendation engine

Recommendation engines are arguably one of the trendiest uses of data science in startups today. How many new apps have you heard of that claim to "learn your tastes"? However, recommendation engines are widely misunderstood, both in terms of what is involved in building one and in terms of what problems they actually solve. A true recommender system involves some fairly hefty data science -- it's not something you can build by simply installing a plugin without writing code. With the exception of very rare cases, it is not the killer feature of your minimum viable product (MVP) that will make users flock to you -- especially since there are so many fake and poorly performing recommender systems out there.

A recommendation engine is a feature (not a product) that filters items by predicting how a user might rate them. It solves the problem of connecting your existing users with the right items in your massive inventory (i.e. tens of thousands to millions) of products or content. This means that if you don't have existing users and a massive inventory, a recommendation engine does not truly solve a problem for you. If I can view the entire inventory of your e-commerce store in just a few pages, I really don't need a recommendation system to help me discover products! And if your e-commerce store has no customers, who are you building a recommendation system for? It works for Netflix and Amazon because they have untold millions of titles and products and a large existing user base who are already there to stream movies or buy products. Presenting users with recommended movies and products increases usage and sales, but doesn't create either to begin with.

There are two basic approaches to building a recommendation system: the collaborative filtering method and the content-based approach. Collaborative filtering algorithms take user ratings or other user behavior and make recommendations based on what users with similar behavior liked or purchased. For example, a widely used technique in the Netflix prize was to use machine learning to build a model that predicts how a user would rate a film based solely on the giant sparse matrix of how 480,000 users rated 18,000 films (100 million data points in all). This approach has the advantage of not requiring an understanding of the content itself, but does require a significant amount of data, ideally millions of data points or more, on user behavior. The more data the better. With little or no data, you won't be able to make recommendations at all -- a pitfall of this approach known as the cold-start problem. This is why you cannot use this approach in a brand new MVP. 
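To make the ratings-matrix idea concrete, here is a minimal matrix factorization sketch in NumPy. The data is invented and the model is far simpler than the prize-winning ensembles; it just learns low-dimensional user and item vectors from observed ratings by stochastic gradient descent and predicts unobserved cells from their dot products:

import numpy as np

rng = np.random.RandomState(0)

# Toy observed ratings: (user, item, rating). The Netflix data had ~100 million of these.
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 4.5), (2, 1, 1.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2

U = 0.1 * rng.randn(n_users, k)  # latent user factors
V = 0.1 * rng.randn(n_items, k)  # latent item factors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in observed:
        err = r - U[u].dot(V[i])
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

# Predict a rating for a (user, item) pair that was never observed.
print(U[0].dot(V[2]))

The cold-start problem is visible even in the sketch: a brand new user or item has no observed ratings, so there is nothing to learn its vector from.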

The content-based approach requires deep knowledge of your massive inventory of products. Each item must be profiled based on its characteristics. For a very large inventory (the only type of inventory you need a recommender system for), this process must be automatic, which can prove difficult depending on the nature of the items. A user's tastes are then deduced based on either their ratings, behavior, or directly entering information about their preferences. The first pitfall of this approach is that an automated classification system could require significant algorithmic development and is likely not available as a commodity technical solution. Second, as with the collaborative filtering approach, the user needs to input information on their personal tastes, though not on the same scale. One advantage of the content-based approach is that it doesn't suffer from the cold-start problem -- even the first user can gain useful recommendations if the content is classified well. But the benefit that recommendations offer to the user must justify the effort required to offer input on personal tastes. That is, the recommendations must be excellent and the effort required to enter personal preferences must be minimal and ideally baked into the general usage. (Note that if your offering is an e-commerce store, this data entry amounts to adding a step to your funnel and could hurt sales more than it helps.) One product that has been successful with this approach is Pandora. Based on naming a single song or artist, Pandora can recommend songs that you will likely enjoy. This is because a single song title offers hundreds of points of data via the Music Genome Project. The effort required to classify every song in the Music Genome Project cannot be overstated -- it took 5 years to develop the algorithm and classify the inventory of music offered in the first launch of Pandora. Once again, this is not something you can do with a brand new MVP.

Pandora may be the only example of a successful business where the recommendation engine itself is the core product, not a feature layered onto a different core product. Unless you have the domain expertise, algorithm development skill, massive inventory, and frictionless user data entry design to build your vision of the Pandora for socks / cat toys / nail polish / etc., your recommendation system will not be the milkshake that brings all the boys to the yard. Instead, you should focus on building your core product, optimizing your e-commerce funnel, growing your user base, developing user loyalty, and growing your inventory. Then, maybe one day, when you are the next Netflix or Amazon, it will be worth it to add on a recommendation system to increase your existing usage and sales. In the meantime, you can drive serendipitous discovery simply by offering users a selection of most popular content or editor's picks.

Beyond Preprocessing - Weakly Inferred Meanings - Part 5

Congrats! This is the final post in our 6 part series! Just in case you have missed any parts, click through to the introduction, part 1, part 2, part 3, and part 4.

NLP of Big Data using NLTK and Hadoop31

After you have treebanks, then what? The answer is that syntactic guessing is not the final frontier of NLP; we must go beyond it to something more semantic. The idea is to determine the meaning of text in a machine-tractable way by creating a TMR, a text-meaning representation (or thematic meaning representation). This, however, is not a trivial task, and now you’re at the frontier of the science.

NLP of Big Data using NLTK and Hadoop32

Text Meaning Representations are language-independent representations of a language unit, and can be thought of as a series of connected frames that represent knowledge. TMRs allow us to do extremely deep querying of natural language, including the creation of knowledge bases and question and answer systems, and even allow for conversational agents. Unfortunately, they require extremely deep ontologies and knowledge to construct and can be extremely process-intensive, particularly in resolving ambiguity.

NLP of Big Data using NLTK and Hadoop33

We have created a system called WIMs -- Weakly Inferred Meanings -- that attempts to stand in the middle ground between no semantic computation at all and the extremely labor-intensive TMR. WIMs reduce the search space by using a limited but extremely important set of relations. These relations can be created using available knowledge -- WordNet has proved to be a valuable ontology for creating WIMs -- and they are extremely lightweight.

Even better, they’re open source!

NLP of Big Data using NLTK and Hadoop34

Both TMRs and WIMs are graph representations of content, and therefore any semantic computation involving these techniques will involve graph traversal. Although there are graph databases created on top of HDFS (particularly Titan on HBase), graph computation is not the strong point of MapReduce. Hadoop, unfortunately, can only get us so far.

NLP of Big Data using NLTK and Hadoop35

Hadoop for Preprocessing Language - Part 4

We are glad that you have stuck around for this long and, just in case you have missed any parts, click through to the introduction, part 1, part 2, and part 3.

NLP of Big Data using NLTK and Hadoop21

You might ask me, doesn’t Hadoop do text processing extremely well? After all, the first Hadoop jobs we learn are word count and inverted index!

NLP of Big Data using NLTK and Hadoop22

The answer is that NLP preprocessing techniques are more complicated than splitting on whitespace and punctuation, and different tasks require different kinds of tokenization (also called segmentation or chunking). Consider the following sentence:

“You're not going to the U.S.A. in that super-zeppelin, Dr. Stoddard?”

How do you split this as a standalone sentence? If you simply used punctuation, this would segment (sentence tokenization) into five sentences (“You’re not going to the U.”, “S.”, “A.”, “in that super-zeppelin, Dr.”, “Stoddard?”). Also, is “You’re” two tokens or a single token? What about punctuation? Is “Dr. Stoddard” one token or more? How about “super-zeppelin”? N-gram analysis and other syntactic tokenization will also probably require different token lengths that go beyond whitespace.

NLP of Big Data using NLTK and Hadoop23

So we require some more formal NLP mechanisms even for simple tokenization. However, I propose that Hadoop might be perfect for language preprocessing. A Hadoop job creates output in the file system, so each job can be considered an NLP preprocessing task. Moreover, in many other Big Data analytics, Hadoop is used this way: last-mile computations usually occur within 100GB of memory, so MapReduce jobs are used to perform calculations designed to transform data into something that is computable in that memory space. We will do the same thing with NLP, and transform our raw text as follows:

Raw Text → Tokenized Text → Tagged Text → Parsed Text → Treebanks

Namely, after we have tokenized our text depending on our requirements, splitting it into sentences, chunks, and tokens as required, we then want to understand the syntactic class of the tokens and tag them as such. Tagged text can then be structured into parses - a structured representation of the sentence. The final output, used for training our stochastic mechanisms and for going beyond to more semantic analyses, is a set of treebanks. Each of these tasks can be one or more MapReduce jobs.
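As a rough illustration of the first stages of that pipeline in NLTK (the tagger here is NLTK's default English tagger rather than a Brill tagger trained on your own domain, and the required models are assumed to have been downloaded already):

    import nltk
    # One-time model downloads; the exact package names depend on your NLTK version,
    # e.g. nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

    raw = "You're not going to the U.S.A. in that super-zeppelin, Dr. Stoddard?"

    sentences = nltk.sent_tokenize(raw)                  # Raw Text -> sentences
    tokens = [nltk.word_tokenize(s) for s in sentences]  # sentences -> tokens
    tagged = [nltk.pos_tag(t) for t in tokens]           # tokens -> tagged tokens

    print(sentences)   # the abbreviation-aware Punkt model keeps this as a single sentence
    print(tagged[0])   # e.g. [('You', 'PRP'), ("'re", 'VBP'), ('not', 'RB'), ...]

    # Parsing tagged text into treebank-style trees requires a grammar or a trained
    # parser (see the PCFG example in part 3); each stage here maps naturally onto
    # its own MapReduce job writing output back to HDFS.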

NLP of Big Data using NLTK and Hadoop27

NLTK comes with a few notable built-ins that make integrating your preprocessing with Hadoop easier (you’ll note all these methods are stochastic):

  • Punkt word and sentence tokenizer - uses an unsupervised training set to capture the beginnings of sentences and to recognize marks that do not actually terminate sentences (such as abbreviation periods). It doesn’t require its input to already be split into sentences.

  • Brill Tagger - a transformational rule-based tagger that does a first-pass tagging and then applies rules that were learned from a tagged training data set.

  • Viterbi Parser - a dynamic programming algorithm that uses a weighted grammar to fill in a most-likely-constituent table and very quickly comes up with the most likely parse.

NLP of Big Data using NLTK and Hadoop28

The end result after a series of MapReduce jobs (we had six) was a treebank -- a machine-tractable syntactic representation of language; it’s very important.

Python's Natural Language Toolkit (NLTK) and Hadoop - Part 3

Welcome back to part 3 of Ben's talk about Big Data and Natural Language Processing. (Click through to see the intro, part 1, and part 2).

NLP of Big Data using NLTK and Hadoop12

We chose NLTK (Natural Language Toolkit) particularly because it’s not Stanford. Stanford is kind of a magic black box, and it costs money to get a commercial license. NLTK is open source and it’s Python. But more importantly, NLTK is completely built around stochastic analysis techniques and comes with data sets and training mechanisms built in. Particularly because the magic and foo of Big Data with NLP requires using your own domain knowledge and data set, NLTK is extremely valuable from a leadership perspective! And anyway, it does come with out-of-the-box NLP - use the Viterbi parser with a trained PCFG (Probabilistic Context Free Grammar, also called a Weighted Grammar) from the Penn Treebank, and you’ll get excellent parses immediately.
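Here is a minimal, self-contained sketch of that idea with a hand-written toy grammar; a real setup would load a PCFG induced from the Penn Treebank instead (in NLTK 3 the grammar can be built with nltk.PCFG.fromstring, while older versions used nltk.parse_pcfg):

    import nltk

    # Toy probabilistic grammar standing in for one induced from the Penn Treebank.
    grammar = nltk.PCFG.fromstring("""
        S   -> NP VP        [1.0]
        NP  -> 'I'          [0.4] | Det N [0.6]
        Det -> 'that'       [1.0]
        N   -> 'zeppelin'   [1.0]
        VP  -> V NP         [1.0]
        V   -> 'saw'        [1.0]
    """)

    parser = nltk.ViterbiParser(grammar)
    for tree in parser.parse(['I', 'saw', 'that', 'zeppelin']):
        print(tree)  # the single most likely parse, annotated with its probability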

NLP of Big Data using NLTK and Hadoop13

Choosing Hadoop might seem obvious given that we’re talking about Data Science and particularly Big Data. But I do want to point out that the NLP tasks that we’re going to talk about right off the bat are embarrassingly parallel - meaning that they are extremely well suited to the MapReduce paradigm. If you consider the sentence the unit of natural language, then each sentence (at least to begin with) can be analyzed on its own, with little knowledge required about the processing of surrounding sentences.

NLP of Big Data using NLTK and Hadoop14

Combine that with the many flavors of Hadoop and the fact that you can get a cluster going in your closet for cheap -- it’s the right price for a startup!

NLP of Big Data using NLTK and Hadoop15

The glue that makes NLTK (Python) and Hadoop (Java) play nice is Hadoop Streaming. Hadoop Streaming allows you to create a mapper and a reducer with any executable, and expects that the executable will receive key-value pairs via stdin and output them via stdout. Just keep in mind that all the other Hadoopy-ness still exists, e.g. the FileInputFormat, HDFS, and job scheduling; all you get to replace is the mapper and reducer. But that is enough to include NLTK, so you’re off and running!

Here’s an example of a Mapper and Reducer to get you started doing token counts with NLTK (note that these aren’t word counts -- to computational linguists, words are language elements that have senses and therefore convey meaning. Instead, you’ll be counting tokens, the syntactic base for words in this example, and you might be surprised to find out what tokens are-- trust me, it isn’t as simple as splitting on whitespace and punctuation!).

mapper.py


    #!/usr/bin/env python

    import sys
    from nltk.tokenize import wordpunct_tokenize

    def read_input(file):
        for line in file:
            # split the line into tokens
            yield wordpunct_tokenize(line)

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_input(sys.stdin)
        for tokens in data:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial token count is 1
            for token in tokens:
                print '%s%s%d' % (token, separator, 1)

    if __name__ == "__main__":
        main()

reducer.py


    #!/usr/bin/env python

    from itertools import groupby
    from operator import itemgetter
    import sys

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_mapper_output(sys.stdin, separator=separator)
        # groupby groups multiple word-count pairs by word,
        # and creates an iterator that returns consecutive keys and their group:
        #   current_word - string containing a word (the key)
        #   group - iterator yielding all ["<current_word>", "<count>"] items
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for current_word, count in group)
                print "%s%s%d" % (current_word, separator, total_count)
            except ValueError:
                # count was not a number, so silently discard this item
                pass

    if __name__ == "__main__":
        main()

Running the Job


    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
        -file /home/hduser/reducer.py   -reducer /home/hduser/reducer.py \
        -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

Some interesting notes about using Hadoop Streaming relate to memory usage. NLP can sometimes be a memory-intensive task, as you have to load up training data to compute various aspects of your processing -- loading these things up can take minutes at the beginning of your processing. However, with Hadoop Streaming, only one interpreter per job is loaded, thus saving you from repeating that loading process. Similarly, you can use generators and other Python iteration techniques to carve through mountains of data very easily. There are some Python libraries, including dumbo, mrjob, and hadoopy, that can make all of this a bit easier.

The "Foo" of Big Data - Part 2

Welcome to Part 2 of this epic Big Data and Natural Language Processing perspective series. Here are the intro and part one if you missed either of them.

NLP of Big Data using NLTK and Hadoop9

Domain knowledge is incredibly important, particularly in the context of stochastic methodologies, and particularly in NLP. Not all language, text, or domains have the same requirements, and there is no way to make a universal model for them. Consider how the language of doctors and lawyers may be outside our experience with the language of computer science or data science. Even more generally, regions tend to have specialized dialects or phrases even within the same language. As an anthropological rule, groups tend to specialize particular language features to communicate more effectively, and attempting to capture all of these specializations leads to ambiguity.

This leads to an interesting hypothesis: the foo of big data is to combine specialized domain knowledge with an interesting data set.

NLP of Big Data using NLTK and Hadoop11

Further, given that domain knowledge and an interesting data set or sets:

  1. Form hypothesis (a possible data product)
  2. Mix in NLP techniques and machine learning tools
  3. Perform computation and test hypothesis
  4. Add to data set and domain knowledge
  5. Iterate

If this sounds like the scientific method, you’re right! This is why it’s cool to hire PhDs again; in the context of Big Data, science is needed to create products. We’re not building bridges, we’re testing hypotheses, and this is the future of business.

But this alone is not why Big Data is a buzzword. The magic of big data is that we can iterate through the foo extremely rapidly, and multiple hypotheses can be tested via distributed computing techniques in weeks instead of years, or even shorter time periods. There is currently a surplus of data and domain knowledge, so everyone is interested in getting their own piece of data real estate, that is, getting their domain knowledge and their data set. The demand is rising to meet the surplus, and as a result we’re making a lot of money. Money means buzzwords. This is the magic of Big Data!

Big Data and Natural Language Processing - Part 1

We hope you enjoyed the introduction to this series, part 1 is below.

“The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited.” -Ferdinand de Saussure

Language is dynamic - trying to create rules to capture the full scope of language (e.g. grammars) fails because of how rapidly language changes. Instead, it is much easier to learn from as many examples as possible and guess the likelihood of the meaning of language; this, after all, is what humans do. Therefore Natural Language Processing and Computational Linguistics are stochastic methodologies, and a subset of artificial intelligence that benefits from Machine Learning techniques. 

Machine Learning has many flavors, and most of them attempt to get at the long tail -- e.g. the low frequency events where the most relevant analysis occurs. To capture these events without resorting to some sort of comprehensive smoothing, more data is required, indeed the more data the better. I have yet to observe a machine learning discipline that complained of having too much data. (Generally speaking they complain of having too much modeling -- overfit). Therefore the stochastic approach of NLP needs Big Data. 

[Slide image 6: NLP of Big Data using NLTK and Hadoop]

The flip side of the coin is not as straightforward. We know there are many massive natural language data sets on the web and elsewhere. Consider tweets, reviews, job listings, emails, etc. These data sets fulfil the three V’s of Big Data: velocity, variety, and volume. But do these data sets require comprehensive natural language processing to produce interesting data products?

[Slide image 7: NLP of Big Data using NLTK and Hadoop]

The answer is: not yet. Hadoop and other tools already have built-in text processing support. There are many approaches being applied to these data sets, particularly inverted indices, collocation scores, and even N-gram modeling. However, these approaches are not true NLP -- they are simply search. They leverage string matching and lightweight syntactic analysis to perform frequency analyses.
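As a hypothetical illustration of a collocation score -- lightweight frequency analysis rather than deep NLP -- NLTK can rank bigrams by pointwise mutual information. The tiny token list below is a toy stand-in for real data.

```python
# Hypothetical sketch: collocation scoring is frequency analysis, not deep NLP.
# Assumes NLTK is installed; the token list is a toy stand-in for a real corpus.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = "big data needs nlp but big data does not need deep nlp yet".split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore bigrams seen fewer than twice

# Rank the surviving bigrams by pointwise mutual information.
print(finder.nbest(measures.pmi, 5))
```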

[Slide image 8: NLP of Big Data using NLTK and Hadoop]

We have not yet exhausted the opportunities for these frequency analyses -- many interesting results, particularly in clustering, classification, and authorship forensics, have been explored. However, these approaches will soon start to fall short of the more interesting results that users are coming to expect. Products like machine translation, sentence generation, text summarization, and more meaningful text recommendation will require strong semantic methodologies; eventually, Big Data will come to require NLP. It’s just not there yet.

Natural Language Processing and Big Data: Using NLTK and Hadoop - Talk Overview

[Slide image 1: NLP of Big Data using NLTK and Hadoop]

My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (i.e., reading level) of children’s literature. Our approach started as a natural language processing problem -- designed to pull out language features to train our algorithms -- and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk given at the April Data Science DC meetup.

Think of this post as the CliffsNotes of the talk and the upcoming series of posts, so you don’t have to read every word... but trust me, it's worth it.

Related to the interaction between Big Data and NLP:

  • Natural Language Processing needs Big Data
  • Big Data doesn’t need NLP... yet.

Related to using Hadoop and NLTK:

  • The combination of NLTK and Hadoop is perfect for preprocessing raw text (a minimal sketch follows this list)
  • More semantic analyses tend to be graph problems that MapReduce isn’t great at computing.
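For the preprocessing point, a Hadoop Streaming mapper can simply be a Python script that reads raw text on stdin and writes NLTK-normalized tokens to stdout. The sketch below is illustrative only, not the exact pipeline from the talk, and it assumes NLTK (plus the 'punkt' tokenizer data) is installed on every worker node.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: NLTK preprocessing of raw text, one record per line.
# Illustrative only -- assumes NLTK and the 'punkt' tokenizer data (nltk.download('punkt'))
# are available on every worker node.
import sys

from nltk import word_tokenize

for line in sys.stdin:
    tokens = word_tokenize(line.strip().lower())
    if tokens:
        # Emit the normalized, tokenized record for downstream jobs to consume.
        print(" ".join(tokens))
```

A script like this would be wired into a job via the Hadoop Streaming jar's -mapper option. Because each record is processed independently, this kind of tokenization parallelizes trivially; the graph-shaped semantic analyses in the second bullet do not, which is the crux of that point.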

About data products in general:

  • The foo of Big Data is the ability to take domain knowledge and a data set (or sets) and iterate quickly through hypotheses using available tools (NLP)
  • The magic of big data is that there is currently a surplus of both data and knowledge and our tools are working, so it’s easy to come up with a data product (until demand meets supply).

I'll go over each of these points in detail, as I did in my presentation, so stay tuned for the longer version [editor: so long that it has been broken up into multiple posts].

 

 

Weekly Round-Up: Ford's Data, Apple's iWatch, Wavii's Acquisition, and Fighting Malaria

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from how Ford is leveraging data to improve their operations to combating malaria using data from cell phones. In this week's round-up:

  • How Data is Changing the Car Game for Ford
  • How Apple's iWatch Will Push Big Data Analytics
  • Google Bags Another Machine Learning Startup
  • Researchers Use Data from Cell Phones to Combat Outbreaks

How Data is Changing the Car Game for Ford

This is a GigaOM article about how Ford Motor Company is using data to build better cars and better customer experiences. The article goes into some detail about how the company is doing both of these things, such as shipping data products with some of its automobiles that give consumers data about their car's performance. The author goes on to quote some of the people in charge of Ford's data efforts about internal data processes and some of the changes the company has had to make in order to become more data-driven.

How Apple's iWatch Will Push Big Data Analytics

This is a Smart Data Collective article about what Apple's rumored iWatch could mean for Big Data. According to the article, the watch will be able to capture data about where you've been, what you've eaten, how many calories you've burned, and how you've slept, among other things. The author provides some examples of products currently on the market (such as Nike's Fuelband and the Fitbit Ultra) that have expanded the amount of data that can be collected from individuals, and opines that Apple's smart watch will capture a significant share of this market. He also predicts that this will change the world of big data analytics, and he provides some examples of why he believes this.

Google Bags Another Machine Learning Startup

Google acquired machine learning startup Wavii this week, and this Wired article has some of the details about the startup, the acquisition, and how Wavii's technology may be used inside Google. The article mentions that there was a bidding war between Apple and Google for the company, so hopefully Google will be able to make this victory pay off in the near future.

Researchers Use Data from Cell Phones to Combat Outbreaks

This is an MIT Technology Review article about how epidemiologists at Harvard have been able to track the spread of diseases such as malaria by studying data generated from cell phone towers in Kenya. Using this data, they can track movement to and from regions of the country they know have a high infection rate and feed that information into predictive models that can forecast how the diseases may spread. The article goes into much more detail and is a fascinating and informative read.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: Probabilistic Programming, Tech Startups, Data Viz Elements, and Super Mario Bros.

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from probabilistic programming to machines playing video games. In this week's round-up:

  • What is Probabilistic Programming?
  • 5 Ways for Tech Start-Ups to Attract Analytics Talent
  • The Three Elements of Successful Data Visualizations
  • AI Solves Super Mario Bros and Other NES Games

What is Probabilistic Programming?

This is an interesting O'Reilly article introducing probabilistic programming. The article talks about what probabilistic programming is, how it differs from regular high-level programming, and intuitively explains how it works. The author also explains how he believes the technology's development will progress and the impact it will have on data science and other technologies.

5 Ways for Tech Start-Ups to Attract Analytics Talent

For those looking to hire analytical talent, this article provides some practical pointers for hiring a data scientist. These pointers focus on some of the softer skills that are necessary to really excel in these types of roles and also on structuring an environment where your data scientists are properly motivated to do their absolute best work.

The Three Elements of Successful Data Visualizations

This is a Harvard Business Review article about what elements are necessary in making great data visualizations. The article highlights three elements - understanding the audience, setting up a framework, and telling a story - and explains in a little more detail why each of these is important.

AI Solves Super Mario Bros and Other NES Games

This article is about an interesting and fun application of machine learning - teaching a machine to solve video games. It revolves around a paper written by computer scientist Tom Murphy about how he was able to accomplish this using lexicographic ordering. The article talks about Murphy's research and how he went about figuring out how to do this. It also has a link to Murphy's paper for those that would like some more in-depth reading on the subject.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups