Deep Learning Inspires Deep Thinking

This is a guest post by Mary Galvin, founder and managing principal at AIC. Mary provides technical consulting services to clients including LexisNexis’ HPCC Systems team. HPCC Systems is an open-source, massively parallel processing computing platform that solves Big Data problems.

Data Science DC hosted a packed house at the Artisphere on Monday evening, thanks to the efforts of organizers Harlan Harris, Sean Gonzalez, and several others who helped plan and coordinate the event. Michael Burke, Jr., Arlington County Business Development Manager, provided opening remarks and emphasized Arlington’s commitment to serving local innovators and entrepreneurs. Michael subsequently introduced Sanju Bansal, a former MicroStrategy founder and executive who presently serves as the CEO of an emerging, Arlington-based start-up, Hunch Analytics. Sanju energized the audience by providing concrete examples of data science’s applicability to business; this was no better illustrated than by the roughly $930 million acquisition of the Climate Corporation about six months ago.

Michael, Sanju, and the rest of the Data Science DC team helped set the stage for a phenomenal presentation put on by John Kaufhold, Managing Partner and Data Scientist at Deep Learning Analytics. John started his presentation by asking the audience for a show of hands on two items: 1) whether anyone was familiar with deep learning, and 2) of those who said yes to #1, whether they could explain what deep learning meant to a fellow data scientist. Of the roughly 240 attendees present, most raised a hand for question #1, but the majority of those hands dropped upon John’s prompting of question #2.

I’ll be the first to admit that I was unable to raise my hand for either of John’s introductory questions. The fact that I was at least somewhat knowledgeable about the broader machine learning field helped put my mind at ease, thanks to prior experiences working with statistical machine translation, entity extraction, and entity resolution engines. That said, I still entered John’s talk fully prepared to brace myself for the ‘deep’ learning curve that lay ahead. Although I’m still trying to decompress from everything that was covered – it being less than a week since the event took place – I’d summarize the key takeaways from the densely-packed, intellectually stimulating, 70+ minute session as follows:

  1. Machine learning’s dirty work: labelling and feature engineering. John introduced his topic by using examples from image and speech recognition to illustrate two mandatory (and often less-than-desirable) undertakings in machine learning: labelling and feature engineering. In the case of image recognition, say you wanted to determine whether a photo ‘x’ contained an image of a cat ‘y’ (i.e., estimate p(y|x)). This would typically involve taking a sizable database of images and manually labelling which subset of those images were cats. The human-labelled images would then serve as a body of knowledge from which features representative of those cats would be generated, as required by the feature engineering step in the machine learning process. John emphasized the laborious, expensive, and mundane nature of feature engineering, using his own experiences in medical imaging to prove his point.

    That said, various machine learning algorithms can use the fruits of the labelling and feature engineering labors to discern a cat within any photo – not just those cats previously observed by the system. Although there’s no getting around machine learning’s dirty work to achieve these results, the emergence of deep learning has helped to lessen it.

  2. Machine Learning’s ‘Deep’ Bench. I entered John’s presentation knowing a handful of machine learning algorithms but left realizing my knowledge had barely scratched the surface. Cornell University’s machine learning benchmarking tests can serve as a good reference point for determining which algorithm to use, provided the results are weighed against the wider ‘No Free Lunch Theorem’ consideration that even the ‘best’ algorithm has the potential to perform poorly on a subclass of problems.

    Given machine learning’s ‘deep’ bench, the neural network might have been easy to overlook just 10 years ago. Not only did it place 10th in Cornell’s 2004 benchmarking test, but John enlightened us to its fair share of limitations: inability to learn p(x), inefficiencies with more than three layers, overfitting, and relatively slow performance.

  3. The Restricted Boltzmann Machine’s (RBM’s) revival of the neural network. The year 2006 witnessed a breakthrough in machine learning, thanks to the efforts of an academic triumvirate consisting of Geoff Hinton, Yann LeCun, and Yoshua Bengio. I’m not even going to pretend that I understand the details, but their application of the Restricted Boltzmann Machine (RBM) to neural networks has played a major role in overcoming the neural network’s limitations outlined in #2 above. Take, for example, the ‘inability to learn p(x)’. Going back to the cat example in #1, this essentially means that before the triumvirate’s discovery, a neural net was incapable of using an existing set of cat images to draw a new image of a cat. Figuratively speaking, not only can neural nets now draw cats, but they can do so with impressive speed thanks to the emergence of the GPU. Stanford, for example, was able to process 14 terabytes of images in just 3 hours by overlaying deep learning algorithms on top of a GPU-centric computer architecture. What’s even better? Many implementations of deep learning algorithms are openly available under the BSD license.

  4. Deep learning’s astonishing results. Deep learning has experienced an explosive amount of success in a relatively short amount of time. Not only have several international image recognition contests recently been won by teams using deep learning, but technology powerhouses such as Google, Facebook, and Netflix are investing heavily in its adoption. For example, deep learning triumvirate member Geoff Hinton was hired by Google in 2013 to help the company make sense of its massive amounts of data and to optimize existing products that use machine learning techniques. Fellow triumvirate member Yann LeCun was hired by Facebook, also in 2013, to help integrate deep learning technologies into the company’s IT systems.
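To make the labelling-plus-feature-engineering pipeline from #1 concrete, here is a minimal sketch in Python. Everything in it is illustrative and my own invention rather than anything from the talk: the synthetic ‘images’, the two hand-crafted features, and the simple logistic-regression classifier that estimates p(y|x) from those features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hand-labelled data: 200 tiny 8x8 "images".
# Label y=1 ("cat") images are brighter in the centre; y=0 are not.
# (Purely synthetic -- an illustration, not real image data.)
n = 200
labels = rng.integers(0, 2, size=n)
images = rng.normal(0.0, 1.0, size=(n, 8, 8))
images[labels == 1, 2:6, 2:6] += 2.0

def engineer_features(img):
    """Hand-crafted features -- the laborious step the talk describes."""
    centre = img[2:6, 2:6].mean()           # centre brightness
    border = img.mean() - centre / 4.0      # crude contrast proxy
    return np.array([1.0, centre, border])  # leading 1.0 = bias term

X = np.stack([engineer_features(im) for im in images])

# Logistic regression: model p(y=1|x) from the engineered features.
w = np.zeros(X.shape[1])
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted p(y|x)
    w -= 0.1 * X.T @ (p - labels) / n   # gradient step on log-loss

accuracy = ((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The point of the sketch is where the human effort goes: the labels and `engineer_features` are supplied by hand, and the classifier only ever sees the engineered features, never the raw pixels.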
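Item #3’s idea of a model that ‘learns p(x)’ can also be sketched. Below is a toy Restricted Boltzmann Machine trained with one step of contrastive divergence (CD-1) on synthetic binary data; the data, layer sizes, and hyperparameters are all assumptions for illustration, not anything presented in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy binary data: 6-pixel "images" drawn from two prototypes,
# with 5% of bits flipped as noise.
prototypes = np.array([[1, 1, 1, 0, 0, 0],
                       [0, 0, 0, 1, 1, 1]], dtype=float)
data = prototypes[rng.integers(0, 2, size=500)]
data = np.abs(data - (rng.random(data.shape) < 0.05))

n_visible, n_hidden = 6, 2
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)  # visible biases
b_h = np.zeros(n_hidden)   # hidden biases

lr = 0.1
for epoch in range(200):
    # Positive phase: hidden activations given the data.
    ph = sigmoid(data @ W + b_h)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase (CD-1): one reconstruction step.
    pv = sigmoid(h @ W.T + b_v)
    ph2 = sigmoid(pv @ W + b_h)
    # Update from the difference of data and model correlations.
    W += lr * (data.T @ ph - pv.T @ ph2) / len(data)
    b_v += lr * (data - pv).mean(axis=0)
    b_h += lr * (ph - ph2).mean(axis=0)

# A trained RBM models p(x): reconstructions should be close to the data.
recon = sigmoid(sigmoid(data @ W + b_h) @ W.T + b_v)
err = np.mean((data - recon) ** 2)
print(f"mean reconstruction error: {err:.3f}")
```

Unlike the classifier in the earlier item, nothing here is labelled: the RBM learns the structure of the inputs themselves, which is what lets such models generate (‘draw’) new samples.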

As for all the hype surrounding deep learning, John concluded his presentation by suggesting ‘cautious optimism in results, without reckless assertions about the future’. Although it would be careless to claim that deep learning has cured disease, for example, one thing is for certain: deep learning has inspired deep thinking throughout the DC metropolitan area.

As to where deep learning has left our furry feline friends, the attached YouTube video will further explain….

(created by an anonymous audience member following the presentation)

You can see John Kaufhold's slides from this event here.

Weekly Round-Up: Big Data Projects, OpenGeo, Coca-Cola, and Crime-Fighting

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from big data projects to Coca-Cola. In this week's round-up:

  • 5 Big Data Projects That Could Impact Your Life
  • CIA Invests in Geodata Expert OpenGeo
  • How Coca-Cola Takes a Refreshing Approach to Big Data
  • Fighting Crime with Big Data

5 Big Data Projects That Could Impact Your Life

Our first piece this week is a Mashable article listing 5 interesting data projects. They range from one that predicts transit times in NYC to one that tracks homicides in DC to one that illustrates the prevalence of HIV in the United States. All are great examples of people doing interesting things with increasingly available data.

CIA Invests in Geodata Expert OpenGeo

A while back, the CIA spun off a strategic investment arm called In-Q-Tel to make investments in data and technologies that could benefit the intelligence community. This week, it was announced that they have invested in geo-data startup OpenGeo. This GigaOM article provides a little detail about the company and what they do and also lists some of the other companies In-Q-Tel has invested in thus far.

How Coca-Cola Takes a Refreshing Approach to Big Data

This is an interesting Smart Data Collective article about Coca-Cola and how they use data to drive their decisions and maintain a competitive advantage. The article describes multiple ways the company uses big data and analytics, from interacting with their Facebook followers to the formulas for their soft drinks.

Fighting Crime with Big Data

Our final piece this week is an article about how analytics platform provider Palantir helps investigators use data to find patterns that uncover white collar crime, which is usually well hidden. The article contains multiple quotes from Palantir's legal counsel Ryan Taylor about how the company works with crime-fighting agencies and what methods they employ to bring these criminals to justice.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: House of Cards, Machine Learning, Lying, and the Internet of Things

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from the new Netflix series House of Cards to the Internet of Things. In this week's round-up:

  • House of Cards and Our Future of Algorithmic Programming
  • Everything You Wanted to Know About Machine Learning
  • The Future of Lying
  • Big Data and the Internet of Things

House of Cards and Our Future of Algorithmic Programming

This MIT Technology Review article is about how Netflix used the data it has gathered from its 33 million users to take the guesswork out of creating its latest original series, House of Cards. Some of this data includes how many movies viewers watched featuring the different actors and actresses in the series, their opinions about director David Fincher's other works, and how favorably users rated similar political dramas. This could be the start of a new trend in entertainment, as companies that have traditionally served as mediums delivering content to consumers delve into creating content of their very own.

Everything You Wanted to Know About Machine Learning

For those looking to get started with machine learning, BigML published a two-part series on its blog simplifying a paper recently published by University of Washington machine learning professor Pedro Domingos. The post walks you through the basic concepts in machine learning with very intuitive language and plenty of examples to drive home the points. Part 2 of the series can be found here, and Domingos' original paper can be found here.

The Future of Lying

This is an interesting Slate article by Intel futurist Brian David Johnson about how professors at Cornell are developing programs that can detect when people are lying online. The programs use algorithms that incorporate signals such as the amount of detail people give when they describe things, the high-level reasoning being that the less detail provided, the more likely someone is lying. In the article, Johnson goes on to differentiate between different types of lies, touch on some of the implications of being able to tell truth from a lie, and talk about its potential effect on humanity.

Big Data and the Internet of Things

This is an interesting blog post that describes some of the relationships and challenges between Big Data and the Internet of Things. Both are buzzwords nowadays, but that doesn't change the fact that both will have a profound impact over the next several years. As more devices become smart (with embedded sensors or RFIDs) and generate useful data streams, companies will be able to operate more efficiently than ever before and make more timely, better-informed decisions. The article also talks about the new systems that will be required to process and effectively analyze all this sensor data.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups