Troubling instances of the mosaic effect — in which different anonymized datasets are combined to reveal unintended details — include the tracking of celebrity cab trips and the identification of Netflix user profiles. Also concerning is the tremendous influence wielded by corporations and their massive data stores, most notoriously embodied by Facebook’s secret psychological experiments.
This is a guest post by Vadim Y. Bichutskiy, a Lead Data Scientist at Echelon Insights, a Republican analytics firm. His background spans analytical and engineering positions in Silicon Valley, academia, and the US Government. He holds an MS and a BS in Computer Science from the University of California, Irvine, and an MS in Statistics from California State University, East Bay, and is a PhD candidate in Data Sciences at George Mason University. Follow him on Twitter @vybstat.
Recently I got a hold of Jared Lander's book R for Everyone. It is one of the best books on R that I have seen. I first started learning R in 2007 when I was a CS graduate student at UC Irvine. Bored with my research, I decided to venture into statistics and machine learning. I enrolled in several PhD-level statistics courses--the Statistics Department at UC Irvine is in the same school as the CS Dept.--where I was introduced to R. Coming from a C/C++/Java background, R was different, exciting, and powerful.
Learning R is challenging because documentation is scattered all over the place. There is no comprehensive book that covers many important use cases. To get the fundamentals, one has to look at multiple books as well as many online resources and tutorials. Jared has written an excellent book that covers the fundamentals (and more!). It is easy to understand, concise, and well written. The title "R for Everyone" is accurate because, while the book is great for R novices, it is also quite useful for experienced R hackers. It truly lives up to its title.
Chapters 1-4 cover the basics: installation, RStudio, the R package system, and basic language constructs. Chapter 5 discusses fundamental data structures: data frames, lists, matrices, and arrays. Importing data into R is covered in Chapter 6: reading data from CSV files, Excel spreadsheets, relational databases, and from other statistical packages such as SAS and SPSS. This chapter also illustrates saving objects to disk and scraping data from the Web. Statistical graphics is the subject of Chapter 7 including Hadley Wickham's irreplaceable ggplot2 package. Chapters 8-10 are about writing R functions, control structures, and loops. Altogether Chapters 1-10 cover lots of ground. But we're not even halfway through the book!
Chapters 11-12 introduce tools for data munging: base R's apply family of functions and aggregation, Hadley Wickham's packages plyr and reshape2, and various ways to do joins. A section on speeding up data frames with the indispensable data.table package is also included. Chapter 13 is all about working with string (character) data, including regular expressions and Hadley Wickham's stringr package. Important probability distributions are the subject of Chapter 14. Chapter 15 discusses basic descriptive and inferential statistics, including the t-test and the analysis of variance. Statistical modeling with linear and generalized linear models is the topic of Chapters 16-18. Topics here also include survival analysis, cross-validation, and the bootstrap. The last part of the book covers hugely important topics. Chapter 19 discusses regularization and shrinkage, including Lasso and Ridge regression, their generalization the Elastic Net, and Bayesian shrinkage. Nonlinear and nonparametric methods are the focus of Chapter 20: nonlinear least squares, splines, generalized additive models, decision trees, and random forests. Chapter 21 covers time series analysis with autoregressive integrated moving average (ARIMA), vector autoregressive (VAR), and generalized autoregressive conditional heteroskedasticity (GARCH) models. Clustering is the topic of Chapter 22: K-means, partitioning around medoids (PAM), and hierarchical clustering.
The final two chapters cover topics that are often omitted from other books and resources, making the book especially useful to seasoned programmers. Chapter 23 is about creating reproducible reports and slide shows with Yihui Xie’s knitr package, LaTeX, and Markdown. Developing R packages is the subject of Chapter 24.
A useful appendix on the R ecosystem puts icing on the cake with valuable resources including Meetups, conferences, Web sites and online documentation, other books, and folks to follow on Twitter.
Whether you are a beginner or an experienced R hacker looking to pick up new tricks, Jared's book will be good to have in your library. It covers a multitude of important topics, is concise and easy-to-read, and is as good as advertised.
This is a guest post by Catherine Madden (@catmule), a lifelong doodler who realized a few years ago that doodling, sketching, and visual facilitation can be immensely useful in a professional environment. The post consists of her notes from the most recent Data Visualization DC Meetup. Catherine works as the lead designer for the Analytics Visualization Studio at Deloitte Consulting, designing user experiences and visual interfaces for visual analytics prototypes. She prefers Paper by Fifty Three and their Pencil stylus for digital note taking. (Click on the image to open full size.)
Our (lucky number) 13th DIDC meetup took place at the spacious offices of Endgame in Clarendon, VA. Endgame very graciously provided incredible gourmet pizza (and beer) for all those who attended.
Beyond such excellent beverages and food, attendees were treated to four separate and compelling talks. For those of you who could not attend, a little information about the talks and speakers (as well as contact information), along with the slides, is below!
This is a guest post by Majid al-Dosari, a master’s student in Computational Science at George Mason University. I recently attended the first DC Energy and Data Summit organized by Potential Energy DC and co-hosted by the American Association for the Advancement of Science’s Fellowship Big Data Affinity Group. I was excited to be at a conference where two important issues of modern society meet: energy and (big) data!
There was a keynote and plenary panel. In addition, there were three breakout sessions where participants brainstormed improvements to building energy efficiency, the grid, and transportation. Many of the issues raised at the conference could be either big data or energy issues (separately). However, I’m only going to highlight points raised that deal with both energy and data.
In the keynote, Joel Gurin (NYU Governance Lab, Director of OpenData500) emphasized the benefits of open government data (which can include unexpected use cases). In the energy field, this includes data about electric power consumption, solar irradiance, and public transport. He mentioned that the private sector also has a role in publishing and adding value to existing data.
Then, in the plenary panel, Lucy Nowel (Department of Energy) brought up the costs associated with the management, transport, and analysis of big data. These costs can be measured in terms of time and energy. You can ask this question: At what point does it “cost” less to transport some amount of data physically (via a SneakerNet) than it does through some computer network?
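One way to make the SneakerNet question concrete is a back-of-the-envelope comparison of transfer times. The sketch below is only an illustration: the link speed, the courier time, and the function names `network_hours` and `breakeven_tb` are assumptions, not figures from the panel, and it models time only, not the energy cost that was also raised.

```python
def network_hours(data_tb: float, link_gbps: float) -> float:
    """Hours needed to send data_tb (decimal terabytes) over a link_gbps link."""
    bits = data_tb * 1e12 * 8              # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)     # gigabits/s -> bits/s
    return seconds / 3600


def breakeven_tb(courier_hours: float, link_gbps: float) -> float:
    """Data size (TB) at which the network takes as long as the courier."""
    bits = courier_hours * 3600 * link_gbps * 1e9
    return bits / 8 / 1e12


if __name__ == "__main__":
    # Hypothetical inputs: a sustained 1 Gbit/s link and a 24-hour courier.
    print(f"100 TB over the network: {network_hours(100, 1.0):.1f} hours")
    print(f"Break-even data size: {breakeven_tb(24, 1.0):.1f} TB")
```

Under these assumed numbers, anything larger than the break-even size arrives sooner by courier. A fuller model would add the time to copy data onto and off the physical media.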
After the panel, I attended the breakout session dealing with energy efficiency of homes and businesses. The former is the domain of Opower represented by Asher Burns-Burg, while the latter is the domain of Aquicore represented by Logan Soya. It is of interest to compare the general strategy of both companies here. Opower uses psychological methods to encourage households to reduce consumption. On the other hand, Aquicore uses business metrics to show how building managers can save money. But both are data-enabled.
Asher claims that Opower is just scratching the surface with what is possible with the use of data. He also talked about how personalization can be used to deliver more effective messages to consumers. Meanwhile, Aquicore has challenges associated with working with existing (old) metering technology in order to obtain more fine-grained data on building energy use.
In the concluding remarks, I became aware of discussions at the other breakout sessions. The most notable to me was a concern raised by the transportation session: The rebound effect can offset any gain in efficiency by an increase in consumption. Also, the grid breakout session suggested that there should be a centralized “data mart” and a way to be able to easily navigate the regulations of the energy industry.
While DC is not Houston, the unique environment of policy, entrepreneurship, and analytical talent gives DC the potential to innovate in this area. Credit goes to Potential Energy DC for creating a supportive environment.
This is a guest post by Alex Evanczuk, a software engineer at FiscalNote. Hello DC2! My name is Alex Evanczuk, and I recently joined a government data startup right here in the nation's capital that goes by the name of FiscalNote. Our mission is to make government data easily accessible, transparent, and understandable for everyone. We are a passionate group of individuals and are actively looking for other like-minded people who want to see things change. If this is you, and particularly if you are a software developer (front-end, with experience in Ruby on Rails), please reach out to me at email@example.com and I can put you in touch with the right people.
The topics covered by the presenters at June’s Data Science DC Meetup were varied and interesting. Subjects included spatial forecasting in uncertain environments, cell phone surveys in Africa (GeoPoll), causal inference models for improving the lives and prospects of Children and Youth (Child Trends), and several others.
I noticed a number of fascinating trends about the presentations I saw. The first was the simple and unadulterated love of numbers and their relationships to one another. Each presenter proudly explained the mathematical underpinnings of the models and assumptions used in their research, and most had slides that contained nothing more than a single formula or graph. In my brief time in academia, I've noticed that to most statisticians and mathematicians, numbers are their poetry, and this rang true at the event as well.
The second was something that is well known to data researchers but perhaps less so to others: the advantages and influence of data science can extend into any industry. From business to social work to education to healthcare, data science can find a way to improve our understanding of any field.
More important than the numbers, however, is the fact that behind every data point, integer, and graph, is a human being. The human beings behind our data inspire our use of numbers and their deep understanding to develop axiomatically correct solutions for real world problems. The researchers presented data that told us how we might better understand emotional sentiment in developing countries, or make decisions on cancer treatments, or help children reach their boundless potential. For me, this is what data science is all about--how the appreciation of mathematics can help us improve the lives of human beings.
This is the second part of a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup. Tandem NSI is a public-private partnership between Arlington Economic Development and Amplifier Ventures. According to the TNSI website, the partnership is intended to foster a vibrant technology ecosystem that combines entrepreneurs, university researchers and students, national security program managers and the supporting business community. I attended the Tandem NSI Deal Day on May 7; this post is a summary of a few discussions relevant to DC2.
In part one, I discussed the pros and cons of starting a tech business in the DC region; in this post, I'll discuss the specific barriers to entry of which entrepreneurs focusing on obtaining federal contracts should be aware when operating in our region, as well as ideas for how interested members of our community can get involved.
Barriers to innovation and entrepreneurship for federal contractors
One of the first talks of the day came from SpaceX's Deputy General Counsel, David Harris. It captured in one slide an issue all small technology companies operating in the federal space face, namely the FAR (Federal Acquisition Regulations). Specifically, David simply counted the number of clauses in different types of contracts, including standard Cooperative Research and Development Agreements, Contract Service Level Agreement Property Licenses, SpaceX's Form LSA, and a commercial-off-the-shelf procurement contract. Each of these contracts generally has 12 to 27 clauses. As a bottom line, he compared these to the number of clauses in a traditional FAR fixed-price contract with one cost-plus Contract Line Item Number: more than 200 clauses. In discussion, there was even a suggestion that the federal government might want to reexamine how it does business with smaller technology companies to encourage innovators to spend time innovating rather than parsing legalese. The tacit message was that the FAR may go too far. Add to the FAR the requirements of the Defense Contract Audit Agency and sometimes months-long contracting delays, and you have created a heavy legal and accounting burden on innovators.
Peggy Styer of Blackbird also told a story about how commitment to mission and successful execution for the government can sometimes narrow the potential market for a business. A paraphrase of Peggy's story: It's good to be focused on mission, but there can be strategic conflict between commercial and government success. As an example, when they came under fire in theatre, special ops forces were once expected to carry a heavy tracking device the size of a car battery and run for their lives into the desert where a rescue team could later find and retrieve them. Blackbird miniaturized a tracking device with the same functionality, which made soldiers on foot faster and more mobile, improving survivability. The US government loved the device. But they loved it so much they asked Blackbird to sell to the US government exclusively (and not to commercialize it for competitors). This can put innovators for the government in a difficult position with a smaller market than they might have expected in the broader commercial space.
Dan Doney, Chief Innovation Officer at the Defense Intelligence Agency, described a precedent “culture” of the “man on the moon” success that was in many ways a blueprint for how research is still conducted in the federal government. Specifically, putting a man on the moon was a project of a scale and complexity only our coordinated US government could manage in the 1960s. To accomplish the mission, the US government collected requirements, matched requirements with contractors, and systematically filled them all. And that was a tremendous success. However, almost 50 years later, a slavish focus on requirements may be the problem, Dan argued. Dan described “so much hunger” among our local innovative entrepreneurs to solve mission-critical problems that, in order to exploit it, the government needs to eliminate the “friction” from the system. Dan argued that eliminating that “friction” has been shown to get enormous results faster and cheaper than traditional contracting models. He continued: “our innovation problems are communication problems,” pointing out that Broad Agency Announcements (BAAs), which are how the US government often announces project needs, are terrible abstractions of problems to be solved. The overwhelming jumble of legalese that has nothing to do with technical work was also discussed as a barrier for technical minds: just finding the technical nugget a BAA is really asking for is an exhausting search across all the FedBizOpps announcements.
There was also a brief discussion of how contracts can become inflexible handcuffs that focus contractors on “hitting their numbers” on the tasks a PM originally thought they should solve at the time of contracting, even when it becomes clear in the course of a program that a contractor should now be solving other, more relevant problems. In essence, contractors are asked to ask and answer relevant research questions, and research is executed with contracts, but those contracts often become counterproductively inflexible for asking and answering research questions.
What can DC2 do?
- I only recognized three DC2 participants at this event. With a bigger presence, we could be a more active and relevant part of the discussion on how to incentivize government to make better use of its innovative entrepreneurial resources here in the DMV.
- Deal Day provided a forum to hear from both successful entrepreneurs and the government side. These panels documented some strategies for how some performers successfully navigated those opportunities for their businesses. What Deal Day didn’t offer was a chance to hear from small innovative startups on what their particular needs are. Perhaps DC2 could conduct a survey of its members to inform future Tandem NSI discussions.
This is a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup. Tandem NSI is a public-private partnership between Arlington Economic Development and Amplifier Ventures. According to the TNSI website, the partnership is intended to foster a vibrant technology ecosystem that combines entrepreneurs, university researchers and students, national security program managers and the supporting business community. I attended the Tandem NSI Deal Day on May 7; this post is a summary of a few discussions relevant to DC2.
The format of Deal Day was a collection of speakers and panel discussions from both successful entrepreneurs and government representatives from the Arlington area, including:
- Introductions by Arlington County Board Chairperson, Jay Fisette, and Arlington House Representative Jim Moran;
- Current trends in mergers and acquisitions and business acquisitions for national security product startups;
- “How to Hack the System,” a discussion with successful national security product entrepreneurs;
- “Free Money,” in which national security agency program managers told us where they need research done by small business and how you can commercialize what you learn; and
- “What’s on the Edge,” in which national security program managers told us where they have cutting edge opportunities for entrepreneurs that are on the edge of today’s tech, and will be the basis of tomorrow’s great startups.
There were two DC2-relevant themes from the day that I’ve distilled: the pros and cons of starting a tech business in the DC region, and the specific barriers to entry of which entrepreneurs focusing on obtaining federal contracts should be aware when operating in our region. This post will focus on the first theme; the second will be discussed in Part 2 of the recap, later this week.
Startups in the DC Metropolitan Statistical Area vs. “The Valley”
A lot of discussion focused on starting up a tech company here in the DC MSA (which includes Washington, DC; Calvert, Charles, Frederick, Montgomery and Prince George’s counties in MD; and Arlington, Fairfax, Loudoun, Prince William, and Stafford counties as well as the cities of Alexandria, Fairfax, Falls Church, Manassas and Manassas Park in VA) versus the Valley. Most of the panelists and speakers had experience starting companies in both places, and there were pros and cons to both. Here's a brief summary in no particular order.
DC MSA Startup Pros
- Youth! According to Jay Fisette, Arlington has the highest percentage of 25-34 year olds in America.
- Education. Money magazine called Arlington the most educated city in America.
- Capital. The concentration of many high-end government research sponsors--the National Science Foundation, Defense Advanced Research Projects Agency, Intelligence Advanced Research Projects Agency, the Office of Naval Research, etc.--can provide early-stage, non-dilutive research investment.
- Localized impact. Entrepreneurial aims are often US-centric, rather than global.
- A mission-focused talent pool.
- A high concentration of American citizens and cleared personnel.
- Local government support. As an example, initiatives like ConnectArlington provide more secure broadband for Arlington companies.
DC MSA Startup Cons
- Localized impact. Entrepreneurial aims are often US-centric, rather than global. (Yes, this appears on both lists!)
- Heavy regulations. Federal Acquisition Regulations (FAR) and Defense Contract Audit Agency accounting requirements can complicate the already difficult task of starting a business.
- Bureaucracy. It’s DC. It’s a fact.
- Extremely complex government organization with significant personnel turnover.
- Less experienced “product managers.”
Silicon Valley Startup Pros
- Venture capitalists and big corporations are “throwing money at you” in the tech space.
- Plenty of entrepreneurial breadth.
- Plenty of talent in productization.
- Plenty of experience in commercial projects.
- Very liquid and competitive labor market--which is great for individual employees.
- Aims are often global, rather than US-centric.
- Compensation is unconstrained by government regulation.
- Great local higher education infrastructure: Berkeley, UCSF, National Labs, Stanford...
Silicon Valley Startup Cons
- Very liquid and competitive labor market--which means building a loyal, talented team can be a struggle.
- VCs and big corporation investments are unsustainably frothy.
- Less talent in or exposure to federal contracting.
- A smaller pool of American citizens and cleared personnel.
Check back later this week to find out what TNSI Deal Day panelists had to say about stumbling blocks to obtaining federal contracts!
Perhaps you’ve heard the phrase “software is eating the world” lately. Well, to be successful at that, it’s going to have to do at least as good a job of eating the world’s data as do the systems of Kalev Leetaru, Georgetown/Yahoo! fellow.
Kalev Leetaru, lead investigator on GDELT and other tools, defines “world class” work, certainly in the sense of size and scope of data. The goal of GDELT and related systems is to stream global news and social media in as near real time as possible through multiple processing steps. The overall aim is to arrive at reliable tone (sentiment) mining and differential conflict detection, and to do so globally. It is a grand goal.
Kalev Leetaru’s talk covered several broad areas: the history of data and communication, data quality and “gotcha” issues in data sourcing and curation, the geography of Twitter, processing architecture, toolkits and considerations, and data formatting observations. In each he had a fresh perspective or a novel idea, born of the requirement to handle enormous quantities of ‘noisy’ or ‘dirty’ data.
Leetaru observed that “the map is not the territory” in the sense that actual voting, resource, or policy boundaries as measured by various data sources may not match assigned national boundaries. He flagged this as a question of “spatial error bars” for maps.
Distinguishing global data science from hard, established HPC-like pursuits (such as computational chemistry), Kalev Leetaru observed that we make our own bespoke toolkits and that there is no single “magic toolkit” for Big Data, so we should be prepared and willing to spend time putting our toolchain together.
After talking a bit about the historical evolution and growth of data, Kalev Leetaru asked a few perspective-changing questions (some clearly relevant to intelligence agency needs): How to find all protests? How to locate all law books? Some of the more interesting data curation tools and resources Kalev Leetaru mentioned, and a lot more, can be found by the interested reader in The Oxford Guide to Library Research by Thomas Mann.
GDELT (covered further below) labels parse trees with error rates and reaches beyond the “WHAT” of simple news media to tell us “WHY” and “how reliable.” One GDELT output product among many is the Daily Global Conflict Report, which covers world leader emotional state and differential change in conflict, not absolute markers.
One recurring theme was to find ways to define and support “truth.” Kalev Leetaru decried one current trend in Big Data, the so-called “Apple Effect”: making luscious pictures from data, with more focus on appearance than on actual ground truth. One example he cited was a conclusion from a recent report on Syria, which, blithely based on geotagged English-language tweets and Facebook postings, cast a skewed light on Syria’s rebels (Bzzzzzt!).
Leetaru provided one answer on “how to ‘ground truth’ data” by asking “how accurate are geotagged tweets?” Such tweets are, after all, only 3% of the total. But he was able to use those tweets reliably. How? By correlating location with electric power availability (r = .89). He also talked about how to handle emoticons, irony, sarcasm, and other affective language, cautioning analysts to think beyond blindly plugging data into pictures.
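As a rough illustration of what a figure like r = .89 measures, the sketch below computes a Pearson correlation coefficient from paired observations. It is not Leetaru's code: the function name `pearson_r` and the sample values are hypothetical and exist only to show the calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up values per region: geotagged tweet counts and a power-availability index.
    tweets = [12, 30, 45, 80, 95]
    power = [0.20, 0.35, 0.60, 0.80, 0.90]
    print(f"r = {pearson_r(tweets, power):.2f}")
```

A value near 1 means the two measurements rise together. It says nothing on its own about why they do, which is where the analyst's judgment comes in.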
Kalev Leetaru talked engagingly about the geography of Twitter, encouraging us to do more RTFD (D = data) than RTFM. Cut your own way through the forest. The valid maps have not been made yet, so be prepared to make your own. Some of the challenges he cited were how to break up typical #hashtagswithnowhitespace and put them back into sentences, and how to build (and maintain) sentiment/tone dictionaries. Expect, therefore, to spend the vast majority of time in innovative projects on human tuning of the algorithms and understanding the data, and then iterating the machine. Refreshingly “hands on.”
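The hashtag problem can be sketched as a dictionary-based word-break search. This is a minimal illustration, not the approach used in the talk: the function name `segment_hashtag` and the tiny vocabulary are hypothetical, and a real system would need a large lexicon plus word-frequency scoring to choose between competing splits.

```python
from __future__ import annotations


def segment_hashtag(tag: str, vocabulary: set[str]) -> list[str] | None:
    """Split a whitespace-free hashtag into known words, or return None.

    Uses dynamic programming and prefers the split with the fewest words.
    """
    text = tag.lstrip("#").lower()
    # best[i] is the shortest known segmentation of text[:i], or None.
    best: list[list[str] | None] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is None or word not in vocabulary:
                continue
            candidate = prefix + [word]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[len(text)]


if __name__ == "__main__":
    # Hypothetical toy vocabulary, for illustration only.
    vocab = {"hash", "tags", "hashtags", "with", "no",
             "white", "space", "whitespace"}
    print(segment_hashtag("#hashtagswithnowhitespace", vocab))
```

Preferring fewer words is a crude tie-breaker. It picks "hashtags" over "hash" + "tags" here, but it is exactly the kind of rule that needs the human tuning described above.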
Scale and Tech Architecture
Kalev Leetaru turned to the scale of data, which is now easily generated in the petabytes-per-day range. There is no longer any question that automation must be used and that serious machinery will be involved. Our job is to get that automation machinery doing the right thing, and if we do so, we can measure the ‘heartbeat of society.’
For a book images project (60 million images across hundreds of years) he mentioned a number of tools and file systems (but neither Gluster nor Ceph, disappointingly to this reviewer!) and delved deeply and masterfully into the question of how to clean and manage the very dirty data of “closed captioning” found in news reports. To full-text geocode and analyze half a million hours of news (from the Internet Archive), we need fast language detection and captioning error assessment. What makes this task horrifically difficult is that POS tagging “fails catastrophically on closed captioning” and that CC is far worse in quality than Optical Character Recognition output. The standard Stanford NL Understanding toolkit is very “fragile” in this domain, one reason being that news media has an extremely high density of location references, forcing the analyst to use context to disambiguate.
He covered GDELT (Global Database of Events, Language, and Tone), which captures human/societal behavior and beliefs at scale around the world. A system of half a billion plus georeferenced rows, 58 columns wide, comprising 100,000 sources such as broadcast, print, and online media back to 1979, it relies on both human translation and Google Translate, and will soon be extended across languages and back to the 1800s. Further, he’s incorporating 21 billion words of academic literature into this model (a first!) and expects availability in Summer 2014. (Sources include JSTOR, DTIC, CIA, CVORE CiteSeerX, IA.)
GDELT’s architecture, which relies heavily on the Google Cloud and BigQuery, can stream at 100,000 input observations/second. This reviewer wanted to ask him about update and delete needs and speeds, but the stream is designed to optimize ingest and query. GDELT tools were myriad, but Perl was frequently mentioned (for text processing).
Kalev Leetaru shared some post-GDELT-construction takeaways: “it’s not all English” and “watch out for full Unicode compliance” in your toolset, lest your lovely data processing stack SEGFAULT halfway through a load. Store data in whatever is easy to maintain and fast. Modularity is good, but performance can be an issue; watch out for XML, which bogs down processing on highly nested data. Use it for interchange more than anything. Sharing seems “nice,” but “you can’t share a graph,” and “RAM disk is your friend,” more so even than SSD, FusionIO, or fast SANs.
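The Unicode warning can be illustrated with a defensive decoding step at the point where raw bytes enter a pipeline. This is a sketch under assumptions, not a description of GDELT's stack: the helper name `safe_decode` is hypothetical, and it assumes input that is meant to be UTF-8.

```python
import unicodedata


def safe_decode(raw: bytes) -> str:
    """Decode bytes as UTF-8 without raising on malformed input.

    Invalid byte sequences become U+FFFD (the replacement character),
    and the result is normalized to NFC so that visually identical
    strings compare equal.
    """
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    good = "café".encode("utf-8")
    bad = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8
    print(safe_decode(good))
    print(repr(safe_decode(bad)))
```

Replacing bad bytes keeps a long load running but silently loses information, so counting the replacement characters per source is a reasonable way to spot feeds that need a different encoding.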
The talk, like this blog post, ran over allotted space and time, but the talk was well worth the effort spent understanding it.
This is a guest post by Mary Galvin, founder and managing principal at AIC. Mary provides technical consulting services to clients including LexisNexis’ HPCC Systems team. The HPCC is an open source, massive parallel-processing computing platform that solves Big Data problems.
Data Science DC hosted a packed house at the Artisphere on Monday evening, thanks to the efforts of organizers Harlan Harris, Sean Gonzalez, and several others who helped plan and coordinate the event. Michael Burke, Jr, Arlington County Business Development Manager, provided opening remarks and emphasized Arlington’s commitment to serving local innovators and entrepreneurs. Michael subsequently introduced Sanju Bansal, a former MicroStrategy founder and executive who presently serves as the CEO of an emerging, Arlington-based start-up, Hunch Analytics. Sanju energized the audience by providing concrete examples of data science’s applicability to business; this was no better illustrated than by the $930 million acquisition of Climate Corp. roughly six months ago.
Michael, Sanju, and the rest of the Data Science DC team helped set the stage for a phenomenal presentation put on by John Kaufhold, Managing Partner and Data Scientist at Deep Learning Analytics. John started his presentation by asking the audience for a show of hands on two items: 1) whether anyone was familiar with deep learning, and 2) of those who said yes to #1, whether they could explain what deep learning meant to a fellow data scientist. Of the roughly 240 attendees present, the majority of hands that answered favorably to question #1 dropped significantly upon John’s prompting of question #2.
I’ll be the first to admit that I was unable to raise my hand for either of John’s introductory questions. The fact I was at least a bit knowledgeable in the broader machine learning topic helped to somewhat put my mind at ease, thanks to prior experiences working with statistical machine translation, entity extraction, and entity resolution engines. That said, I still entered John’s talk fully prepared to brace myself for the ‘deep’ learning curve that lay ahead. Although I’m still trying to decompress from everything that was covered – it being less than a week since the event took place – I’d summarize key takeaways from the densely-packed, intellectually stimulating, 70+ minute session that ensued as follows:
Machine learning’s dirty work: labelling and feature engineering. John introduced his topic by using examples from image and speech recognition to illustrate two mandatory (and often less-than-desirable) undertakings in machine learning: labelling and feature engineering. In image recognition, for instance, say you wanted to estimate the probability that a photo, ‘x’, contains a cat, ‘y’ (i.e., p(y|x)). This would typically involve taking a sizable database of images and manually labelling which of those images contain cats. The human-labelled images would then serve as a body of knowledge from which features representative of those cats would be generated, as required by the feature engineering step in the machine learning process. John emphasized the laborious, expensive, and mundane nature of feature engineering, using his own experiences in medical imaging to illustrate his point.
With that work done, various machine learning algorithms can use the fruits of the labelling and feature engineering labors to discern a cat within any photo, not just the cats previously observed by the system. Although there’s no getting around machine learning’s dirty work to achieve these results, the emergence of deep learning has helped to lessen it.
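To make the labelling and feature engineering workflow described above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and not something John presented: the toy "photos" are tiny brightness grids, striped grids stand in for cats, the two hand-engineered features (mean brightness and horizontal contrast) are invented for this example, and a small logistic regression plays the role of the algorithm that estimates p(y|x).

```python
import math

def extract_features(image):
    """Hand-engineered features: mean brightness and mean horizontal contrast."""
    pixels = [p for row in image for p in row]
    mean_brightness = sum(pixels) / len(pixels)
    jumps = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    return [mean_brightness, sum(jumps) / len(jumps)]

def predict_proba(weights, bias, features):
    """Logistic model of p(y = 1 | x) computed on the engineered features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(feature_rows, labels, epochs=2000, learning_rate=0.5):
    """Fit the logistic model with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(feature_rows[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(feature_rows, labels):
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * f for w, f in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Step 1 (labelling): a human marks each toy image as cat (1) or not (0).
# Hypothetical data: striped grids stand in for cats, flat grids for non-cats.
labelled_images = [
    ([[0, 1, 0, 1], [0, 1, 0, 1]], 1),
    ([[1, 0, 1, 0], [1, 0, 1, 0]], 1),
    ([[0.5] * 4, [0.5] * 4], 0),
    ([[0.2] * 4, [0.2] * 4], 0),
    ([[0.8] * 4, [0.8] * 4], 0),
]

# Step 2 (feature engineering): turn every image into a short feature vector.
feature_rows = [extract_features(image) for image, _ in labelled_images]
labels = [label for _, label in labelled_images]

# Step 3 (learning): fit p(y | x) on the engineered features.
weights, bias = train(feature_rows, labels)

# The model is then applied to images it has never seen.
unseen_striped = [[0, 0.9, 0.1, 1], [0.1, 1, 0, 0.9]]
unseen_flat = [[0.4] * 4, [0.4] * 4]
p_striped = predict_proba(weights, bias, extract_features(unseen_striped))
p_flat = predict_proba(weights, bias, extract_features(unseen_flat))
print(p_striped, p_flat)
```

In this sketch the learning step is the easy part; choosing features that actually separate the classes is where the human effort goes, which is the ‘dirty work’ deep learning helps reduce.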
Machine Learning’s ‘Deep’ Bench. I entered John’s presentation knowing a handful of machine learning algorithms but left realizing my knowledge had barely scratched the surface. Cornell University’s machine learning benchmarking tests can serve as a good reference point for determining which algorithm to use, provided the results are weighed against the wider ‘No Free Lunch Theorem’ consideration that even the ‘best’ algorithm can perform poorly on some subclass of problems.
Given machine learning’s ‘deep’ bench, the neural network might have been easy to overlook just 10 years ago. Not only did it place 10th in Cornell’s 2004 benchmarking test, but John also walked us through its fair share of limitations: an inability to learn p(x), inefficiency with more than 3 layers, overfitting, and relatively slow performance.
The Restricted Boltzmann Machine’s (RBM’s) revival of the neural network. The year 2006 witnessed a breakthrough in machine learning, thanks to the efforts of an academic triumvirate consisting of Geoff Hinton, Yann LeCun, and Yoshua Bengio. I’m not going to even pretend that I understand the details, but will just say that their application of the Restricted Boltzmann Machine (RBM) to neural networks has played a major role in overcoming the neural network limitations outlined above. Take, for example, the ‘inability to learn p(x)’. Going back to the earlier cat example, what this essentially means is that before the triumvirate’s discovery, the neural net was incapable of using an existing set of cat images to draw a new image of a cat. Figuratively speaking, not only can neural nets now draw cats, but they can do so quickly thanks to the emergence of the GPU. Stanford, for example, was able to process 14 terabytes of images in just 3 hours by running deep learning algorithms on a GPU-centric computer architecture. What’s even better? Many implementations of deep learning algorithms are openly available under the BSD license.
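For readers wondering what ‘learning p(x)’ looks like in practice, here is a minimal Python sketch. To be clear, this is not an RBM and not anything from John’s slides: it is a far simpler stand-in that treats every pixel as independent, and the 3x3 binary ‘images’ are made up for illustration. It only shows the two things a model of p(x) offers that a model of p(y|x) does not: scoring how plausible an image is, and drawing a new one.

```python
import math
import random

def fit_pixel_model(images):
    """Estimate p(pixel = 1) at each position from binary training images."""
    count = len(images)
    return [sum(img[i] for img in images) / count for i in range(len(images[0]))]

def log_prob(pixel_probs, image, eps=1e-6):
    """log p(x) under the independent-pixel model, clipped to avoid log(0)."""
    total = 0.0
    for p, pixel in zip(pixel_probs, image):
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p if pixel == 1 else 1.0 - p)
    return total

def sample_image(pixel_probs, rng):
    """Draw a new image from the learned p(x), one pixel at a time."""
    return [1 if rng.random() < p else 0 for p in pixel_probs]

# Hypothetical training set: flattened 3x3 patterns, mostly a plus shape,
# standing in for the collection of cat photos.
training_images = [
    [0, 1, 0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 0, 1, 0],
]
pixel_probs = fit_pixel_model(training_images)

# Scoring: a pattern like the training data is more probable than its inverse.
plus_shape = [0, 1, 0, 1, 1, 1, 0, 1, 0]
inverted = [1 - pixel for pixel in plus_shape]
print(log_prob(pixel_probs, plus_shape), log_prob(pixel_probs, inverted))

# Generating: draw a brand-new image from the model.
rng = random.Random(0)
new_image = sample_image(pixel_probs, rng)
print(new_image)
```

An RBM goes well beyond this by learning dependencies between pixels through hidden units, but the goal is the same: a model from which new examples can be drawn.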
Deep learning’s astonishing results. Deep learning has experienced explosive success in a relatively short time. Not only have several international image recognition contests recently been won by teams using deep learning, but technology powerhouses such as Google, Facebook, and Netflix are investing heavily in its adoption. For example, deep learning triumvirate member Geoff Hinton was hired by Google in 2013 to help the company make sense of its massive amounts of data and to optimize existing products that use machine learning techniques. Fellow triumvirate member Yann LeCun was hired by Facebook, also in 2013, to help integrate deep learning technologies into the company’s IT systems.
As for all the hype surrounding deep learning, John concluded his presentation by suggesting ‘cautious optimism in results, without reckless assertions about the future’. Although it would be careless to claim that deep learning has cured disease, for example, one thing is for sure: deep learning has inspired deep thinking throughout the DC metropolitan area.
As to where deep learning has left our furry feline friends, the attached YouTube video will further explain….
(created by an anonymous audience member following the presentation)
You can see John Kaufhold's slides from this event here.