Meetup

DC NLP November 2014 Meetup Announcement: Introduction to Topic Modeling with LDA

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP November Meetup!

This month's event features an overview of Latent Dirichlet Allocation and probabilistic topic modeling.

Topic models are a family of statistical models for estimating the distribution of the abstract concepts (topics) that make up a collection of documents. Over the last several years, the popularity of topic modeling has swelled. One model, Latent Dirichlet Allocation (LDA), is especially popular.
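If you'd like to experiment before the meetup, here is a minimal sketch of fitting an LDA model with gensim (one of the Python libraries listed later in this post); the toy corpus and parameter choices are illustrative assumptions, not material from the talk.

```python
# Minimal LDA sketch with gensim; the corpus and parameters are toy
# illustrations, not material from the talk.
from gensim import corpora, models

documents = [
    "the cat sat on the mat",
    "dogs and cats make popular pets",
    "the stock market fell sharply today",
    "investors worried about falling stock prices",
]

# Tokenize and map each word to an integer id
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)

# Represent each document as a bag-of-words vector
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a two-topic LDA model and inspect the topic-word distributions
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```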

Notes on a Meetup

This is a guest post by Catherine Madden (@catmule), a lifelong doodler who realized a few years ago that doodling, sketching, and visual facilitation can be immensely useful in a professional environment. The post consists of her notes from the most recent Data Visualization DC Meetup. Catherine works as the lead designer for the Analytics Visualization Studio at Deloitte Consulting, designing user experiences and visual interfaces for visual analytics prototypes. She prefers Paper by Fifty Three and their Pencil stylus for digital note taking. (Click on the image to open full size.)

DC NLP September 2014 Meetup Announcement: Natural Language Processing for Assistive Technologies

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP September Meetup!

This month, we're joined by Kathy McCoy, Professor of Computer & Information Science and Linguistics at the University of Delaware. Kathy is also a consultant for the National Institute on Disability and Rehabilitation Research (NIDRR) at the U.S. Department of Education. Her research focuses on natural language generation and understanding, particularly for assistive technologies, and she'll be giving a presentation on Replicating Semantic Connections Made by Visual Readers for a Scanning System for Nonvisual Readers.

DC NLP August 2014 Meetup Announcement: Automatic Segmentation

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP August Meetup!

This August, we're joined by Tony Davis, technical manager in the NLP and machine learning group at 3M Health Information Systems and adjunct professor in the Georgetown University Linguistics Department, where he's taught courses including information retrieval and extraction, and lexical semantics.

Tony will be introducing us to automatic segmentation. Automatic segmentation deals with breaking up unstructured documents into units - words, sentences, topics, etc. Search and retrieval, document categorization, and analysis of dialog and discourse all benefit from segmentation.

Tony's talk will cover some of the techniques, linguistic and otherwise, that have been applied to segmentation, with particular reference to two use cases: multimedia information retrieval and medical coding.
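As a taste of what segmentation looks like in code, here is a minimal sketch using NLTK's off-the-shelf sentence and word tokenizers; the sample text is invented, and the talk itself may use entirely different tools.

```python
# Word- and sentence-level segmentation with NLTK's pretrained tokenizers.
# The sample text is invented for illustration.
import nltk

nltk.download("punkt", quiet=True)  # models for the sentence tokenizer

text = ("Dr. Smith reviewed the chart. The patient was discharged on 5/1. "
        "Follow-up is scheduled in two weeks.")

sentences = nltk.sent_tokenize(text)                 # sentence segmentation
words = [nltk.word_tokenize(s) for s in sentences]   # word segmentation

print(sentences)
print(words)
```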

Following July's joint meetup with Statistical Programming DC and Data Wranglers, where Tommy Jones and Charlie Greenbacker performed a showdown between tools in R and Python, we're back at our usual U-Street location.

DC NLP August Meetup

Wednesday, August 13, 2014, 6:30 PM to 8:30 PM
Stetsons Famous Bar & Grill
1610 U Street Northwest, Washington, DC

The DC NLP meetup group is for anyone in the Washington, D.C. area working in (or interested in) Natural Language Processing. Our meetings will be an opportunity for folks to network, give presentations about their work or research projects, learn about the latest advancements in our field, and exchange ideas or brainstorm. Topics may include computational linguistics, machine learning, text analytics, data mining, information extraction, speech processing, sentiment analysis, and much more.

For more information and to RSVP, please visit: http://www.meetup.com/DC-NLP/events/192504832/

@DCNLP

Natural Language Processing in Python and R

This is a guest post by Charlie Greenbacker and Tommy Jones.

Data comes in many forms. As a data scientist, you might be comfortable working with large amounts of structured data nicely organized in a database or other tabular format, but what do you do if a customer drops 10,000 unstructured text documents in your lap and asks you to analyze them?

Some estimates claim unstructured data accounts for more than 90 percent of the digital universe, much of it in the form of text. Digital publishing, social media, and other forms of electronic communication all contribute to the deluge of text data from which you might seek to derive insights and extract value. Fortunately, many tools and techniques have been developed to facilitate large-scale text analytics. Operating at the intersection of computer science, artificial intelligence, and computational linguistics, Natural Language Processing (NLP) focuses on algorithmically understanding human language.

Interested in getting started with Natural Language Processing but don't know where to begin? On July 9th, a joint meetup co-hosted by Statistical Programming DC, Data Wranglers DC, and DC NLP will feature two introductory talks on the nuts & bolts of working with NLP in Python and R.

The Python programming language is increasingly popular in the data science community for a variety of reasons, including its ease of use and the plethora of open source software libraries available for scientific computing & data analysis. Packages like SciPy, NumPy, Scikit-learn, Pandas, NetworkX, and others help Python developers perform everything from linear algebra and dimensionality reduction, to clustering data and analyzing multigraphs.
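As a small (and entirely made-up) illustration of that range, the sketch below uses NumPy for a bit of linear algebra and scikit-learn for dimensionality reduction and clustering; nothing here comes from the talks themselves.

```python
# Toy example: linear algebra with NumPy, dimensionality reduction and
# clustering with scikit-learn. The data is random, purely for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 10)                       # 100 samples, 10 features

eigenvalues = np.linalg.eigvalsh(np.dot(X.T, X))   # linear algebra
X_2d = PCA(n_components=2).fit_transform(X)        # dimensionality reduction
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_2d)  # clustering

print(eigenvalues[-3:])    # three largest eigenvalues
print(labels[:10])         # cluster assignments for the first ten samples
```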

Back in the dark ages (about 10+ years ago), folks working in NLP usually maintained an assortment of homemade utility programs designed to handle many of the common tasks involved with NLP. Despite our best intentions, most of this code was lousy, brittle, and poorly documented -- hardly a good foundation upon which to build your masterpiece. Over the past several years, however, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. NLTK enables researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.

If you're already familiar with Python, the NLTK library will equip you with many powerful tools for working with text data. The O'Reilly book Natural Language Processing with Python written by Steven Bird, Ewan Klein, and Edward Loper offers an excellent overview of using NLTK for text analytics. Topics include processing raw text, tagging words, document classification, information extraction, and much more. Best of all, the entire contents of this NLTK book are freely available online under a Creative Commons license.

The Python portion of this joint meetup event will cover a handful of the NLP building blocks provided by NLTK, including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. These components will then be assembled to build a very basic document summarization program.
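To give a flavor of how those building blocks fit together, here is a rough frequency-based summarizer sketch; the scoring scheme is a simplification of our own, not the code that will be presented.

```python
# A rough frequency-based summarizer: tokenize, count content words, and
# score sentences by the frequencies of the words they contain. This is a
# simplified sketch, not the presenters' actual code.
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def summarize(text, n_sentences=2):
    stopwords = set(nltk.corpus.stopwords.words("english"))
    sentences = nltk.sent_tokenize(text)

    # Frequency analysis over lowercased content words
    words = [w.lower() for w in nltk.word_tokenize(text)
             if w.isalpha() and w.lower() not in stopwords]
    freq = Counter(words)

    # Score each sentence by summing the frequencies of its words
    def score(sentence):
        return sum(freq[w.lower()] for w in nltk.word_tokenize(sentence))

    return sorted(sentences, key=score, reverse=True)[:n_sentences]

article = ("NLTK provides reusable building blocks for text analytics. "
           "Those building blocks include tokenization, stemming, and "
           "frequency analysis. Frequency analysis helps pick out the "
           "sentences that matter most.")
print(summarize(article, n_sentences=1))
```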

Additional NLP resources in Python

- Natural Language Toolkit for Python (NLTK): http://www.nltk.org/
- Natural Language Processing with Python (book): http://oreilly.com/catalog/9780596516499/ (free online version: http://www.nltk.org/book/)
- Python Text Processing with NLTK 2.0 Cookbook (book): http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book
- Python wrapper for the Stanford CoreNLP Java library: https://pypi.python.org/pypi/corenlp
- guess_language (Python library for language identification): https://bitbucket.org/spirit/guess_language
- MITIE (new C/C++-based NER library from MIT with a Python API): https://github.com/mit-nlp/MITIE
- gensim (topic modeling library for Python): http://radimrehurek.com/gensim/

R is a programming language popular in statistics and machine learning research. R has several advantages in the ML/stat domains. R is optimized for vector operations. This simplifies programming since your code is very close to the math that you're trying to execute. R also has a huge community behind it; packages exist for just about any application you can think of. R has a close relationship with C, C++, and Fortran and there are R packages to execute Java and Python code, increasing its flexibility. Finally, the folks at CRAN are zealous about version control and compatibility, making installing R and subsequent packages a smooth experience.

However, R does have some sharp edges that become obvious when working with any non-trivially-sized linguistic data. R holds all data in your active workspace in RAM. If you are running R on a 32-bit system, you have a 4 GB limit on the RAM R can access. There are two implications of this: NLP data need to be stored in memory-efficient objects (more on that later) and (regrettably) there is still a hard limit on how much data you can work on at one time. There are packages, such as `bigmemory`, that are moving to address this, but they are outside the scope of this presentation. You also need to write efficient code; the size of NLP data will punish you for inefficiencies.

When, then, should you choose R? Every person and every problem is unique, but I can offer a few suggestions:

1. You are doing statistics/ML research and not developing software.
2. (Similar to 1.) You are a quantitative generalist (and probably good in R already) and NLP is just another feather in your cap.

Sometimes being a data scientist is about developing and tweaking your own algorithms. Sometimes being a data scientist is about taking others' algorithms, plugging in your data, and moving on to other areas of the problem. If you are doing more of the former, R is a solid choice. If you are doing more of the latter, R isn't too bad, but I've found that my own code often runs faster than some of the pre-packaged code. Your individual mileage may vary.

The second presentation at this meetup will cover the basics of reading documents into R and creating a document term matrix, then demonstrating some basic document summarization, keyword extraction, and document clustering techniques.
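The talk's code will be in R, but for readers following along in Python, here is a very rough analogue of the first two steps; the documents and parameters are invented, and this is not the presenter's code.

```python
# Rough Python analogue of the pipeline described above: build a
# document-term matrix, then cluster the documents. The actual talk uses R;
# these documents and parameters are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

docs = [
    "natural language processing with python",
    "text mining and natural language processing",
    "stock prices fell in morning trading",
    "markets rallied after the trading session",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)        # sparse document-term matrix

labels = KMeans(n_clusters=2, random_state=0).fit_predict(dtm)
print(labels)                               # cluster assignment per document
```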

Seats are filling up quickly, so RSVP here now: http://www.meetup.com/stats-prog-dc/events/177772322/

Event Recap: Jawbone UP & Behavior Change

Guest post blogger Jenna Dutcher is the community relations manager for UC Berkeley’s datascience@berkeley degree – the first and only online Master of Information and Data Science. Follow datascience@berkeley on Twitter and Facebook for news and updates.

May’s meeting of Action Design DC was sponsored by Fluencia and Hello Wallet and featured a talk by Kelvin Kwong of Jawbone. Kwong is a product manager for the UP band, a role that left him well prepared to speak to the group about Jawbone UP: Designing for Exercise and More. But what is the “more” in this scenario? Kwong explained that Jawbone UP tracks the rhyming actions of food, mood, sleep, and eat[ing], but their true purpose goes further than just data collection. In fact, the company’s bread and butter is in behavior change, the act of getting people to do the things Jawbone knows they want to do.

The company doesn’t do this by guessing or “intuition”; rather, they employ the latest behavioral science in their quest to turn intention into action.  “No matter where we're starting, we all want to be better,” Kwong said, and Jawbone has taken it upon themselves to make this into a reality.  Human nature often leaves a gap between intention and action; for example, people may want to eat healthier, finish their course of antibiotics, or quit smoking, but those actions are often more difficult than they seem at first glance.  Many don’t follow through.  The goal of lifestyle wearables is to drive behavior change.  They aim to do this with a three-step process, by tracking, understanding, and acting.

  • Track - The first step is to build something wearable, a device that customers will want to sport daily.  Constant monitoring will allow Jawbone to gather your data, which is a necessary first step before any behaviors can be changed.
  • Understand - Once data has been collected, scientists get to work comparing and contrasting, looking for correlations in the data that might hint towards causation. More on this below.
  • Act -  Here, Kwong says, scientists face an interesting dilemma: now that we understand you, how do we compel you to do the things you said you want to do?  Forty-seven percent of daily decisions are made unconsciously. Think about all of the tiny decisions we have in life that we don’t even think about, like taking the stairs instead of the elevator, or walking rather than driving your car; this is the area where the UP band wants to have a real impact.  Once scientists understand those ingrained habits, they can make an effort to change them.

What are the principles behind this? Kwong explained that we’re at an inflection point in behavioral science. There has been a lot of theory to lay the groundwork of this science, and now companies like Jawbone must figure out the practical applications: how can Jawbone scientists apply these tenets to encourage behavior change for their user base? In past decades, a lot of the research has been applied to consumer science, helping marketers advertise and upsell.

Jawbone doesn’t want to market, however; instead, they want to make an impactful difference in their customers’ lives.  Put a different way, Jawbone wants to create episodic interventions that lead to longitudinal interventions.  Luckily for them, they have a large pool of research subjects to draw from.  Rather than working with the tens or hundreds of subjects an academic researcher might have access to, Jawbone has data from hundreds of thousands of users at their fingertips.

Using this dataset, they conducted the biggest sleep study ever performed, on an aggregated 80 million nights of sleep, and observed interesting trends like gender and age gaps in amount of time slept.  While academic researchers have noted this in the past, it was never feasible to measure it across age groups, simply due to the difficulties of collecting a large enough sample from each group.  Using the UP band’s aggregate user data, however, Jawbone’s scientists were able to note a wide gap in average time spent sleeping between men and women at a college age, and a narrowing of this gap in retirement.

Let’s look at another real example: Jawbone’s extensive dataset allows their researchers to tie self-reported data (do you sleep with multiple pillows? Do you share a bed with a partner?  Is your mobile phone in the room during slumber?) to actual data (does your tracker show eight hours of sleep? Five times awakened?) and spot patterns in this aggregated dataset (for example, people who share a bed with a partner average 35 minutes more sleep each night).  After they notice these correlations, they can begin to apply a narrative and generate hypotheses. In this scenario, the presence of a partner illuminates a decision point, a sort of physically present bedtime reminder.

Kwong also demonstrated UP’s “Today I Will” feature, which is a very real manifestation of these behavior change efforts.  This feature of the UP band sends intelligent suggestions to users based on their behavior, and holds them accountable for completing the tasks (i.e., a daily increase in steps, sleep, or water intake).  Research has shown that if you say you’re going to do something, the likelihood is much higher that you’ll complete that task.

This is what’s known as the “foot in the door” technique - in Jawbone’s case, if you ask someone to use the Today I Will feature and announce their daily goal, they’re much more likely to hit that target.  In fact, opt-ins to the Today I Will feature go to bed 23 minutes earlier than those who have not pledged to modify this behavior.  In addition, there’s a 72% increased likelihood that these people will go to bed early enough to hit their sleep goal.

These behavioral change results have also been replicated with an activity goal.  Thanksgiving is one of the days where Jawbone users take the fewest daily steps (too busy eating turkey and catching up with family!).  In one experiment, the UP band’s Today I Will feature simply told users that they were less likely to get their steps on this day, and challenged them to get more steps.  The result?  Opt-ins walked 20% more on Thanksgiving than on their average days simply because they had publicly announced this goal.  The “foot in the door” technique is successful for one key reason: reaching your expressed goal allows you to remain congruent with your self image.

Above all else, Jawbone wants to create a product that solves a true human need and can be fully integrated into a user’s life.  Audience members raised questions about the accuracy of a wrist-based step tracker, but Kwong assured them that the product has been subjected to innumerable tests in an effort to increase step count accuracy.  More interestingly, he said, the future of activity tracker success isn't in incremental accuracy gains, but rather in what can be done with the data collected.  As he said, a few hundred steps gained or lost won’t make a difference in the long run; it’s the behavioral influences brought on by challenges like Today I Will that let Jawbone know they’re really on to something special with the UP band.

Want more information? Kelvin can be found on twitter at @kelvinskwong.

A Rush of Ideas: Kalev Leetaru at Data Science DC

This review of the April Data Science DC Meetup was written by Ross Mohan. Ross is a solutions architect for Five 9 Group.

Perhaps you’ve heard the phrase lately “software is eating the world”. Well, to be successful at that, it’s going to have to do at least as good a job of eating the world’s data as do the systems of Kalev Leetaru, Georgetown/Yahoo! fellow.

Kalev Leetaru, lead investigator on GDELT and other tools, defines “world class” work — certainly in the sense of size and scope of data. The goal of GDELT and related systems is to stream global news and social media in as near realtime as possible through multiple steps. The overall goal is to arrive at reliable tone (sentiment) mining and differential conflict detection and to do so …. globally. It is a grand goal.

Kalev Leetaru’s talk covered several broad areas: the history of data and communication, data quality and “gotcha” issues in data sourcing and curation, the geography of Twitter, processing architecture, toolkits and considerations, and data formatting observations. In each he had a fresh perspective or a novel idea, born of the requirement to handle enormous quantities of ‘noisy’ or ‘dirty’ data.

Perspectives

Leetaru observed that “the map is not the territory” in the sense that actual voting, resource, or policy boundaries as measured by various data sources may not match assigned national boundaries. He flagged this as a question of “spatial error bars” for maps.

Distinguishing global data science from established HPC-like pursuits (such as computational chemistry), Kalev Leetaru observed that we make our own bespoke toolkits, and that there is no single ‘magic toolkit’ for Big Data, so we should be prepared and willing to spend time putting our toolchain together.

After talking a bit about the historical evolution and growth of data, Kalev Leetaru asked a few perspective-changing questions (some clearly relevant to intelligence agency needs): How to find all protests? How to locate all law books? Some of the more interesting data curation tools and resources Kalev Leetaru mentioned — and a lot more — might be found by the interested reader in The Oxford Guide to Library Research by Thomas Mann.

GDELT (covered further below) labels parse trees with error rates, and reaches beyond the “WHAT” of simple news media to tell us WHY, and ‘how reliable’. One GDELT output product among many is the Daily Global Conflict Report, which covers world leader emotional state and differential change in conflict, not absolute markers.

One recurring theme was to find ways to define and support “truth.” Kalev Leetaru decried one current trend in Big Data, the so-called “Apple Effect”: making luscious pictures from data, with more focus on appearance than actual ground truth. One example he cited was a conclusion from a recent report on Syria, which -- blithely based on geotagged English-language tweets and Facebook postings -- cast a skewed light on Syria’s rebels (Bzzzzzt!).

Twitter

Leetaru provided one answer on “how to ‘ground truth’ data” by asking “how accurate are geotagged tweets?” Such tweets are, after all, only 3% of the total. But he reliably used those tweets. How? By correlating location to electric power availability (r = .89). He talked also about how to handle emoticons, irony, sarcasm, and other affective language, cautioning analysts to think beyond blindly plugging data into pictures.

Kalev Leetaru talked engagingly about the geography of Twitter, encouraging us to do more RTFD (D = data) than RTFM. Cut your own way through the forest: the valid maps have not been made yet, so be prepared to make your own. Some of the challenges he cited were how to break up typical #hashtagswithnowhitespace and put them back into sentences, and how to build — and maintain — sentiment/tone dictionaries; expect, therefore, to spend the vast majority of time in innovative projects hand-tuning the algorithms, understanding the data, and then iterating the machine. Refreshingly “hands on.”

Scale and Tech Architecture

Kalev Leetaru turned to discuss the scale of data, which is now easily generated in the petabytes-per-day range. There is no longer any question that automation must be used and that serious machinery will be involved. Our job is to get that automation machinery doing the right thing, and if we do so, we can measure the ‘heartbeat of society.’

For a book images project (60 million images across hundreds of years) he mentioned a number of tools and file systems (but neither Gluster nor Ceph, disappointingly to this reviewer!) and delved deeply and masterfully into the question of how to clean and manage the very dirty data of “closed captioning” found in news reports. To full-text geocode and analyze half a million hours of news (from the Internet Archive), we need fast language detection and captioning error assessment. What makes this task horrifically difficult is that POS tagging “fails catastrophically on closed captioning” and that closed captioning is worse, far worse, in terms of quality than Optical Character Recognition. The standard Stanford NL Understanding toolkit is very “fragile” in this domain, one reason being that news media has an extremely high density of location references, forcing the analyst into using context to disambiguate.

He covered his GDELT (Global Database of Events, Language, and Tone), which captures human/societal behavior and beliefs at scale around the world. A system of half a billion plus georeferenced rows, 58 columns wide, comprising 100,000 sources such as broadcast, print, and online media back to 1979, it relies on both human translation and Google Translate, and will soon be extended across languages and back to the 1800s. Further, he’s incorporating 21 billion words of academic literature into this model (a first!) and expects availability in Summer 2014 (sources include JSTOR, DTIC, CIA, CVORE CiteSeerX, IA).

GDELT’s architecture, which relies heavily on the Google Cloud and BigQuery, can stream at 100,000 input observations/second. This reviewer wanted to ask him about update and delete needs and speeds, but the stream is designed to optimize ingest and query. GDELT tools were myriad, but Perl was frequently mentioned (for text processing).

Kalev Leetaru shared some takeaways from building GDELT — “it’s not all English” and “watch out for full Unicode compliance” in your toolset, lest your lovely data processing stack SEGFAULT halfway through a load. Store data in whatever is easy to maintain and fast. Modularity is good but performance can be an issue; watch out for XML, which bogs down processing on highly nested data, and use it for interchange more than anything else. Sharing seems “nice,” but “you can’t share a graph,” and “RAM disk is your friend,” more so even than SSD, FusionIO, or fast SANs.

The talk, like this blog post, ran over its allotted space and time, but it was well worth the effort spent understanding it.

The Evolution of Big Data Platforms and People

This is a guest post by Paco Nathan. Paco is an O’Reilly author, an Apache Spark open source evangelist with Databricks, and an advisor for Zettacap, Amplify Partners, and The Data Guild. Google lives in his family’s backyard. Paco spoke at Data Science DC in 2012.

A kind of “middleware” for Big Data has been evolving since the mid-2000s. Abstraction layers help make it simpler to write apps in frameworks such as Hadoop. Beyond the relatively simple issue of programming convenience, there are much more complex factors in play. Several open source frameworks have emerged that build on the notion of workflow, exemplifying highly sophisticated features. My recent talk, Data Workflows for Machine Learning, considers several OSS frameworks in that context, developing a kind of “scorecard” to help assess best-of-breed features. Hopefully it can help your decisions about which frameworks suit your use case needs.

By definition, a workflow encompasses both the automation that we’re leveraging (e.g., machine learning apps running on clusters) as well as people and process. In terms of automation, some larger players have departed from “conventional wisdom” for their clusters and ML apps. For example, while the rest of the industry embraced virtualization, Google avoided that path by using cgroups for isolation. Twitter sponsored a similar open source approach, Apache Mesos, which was credited with helping resolve their “Fail Whale” issues prior to their IPO. As other large firms adopt this strategy, the implication is that VMs may have run out of steam. Certainly, single-digit utilization rates at data centers (the current industry norm) will not scale to handle IoT data rates: energy companies could not handle that surge, let alone the enormous cap-ex implied. I'll be presenting on Datacenter Computing with Apache Mesos next Tuesday at the Big Data DC Meetup, held at AddThis. We’ll discuss the Mesos approach of mixed workloads for better elasticity, higher utilization rates, and lower latency.

On the people side, a very different set of issues looms ahead. Industry is retooling on a massive scale. It’s not about buying a whole new set of expensive tools for Big Data. Rather it’s about retooling how people in general think about computable problems. One vital missing component may well be advanced math in the hands of business leaders. Seriously, we still frame requirements for college math in Cold War terms: years of calculus were intended to filter out the best Mechanical Engineering candidates, who could then help build the best ICBMs. However, in business today the leadership needs to understand how to contend with enormous data rates and meanwhile deploy high-ROI apps at scale: how and when to leverage graph queries, sparse matrices, convex optimization, Bayesian statistics – topics that are generally obscured beyond the “killing fields” of calculus.

A new book by Allen Day and me in development at O’Reilly called “Just Enough Math” introduces advanced math for business people, especially to learn how to leverage open source frameworks for Big Data – much of which comes from organizations that leverage sophisticated math, e.g., Twitter. Each morsel of math is considered in the context of concrete business use cases, lots of illustrations, and historical background – along with brief code examples in Python that one can cut & paste.

This next week in the DC area I will be teaching a full-day workshop that includes material from all of the above:

Machine Learning for Managers
Tue, Apr 15, 8:30am–4:30pm (Eastern)
MicroTek, 1101 Vermont Ave NW #700, Washington, DC 20005

That workshop provides an introduction to ML – something quite different than popular MOOCs or vendor training – with emphasis placed as much on the “soft skills” as on the math and coding. We’ll also have a drinkup in the area, to gather informally and discuss related topics in more detail:

Drinks and Data Science
Wed, Apr 16, 6:30–9:00pm (Eastern)
Location TBD

Looking forward to meeting you there!

Deep Learning Inspires Deep Thinking

This is a guest post by Mary Galvin, founder and managing principal at AIC. Mary provides technical consulting services to clients including LexisNexis’ HPCC Systems team. The HPCC is an open source, massive parallel-processing computing platform that solves Big Data problems. 

Data Science DC hosted a packed house at the Artisphere on Monday evening, thanks to the efforts of organizers Harlan Harris, Sean Gonzalez, and several others who helped plan and coordinate the event. Michael Burke, Jr, Arlington County Business Development Manager, provided opening remarks and emphasized Arlington’s commitment to serving local innovators and entrepreneurs. Michael subsequently introduced Sanju Bansal, a former MicroStrategy founder and executive who presently serves as the CEO of an emerging, Arlington-based start-up, Hunch Analytics. Sanju energized the audience by providing concrete examples of data science’s applicability to business; this was no better illustrated than by the $930 million acquisition of Climate Corp. roughly six months ago.

Michael, Sanju, and the rest of the Data Science DC team helped set the stage for a phenomenal presentation put on by John Kaufhold, Managing Partner and Data Scientist at Deep Learning Analytics. John started his presentation by asking the audience for a show of hands on two items: 1) whether anyone was familiar with deep learning, and 2) of those who said yes to #1, whether they could explain what deep learning meant to a fellow data scientist. Of the roughly 240 attendees present, most of the hands raised for question #1 dropped when John posed question #2.

I’ll be the first to admit that I was unable to raise my hand for either of John’s introductory questions. The fact I was at least a bit knowledgeable in the broader machine learning topic helped to somewhat put my mind at ease, thanks to prior experiences working with statistical machine translation, entity extraction, and entity resolution engines. That said, I still entered John’s talk fully prepared to brace myself for the ‘deep’ learning curve that lay ahead. Although I’m still trying to decompress from everything that was covered – it being less than a week since the event took place – I’d summarize key takeaways from the densely-packed, intellectually stimulating, 70+ minute session that ensued as follows:

  1. Machine learning’s dirty work: labelling and feature engineering. John introduced his topic by using examples from image and speech recognition to illustrate two mandatory (and often less-than-desirable) undertakings in machine learning: labelling and feature engineering. In the case specific to image recognition, say you wanted to determine whether a photo, ‘x’, contained an image of a cat, ‘y’ (i.e., p(y|x)). This would typically involve taking a sizable database of images and manually labelling which subset of those images were cats. The human-labeled images would then serve as a body of knowledge upon which features representative of those cats would be generated, as required by the feature engineering step in the machine learning process. John emphasized the laborious, expensive, and mundane nature of feature engineering, using his own experiences in medical imaging to prove his point.

    Above said, various machine learning algorithms could use the fruits of the labelling and feature engineering labors to discern a cat within any photo – not just those cats previously observed by the system. Although there’s no getting around machine learning’s dirty work to achieve these results, the emergence of deep learning has helped to lessen it.

  2. Machine Learning’s ‘Deep’ Bench. I entered John’s presentation knowing a handful of machine learning algorithms but left realizing my knowledge had barely scratched the surface. Cornell University’s machine learning benchmarking tests can serve as a good reference point for determining which algorithm to use, provided the results are taken into account with the wider, ‘No Free Lunch Theorem’ consideration that even the ‘best’ algorithm has the potential to perform poorly on a subclass of problems.

    Given machine learning’s ‘deep’ bench, the neural network might have been easy to overlook just 10 years ago. Not only did it place 10th in Cornell’s 2004 benchmarking test, but John enlightened us to its fair share of limitations: inability to learn p(x), inefficiencies with layers greater than 3, overfitting, and relatively slow performance.

  3. The Restricted Boltzmann Machine’s (RBM’s) revival of the neural network. The year 2006 witnessed a breakthrough in machine learning, thanks to the efforts of an academic triumvirate consisting of Geoff Hinton, Yann LeCun, and Yoshua Bengio. I’m not going to even pretend like I understand the details, but will just say that their application of the Restricted Boltzmann Machine (RBM) to neural networks has played a major role in eradicating the neural network’s limitations outlined in #2 above. Take, for example, ‘inability to learn p(x)’. Going back to the cat example in #1, what this essentially states is that before the triumvirate’s discovery, the neural net was incapable of using an existing set of cat images to draw a new image of a cat. Figuratively speaking, not only can neural nets now draw cats, but they can do so with impressive time metrics thanks to the emergence of the GPU. Stanford, for example, was able to process 14 terabytes of images in just 3 hours through overlaying deep learning algorithms on top of a GPU-centric computer architecture. What’s even better? The fact that many implementations of the deep learning algorithm are openly available under the BSD licensing agreement.

  4. Deep learning’s astonishing results. Deep learning has experienced an explosive amount of success in a relatively small amount of time. Not only have several international image recognition contests been recently won by those who used deep learning, but technology powerhouses such as Google, Facebook, and Netflix are investing heavily in the algorithm’s adoption. For example, deep learning triumvirate member Geoff Hinton was hired by Google in 2013 to help the company make sense of their massive amounts of data and to optimize existing products that use machine learning techniques. Fellow deep learning triumvirate member Yann LeCun was hired by Facebook, also in 2013, to help integrate deep learning technologies into the company’s IT systems.

As for all the hype surrounding deep learning, John concluded his presentation by suggesting ‘cautious optimism in results, without reckless assertions about the future’. Although it would be careless to claim that deep learning has cured disease, for example, one thing most certainly is for sure: deep learning has inspired deep thinking throughout the DC metropolitan area.

As to where deep learning has left our furry feline friends, the attached YouTube video will further explain….

(created by an anonymous audience member following the presentation)

You can see John Kaufhold's slides from this event here.

Facility Location Analysis Resources Incorporating Travel Time

This is a guest blog post by Alan Briggs. Alan is an operations researcher and data scientist at Elder Research. Alan and Harlan Harris (DC2 President and Data Science DC co-organizer) have co-presented a project on location analysis and Meetup location optimization at the Statistical Programming DC Meetup and an INFORMS-MD chapter meeting. Recently, Harlan presented a version of this work at the New York Statistical Programming Meetup. There was some great feedback on the Meetup page asking for some additional resources. This post by Alan is in response to that question.

If you’re looking for a good text resource to learn some of the basics about facility location, I highly recommend grabbing a chapter of Dr. Michael Kay’s e-book (PDF), available for free from his logistics engineering website. He gives an excellent overview of some of the basics of facility location, including single facility location, multi-facility location, facility location-allocation, etc. At ~20 pages, it’s entirely approachable, but technical enough to pique the interest of the more technically-minded analyst. For a deeper dive into some of the more advanced research in this space, I’d recommend using some of the subject headings in his book as seeds for a simple search on Google Scholar. It’s nothing super fancy, but there are plenty of articles in the public domain that relate to minisum/minimax optimization and all of their narrowly tailored implementations.

One of the really nice things about the data science community is that it is full of people with an inveterate dedication to critical thinking. There is nearly always some great [constructive] criticism about what didn’t quite work or what could have been a little bit better. One of the responses to the presentation recommended optimizing with respect to travel time instead of distance. Obviously, in this work, we’re using Euclidean distance as a proxy for time. Harlan cites laziness as the primary motivation for this shortcut, and I’ll certainly echo his response. However, a lot of modeling boils down to cutting your losses, getting a good enough solution, or trying to find the balance among your own limited resources (e.g., time, data, technique, etc.). For the problem at hand, clearly the goal is to make Meetup attendance convenient for the greatest number of people; we want to get people to Meetups. But, for our purposes, a rough order of magnitude was sufficient. Harlan humorously points out that the true optimal location for a version of our original analysis in DC was an abandoned warehouse—if it really was an actual physical location at all. So, when you really just need a good solution, and the precision associated with true optimality is unappreciated, distance can be a pretty good proxy for time.

A lot of times good enough works, but there are some obvious limitations. In statistics, the law of large numbers holds that independent measurements of a random quantity tend toward the theoretical average of that quantity. For this reason, logistics problems can be more accurately estimated when they involve a large number of entities (e.g. travelers, shipments, etc.). For the problem at hand, if we were optimizing facility location for 1,000 or 10,000 respondents, again, using distance as a proxy for time, we would feel much better about the optimality of the location. I would add that similar to large quantities, introducing greater degrees of variability can also serve to improve performance. Thus, optimizing facility location across DC or New York may be a little too homogeneous. If instead, however, your analysis uses data across a larger, more heterogeneous area, like say an entire state where you have urban, suburban and rural areas, you will again get better performance on optimality.

Let’s say you’ve weighed the pros and cons and you really want to dive deeper into optimizing based on travel time; there are a couple of different options you can consider. First, applying a circuity factor to the Euclidean distance can account for non-direct travel between points. So, to go from point A to point B may actually take 1.2 units as opposed to the straight-line distance of 1 unit. That may not be as helpful for a really complex urban space where feasible routing is not always intuitive and travel times can vary wildly. However, it can give some really good approximations, again, over larger, more heterogeneous spaces. An extension to a single circuity factor would be to introduce a gradient circuity factor that is proportional to population density. There are some really good zip code data available that can be used to estimate population.

Increasing the circuity factor in higher-density locales and decreasing it in lower-density ones can help by providing a more realistic assessment of how far off the straight-line distance the average person would have to commute. For the really enthusiastic modeler who has some good data skills and is looking for even more integrity in their travel time analysis, there are a number of websites that provide road network information. They list roads across the United States by functional class (interstate, expressway and principal arterial, urban principal arterial, collector, etc.) and even provide shapefiles for GIS. I've done basic speed limit estimation by functional class, but you could also introduce a speed gradient with respect to population density (as mentioned above for the circuity factor). You could also derive some type of inverse speed distribution that slows traffic at certain times of the day, based on rush hour travel in or near urban centers.
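To make the circuity idea concrete, here is a minimal sketch of a minisum objective with a density-dependent circuity factor; the coordinates, density scores, and factor range are all invented for illustration.

```python
# Minisum facility-location objective with a density-dependent circuity
# factor. Coordinates, density scores, and factor values are invented.
import math

# Attendee locations (treated as planar x, y for simplicity), each paired
# with a rough population-density score for the area they travel through.
attendees = [((0.0, 0.0), 0.9), ((1.0, 2.0), 0.4),
             ((3.0, 1.0), 0.2), ((2.0, 3.0), 0.7)]
candidates = {"Site A": (1.0, 1.0), "Site B": (2.0, 2.0)}

def circuity(density, low=1.1, high=1.5):
    # Denser areas get a larger circuity factor (more non-direct travel).
    return low + (high - low) * density

def adjusted_distance(site, attendee):
    (x, y), density = attendee
    straight = math.hypot(site[0] - x, site[1] - y)
    return circuity(density) * straight

def total_distance(site):
    # Minisum objective: total adjusted distance over all attendees.
    return sum(adjusted_distance(site, a) for a in attendees)

best = min(candidates, key=lambda name: total_distance(candidates[name]))
print(best, round(total_distance(candidates[best]), 2))
```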

In the words of George E. P. Box, “all models are wrong, but some models are useful.” If you start with a basic location analysis and find it to be wrong but useful, you may have done enough. If not, however, perhaps you can make it useful by increasing complexity in one of the ways I have mentioned above.