Data Science DC

Socially Responsible Algorithms at Data Science DC

Troubling instances of the mosaic effect, in which different anonymized datasets are combined to reveal unintended details, include the tracking of celebrity cab trips and the identification of Netflix user profiles. Also concerning is the tremendous influence wielded by corporations and their massive data stores, most notoriously embodied by Facebook’s secret psychological experiments.

Problems with the p-value -- References

On December 11th, Prof. Regina Nuzzo from Gallaudet University spoke at Data Science DC about Problems with the p-value. The event was well received. If you missed it, the slides and audio are available. Here we provide Dr. Nuzzo's references and links from the talk, which are on their own a great resource for those considering communication about statistical reliability. (Note that the five topics she covered used examples from highly publicized studies of sexual behavior.)

Event Recap: DSDC June Meetup

This is a guest post by Alex Evanczuk, a software engineer at FiscalNote. Hello DC2!  My name is Alex Evanczuk, and I recently joined a government data startup right here in the nation's capital that goes by the name of FiscalNote. Our mission is to make government data easily accessible, transparent, and understandable for everyone. We are a passionate group of individuals and are actively looking for other like-minded people who want to see things change. If this is you, and particularly if you are a software developer (front-end, with experience in Ruby on Rails), please reach out to me at alex@fiscalnote.com and I can put you in touch with the right people.

The topics covered by the presenters at June’s Data Science DC Meetup were varied and interesting. Subjects included spatial forecasting in uncertain environments, cell phone surveys in Africa (GeoPoll), causal inference models for improving the lives and prospects of Children and Youth (Child Trends), and several others.

I noticed a number of fascinating trends about the presentations I saw. The first was the simple and unadulterated love of numbers and their relationships to one another. Each presenter proudly explained the mathematical underpinnings of the models and assumptions used in their research, and most had slides that contained nothing more than a single formula or graph. In my brief time in academia, I've noticed that to most statisticians and mathematicians, numbers are their poetry, and this rang true at the event as well.

The second trend is perhaps well known to data researchers but less so to others: the advantages and influence of data science can extend into any industry. From business to social work to education to healthcare, data science can find a way to improve our understanding of any field.

More important than the numbers, however, is the fact that behind every data point, integer, and graph, is a human being. The human beings behind our data inspire our use of numbers and their deep understanding to develop axiomatically correct solutions for real world problems. The researchers presented data that told us how we might better understand emotional sentiment in developing countries, or make decisions on cancer treatments, or help children reach their boundless potential. For me, this is what data science is all about--how the appreciation of mathematics can help us improve the lives of human beings.

Missed the Meetup? You can review the audio files from the event here and access the slide deck here.

Where are the Deep Learning Courses?

This is a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC.

Why aren't there more Deep Learning talks, tutorials, or workshops in DC2?

It's been about two months since my Deep Learning talk at Artisphere for DC2. Again, thanks to the organizers (especially Harlan Harris and Sean Gonzalez) and the sponsors (especially Arlington Economic Development). We had a great turnout and a lot of good questions that night. Since the talk and at other Meetups since, I've been encouraged by the tidal wave of interest from teaching organizations and prospective students alike.

First some preemptive answers to the “FAQ” downstream of the talk:

  • Mary Galvin wrote a blog review of this event.
  • Yes, the slides are available.
  • Yes, corresponding audio is also available (thanks Geoff Moes).
  • A recently "reconstructed" talk combining the slides and audio is also now available!
  • Where else can I learn more about Deep Learning as a data scientist? (This may be a request to teach, a question about how to do something in Deep Learning, a question about theory, or a request to do an internship. They're all basically the same thing.)

It's this last question that's the focus of this blog post. Lots of people have asked and there are some answers out there already, but if people in the DC MSA are really interested, there could be more. At the end of this post is a survey. If you want more Deep Learning, let DC2 know what you want and together we'll figure out what we can make happen.

There actually was a class...

Aaron Schumacher and Tommy Shen invited me to come talk in April for General Assemb.ly's Data Science course. I did teach one Deep Learning module for them. That module was a slightly longer version of the talk I gave at Artisphere combined with one abbreviated “hands on” module on unsupervised feature learning based on Stanford's tutorial. It didn't help that the tutorial was written in Octave and the class had mostly been using Python up to that point. Though feedback was generally positive for the Deep Learning module, some students wondered if they could get a little more hands on and focus on specifics. And I empathize with them. I've spent real money on Deep Learning tutorials that I thought could have been much more useful if they were more hands on.

Though I've appreciated all the invitations to teach courses, workshops, or lectures, except for the General Assemb.ly course, I've turned down all the invitations to teach something more on Deep Learning. This is not because the data science community here in DC is already expert in Deep Learning or because it's not worth teaching. Quite the opposite. I've not committed to teach more Deep Learning mostly because of these three reasons:

  1. There are already significant Deep Learning tutorial resources out there,
  2. There are significant front-end investments that neophytes need to make for any workshop or tutorial to be valuable to both the class and the instructor, and
  3. I haven't found a teaching model in the DC MSA that convinces me teaching a “traditional” class in the formal sense is a better investment of time than instruction through project-based learning on research work contracted through my company.

Resources to learn Deep Learning

There are already many freely available resources to learn the theory of Deep Learning, and it's made even more accessible by many of the very lucid authors who participate in this community. My talk was cherry-picked from a number of these materials and news stories. Here are some representative links that can connect you to much of the mainstream literature and discussion in Deep Learning:

  • The tutorials link on the DeepLearning.net page
  • NYU's Deep Learning course material
  • Yann LeCun's overview of Deep Learning with Marc'Aurelio Ranzato
  • Geoff Hinton's Coursera course on Neural Networks
  • A book on Deep Learning from the Microsoft Speech Group
  • A reading list from Carnegie Mellon with student notes on many of the papers
  • A Google+ page on Deep Learning

This is the first reason I don't think it's all that valuable for DC to have more of its own Deep Learning “academic” tutorials. And by “academic” I mean tutorials that don't end with students leaving the class successfully implementing systems that learn representations to do amazing things with those learned features. I'm happy to give tutorials in that “academic” direction or shape them based on my own biases, but I doubt I'd improve on what's already out there. I've been doing machine learning for 15 years, so I start with some background to deeply appreciate Deep Learning, but I've only been doing Deep Learning for two years now. And my expertise is self-taught. And I never did a post-doc with Geoff Hinton, Yann LeCun or Yoshua Bengio. I'm still learning, myself.

The investments to go from 0 to Deep Learning

It's a joy to teach motivated students who come equipped with all the prerequisites for really mastering a subject. That said, teaching a less equipped, uninvested and/or unmotivated studentry is often an exercise in joint suffering for both students and instructor.

I believe the requests to have a Deep Learning course, tutorial, workshop or another talk are all well intentioned... Except for Sean Gonzalez—it creeps me out how much he wants a workshop. But I think most of this eager interest in tutorials overlooks just how much preparation a student needs to get a good return on their time and tuition. And if they're not getting a good return, what's the point? The last thing I want to do is give the DC2 community a tutorial on “the Past” of neural nets. Here are what I consider some practical prerequisites for folks to really get something out of a hands-on tutorial:

  • An understanding of machine learning, including
    • optimization and stochastic gradient descent
    • hyperparameter tuning
    • bagging
    • at least a passing understanding of neural nets
  • A pretty good grasp of Python, including
    • a working knowledge of how to configure different packages
    • some appreciation for Theano (warts and all)
    • a good understanding of data preparation
  • Some recent CUDA-capable NVIDIA GPU hardware* configured for your machine
    • CUDA drivers
    • NVIDIA's CUDA examples compiled

*hardware isn't necessarily a prerequisite, but I don't know how you can get an understanding of any more than toy problems on a CPU
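
As a first sanity check on the hardware items in the list above, a short script can at least confirm that the NVIDIA driver and CUDA compiler tools are visible on your PATH. This is only a minimal sketch: it assumes the standard tool names `nvidia-smi` (ships with the driver) and `nvcc` (ships with the CUDA toolkit), the helper name `find_cuda_tools` is made up for illustration, and finding the tools does not prove the GPU itself is usable.

```python
import shutil


def find_cuda_tools(tool_names=("nvidia-smi", "nvcc")):
    """Map each tool name to its location on PATH, or None if it is not found.

    Hypothetical helper for illustration; it only checks that the
    executables exist, not that a working GPU is attached.
    """
    return {name: shutil.which(name) for name in tool_names}


if __name__ == "__main__":
    for name, path in find_cuda_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If either tool is missing, the driver or toolkit installation is the place to start before attempting any hands-on tutorial.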

Resources like the ones above are great for getting a student up to speed on the “academic” issues of understanding deep learning, but that only scratches the surface. Once students know what can be done, if they’re anything like me, they want to be able to do it. And at that point, students need a pretty deep understanding of not just the theory, but of both hardware and software to really make some contributions in Deep Learning. Or even apply it to their problem.

Starting with the hardware, let's say, for sake of argument, that you work for the government or are for some other arbitrary reason forced to buy Dell hardware. You begin your journey justifying the $4000 purchase for a machine that might be semi-functional as a Deep Learning platform when there's a $2500 guideline in your department. Individual Dell workstations are like Deep Learning kryptonite, so even if someone in the n layers of approval bureaucracy somehow approved it, it's still the beginning of a frustrating story with an unhappy ending. Or let's say you build your own machine. Now add “building a machine” for a minimum of about $1500 to the prerequisites. But to really get a return in the sweet spot of those components, you probably want to spend at least $2500. Now the prerequisites include a dollar investment in addition to talent and tuition! Or let’s say you’re just going to build out your three-year-old machine you have for the new capability. Oh, you only have a 500W power supply? Lucky you! You’re going shopping! Oh, your machine has an ATI graphics card. I’m sure it’s just a little bit of glue code to repurpose CUDA calls to OpenCL calls for that hardware. Let's say you actually have an NVIDIA card (at least as recent as a GTX 580) and wanted to develop in virtual machines, so you need PCI pass-through to reach the CUDA cores. Lucky you! You have some more reading to do! Pray DenverCoder9's made a summary post in the past 11 years.

“But I run everything in the cloud on EC2,” you say! It's $0.65/hour for G2 instances. And those are the cheap GPU instances. Back of the envelope, it took a week of churning through 1.2 million training images with CUDA convnets (optimized for speed) to produce a breakthrough result. At $0.65/hour, you get maybe 20 or 30 tries doing that before it would have made more sense to have built your own machine. This isn't a crazy way to learn, but any psychological disincentive to experimentation, even $0.65/hour, seems like an unnecessary distraction. I also can't endorse the idea of “dabbling” in Deep Learning; it seems akin to “dabbling” in having children—you either make the commitment or you don't.
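
The back-of-the-envelope comparison above can be written out as a few lines of arithmetic. This sketch uses only figures quoted in this post ($0.65/hour, one week per training run, and the roughly $2500 "sweet spot" build); it ignores electricity, storage, and data transfer, so treat the result as a rough order of magnitude.

```python
# Figures taken from the post; everything else about real costs is ignored.
HOURLY_RATE = 0.65       # dollars per hour for a G2 instance
HOURS_PER_RUN = 7 * 24   # one week of training per attempt
BUILD_COST = 2500.0      # dollars for a self-built machine


def cost_per_run(hourly_rate, hours):
    """Rental cost in dollars of a single training run."""
    return hourly_rate * hours


def break_even_runs(build_cost, hourly_rate, hours):
    """Number of rented runs whose total cost equals the build cost."""
    return build_cost / cost_per_run(hourly_rate, hours)


run_cost = cost_per_run(HOURLY_RATE, HOURS_PER_RUN)
runs = break_even_runs(BUILD_COST, HOURLY_RATE, HOURS_PER_RUN)
print(f"One week-long run costs about ${run_cost:.2f}")
print(f"Runs before a ${BUILD_COST:.0f} build pays for itself: {runs:.1f}")
```

At these numbers a single run costs about $109, so the break-even point falls a little under 23 runs, in line with the "20 or 30 tries" estimate.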

At this point, I’m not aware of an “import deeplearning” package in Python that can then fit a nine layer sparse autoencoder with invisible CUDA calls to your GPU on 10 million images at the ipython command line. Though people are trying. That's an extreme example, but in general, you need a flexible, stable codebase to even experiment at a useful scale—and that's really what we data scientists should be doing. Toys are fine and all, but if scale up means a qualitatively different solution, why learn the toy? And that means getting acquainted with the pros and cons of various codebases out there. Or writing your own, which... Good luck!

DC Metro-area teaching models

I start from the premise that no good teacher in the history of teaching has ever been rewarded appropriately with pay for their contributions and most teaching rewards are personal. I accept that premise. And this is all I really ever expect from teaching. I do, however, believe teaching is becoming even less attractive to good teachers every year at every stage of lifelong learning. Traditional post-secondary instructional models are clearly collapsing. Brick and mortar university degrees often trap graduates in debt at the same time the universities have already outsourced their actual teaching mission to low-cost adjunct staff and diverted funds to marketing curricula rather than teaching them. For-profit institutions are even worse. Compensation for a career in public education has never been particularly attractive, but still there have always been teachers who love to teach, are good at it, and do it anyway. However, new narrow metric-based approaches that hold teachers responsible for the students they're dealt rather than the quality of their teaching can be demoralizing for even the most self-possessed teachers. These developments threaten to reduce that pool of quality teachers to a sparse band of marginalized die-hards. But enough of my view of “teaching” the way most people typically blindly suggest I do it. The formal and informal teaching options in the DC MSA mirror these broader developments. I run a company with active contracts and however much I might love teaching and would like to see a well-trained crop of deep learning experts in the region, the investment doesn't add up. So I continue to mentor colleagues and partners through contracted research projects.

I don't know all the models for teaching and haven't spent a lot of time understanding them, but none seem to make sense to me in terms of time invested to teach students—partly because many of them really can't get at the hardware part of the list of prerequisites above. This is my vague understanding of compensation models generally available in the online space*:

  • Udemy – produce and own a "digital asset" of the course content and sell tuition and advertising as a MOOC. I have no experience with Udemy, but some people seemed happy to have made $20,000 in a month. Thanks to Valerie at Feastie for suggesting this option.
  • Statistics.com – Typically a few thousand for four sessions that Statistics.com then sells; I believe this must be a “work for hire” copyright model for the digital asset that Statistics.com buys from the instructor. I assume it's something akin to commissioned art, that once you create, you no longer own. [Editor’s note: Statistics.com is a sponsor of Data Science DC. The arrangement that John describes is similar to our understanding too.]
  • Myngle – Sell lots of online lessons for typically less than a 30% share.

And this is my understanding of compensation models locally available in the DC MSA*:

  • General Assemb.ly – Between 15% and 20% of tuition (where tuition may be $4000/student for a semester class).
  • District Data Labs Workshop – Splits total workshop tuition or profit 50% with the instructor—which may be the best deal I've heard, but 50% is a lot to pay for advertising and logistics. [Editor's note: These are the workshops that Data Community DC runs with our partner DDL.]
  • Give a lecture – typically a one time lecture with a modest honorarium ($100s) that may include travel. I've given these kinds of lectures at GMU and Marymount.
  • Adjunct at a local university – This is often a very labor- and commute-intensive investment and pays no better (with no benefits) than a few thousand dollars. Georgetown will pay about $200 per contact hour with students. Assuming there are three hours of out of classroom commitment for every hour in class, this probably ends up somewhere in the $50 per hour range. All this said, this was the suggestion of a respected entrepreneur in the DC region.
  • Tenure-track position at a local university – As an Assistant Professor, you will typically have to forego being anything but a glorified post-doc until your tenure review. And good luck convincing this crowd they need you enough to hire you with tenure.

*These are what I understand to be the approximate options and if you got a worse or better deal, please understand I might be wrong about these specific figures. I'm not wrong, though, that none of these are “market rate” for an experienced data scientist in the DC MSA.

Currently, all of my teaching happens through hands-on internships and project-based learning at my company, where I know the students (i.e. my colleagues, coworkers, subcontractors and partners) are motivated and I know they have sufficient resources to succeed (including hardware). When I “teach,” I typically do it for free, and I try hard to avoid organizations that create asymmetrical relationships with their instructors or sell instructor time as their primary “product” at a steep discount to the instructor compensation. Though polemic, Mike Selik summarized the same issue of cut rate data science in "The End of Kaggle." I'd love to hear of a good model where students could really get the three practical prerequisites for Deep Learning and how I could help make that happen here in DC2 short of making “teaching” my primary vocation. If there's a viable model for that out there, please let me know. If you still think you'd like to learn more about Deep Learning through DC2, please help us understand what you'd want out of it and whether you'd be able to bring your own hardware.

A Rush of Ideas: Kalev Leetaru at Data Science DC

This review of the April Data Science DC Meetup was written by Ross Mohan. Ross is a solutions architect for Five 9 Group.

Perhaps you’ve heard the phrase “software is eating the world” lately. Well, to be successful at that, it’s going to have to do at least as good a job of eating the world’s data as do the systems of Kalev Leetaru, Georgetown/Yahoo! fellow.

Kalev Leetaru, lead investigator on GDELT and other tools, defines “world class” work, certainly in the sense of size and scope of data. The goal of GDELT and related systems is to stream global news and social media in as near real time as possible through multiple steps. The overall goal is to arrive at reliable tone (sentiment) mining and differential conflict detection and to do so... globally. It is a grand goal.

Kalev Leetaru’s talk covered several broad areas: history of data and communication, data quality and “gotcha” issues in data sourcing and curation, geography of Twitter, processing architecture, toolkits and considerations, and data formatting observations. In each he had a fresh perspective or a novel idea, born of the requirement to handle enormous quantities of ‘noisy’ or ‘dirty’ data.

Perspectives

Leetaru observed that “the map is not the territory” in the sense that actual voting, resource or policy boundaries as measured by various data sources may not match assigned national boundaries. He flagged this as a question of “spatial error bars” for maps.

Distinguishing global data science from hard, established HPC-like pursuits (such as computational chemistry), Kalev Leetaru observed that we make our own bespoke toolkits, and that there is no single “magic toolkit” for Big Data, so we should be prepared and willing to spend time putting our toolchain together.

After talking a bit about the historical evolution and growth of data, Kalev Leetaru asked a few perspective-changing questions (some clearly relevant to intelligence agency needs): How to find all protests? How to locate all law books? Some of the more interesting data curation tools and resources Kalev Leetaru mentioned, and a lot more, might be found by the interested reader in The Oxford Guide to Library Research by Thomas Mann.

GDELT (covered further below) labels parse trees with error rates, and reaches beyond the “WHAT” of simple news media to tell us “WHY” and “how reliable.” One GDELT output product among many is the Daily Global Conflict Report, which covers world leader emotional state and differential change in conflict, not absolute markers.

One recurring theme was to find ways to define and support “truth.” Kalev Leetaru decried one current trend in Big Data, the so-called “Apple Effect”: making luscious pictures from data, with more focus on appearance than actual ground truth. One example he cited was a conclusion from a recent report on Syria, which, blithely based on geotagged English-language tweets and Facebook postings, cast a skewed light on Syria’s rebels. (Bzzzzzt!)

Twitter

Leetaru provided one answer on “how to ‘ground truth’ data” by asking “how accurate are geotagged tweets?” Such tweets are after all only 3% of the total. But he reliably used those tweets. How? By correlating location with electric power availability (r = .89). He also talked about how to handle emoticons, irony, sarcasm, and other affective language, cautioning analysts to think beyond blindly plugging data into pictures.
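
For readers who want to see what a figure like r = .89 measures, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The talk recap does not say exactly which quantities were compared, so the per-region tweet counts and electrification scores below are invented purely for illustration and are not GDELT or Twitter data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented numbers: geotagged tweet counts and an electrification
# score (0 to 1) for six hypothetical regions.
tweets_per_region = [120, 45, 300, 80, 10, 210]
electrification = [0.9, 0.5, 1.0, 0.7, 0.2, 0.95]

print(f"r = {pearson_r(tweets_per_region, electrification):.2f}")
```

A value near 1 means the two measures rise and fall together across regions; it says nothing on its own about why.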

Kalev Leetaru talked engagingly about the geography of Twitter, encouraging us to do more RTFD (D=data) than RTFM. Cut your own way through the forest. The valid maps have not been made yet, so be prepared to make your own. Some of the challenges he cited were how to break up typical #hashtagswithnowhitespace and put them back into sentences, and how to build (and maintain) sentiment/tone dictionaries. Expect, therefore, to spend the vast majority of time on innovative projects in human tuning of the algorithms and understanding the data, and then iterating the machine. Refreshingly “hands on.”
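
One common way to approach the hashtag problem is dictionary-based segmentation with dynamic programming. The sketch below is not how GDELT does it (the talk recap gives no implementation details); the function name `segment_hashtag` and the tiny vocabulary are hypothetical, and a real system would need a large word list and some way to rank competing splits.

```python
def segment_hashtag(tag, vocabulary):
    """Split a hashtag into known words, or return None if no full split exists.

    best[i] holds a segmentation of text[:i] using the fewest words found
    so far, or None if text[:i] cannot be built from vocabulary words.
    """
    text = tag.lstrip("#").lower()
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


# Toy vocabulary, purely illustrative. It deliberately contains words
# ("now", "hit", "hash", "tags") that create competing partial splits.
VOCAB = {"hash", "tags", "hashtags", "with", "no", "now", "hit",
         "white", "space", "whitespace"}

print(segment_hashtag("#hashtagswithnowhitespace", VOCAB))
# ['hashtags', 'with', 'no', 'whitespace']
```

Preferring the fewest words is a crude tie-breaker; it is exactly the kind of heuristic that needs the human tuning described above once real, messy hashtags show up.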

Scale and Tech Architecture

Kalev Leetaru turned to discuss the scale of data, which is now easily being generated in the petabytes-per-day range. There is no longer any question that automation must be used and that serious machinery will be involved. Our job is to get that automation machinery doing the right thing, and if we do so, we can measure the ‘heartbeat of society.’

For a book images project (60 million images across hundreds of years) he mentioned a number of tools and file systems (but neither Gluster nor Ceph, disappointingly to this reviewer!) and delved deeply and masterfully into the question of how to clean and manage the very dirty data of “closed captioning” found in news reports. To full-text geocode and analyze half a million hours of news (from the Internet Archive), we need fast language detection and captioning error assessment. What makes this task horrifically difficult is that POS tagging “fails catastrophically on closed captioning” and that CC is worse, far worse, in terms of quality than Optical Character Recognition. The standard Stanford NL Understanding toolkit is very “fragile” in this domain, one reason being that news media has an extremely high density of location references, forcing the analyst into using context to disambiguate.

He covered his GDELT (Global Database of Events, Language, and Tone) project, which captures human/societal behavior and beliefs at scale around the world. A system of half a billion plus georeferenced rows, 58 columns wide, comprising 100,000 sources such as broadcast, print, and online media back to 1979, it relies on both human translation and Google Translate, and will soon be extended across languages and back to the 1800s. Further, he’s incorporating 21 billion words of academic literature into this model (a first!) and expects availability in Summer 2014. (Sources include JSTOR, DTIC, CIA, CORE, CiteSeerX, and IA.)

GDELT’s architecture, which relies heavily on the Google Cloud and BigQuery, can stream at 100,000 input observations/second. This reviewer wanted to ask him about update and delete needs and speeds, but the stream is designed to optimize ingest and query. GDELT tools were myriad, but Perl was frequently mentioned (for text processing).

Kalev Leetaru shared some post-GDELT-construction takeaways: “it’s not all English” and “watch out for full Unicode compliance” in your toolset, lest your lovely data processing stack SEGFAULT halfway through a load. Store data in whatever is easy to maintain and fast. Modularity is good but performance can be an issue; watch out for XML, which bogs down processing on highly nested data. Use it for interchange more than anything. Sharing seems “nice” but “you can’t share a graph,” and “RAM disk is your friend,” more so even than SSD, FusionIO, or fast SANs.

The talk, like this blog post, ran over allotted space and time, but the talk was well worth the effort spent understanding it.

Deep Learning Inspires Deep Thinking

This is a guest post by Mary Galvin, founder and managing principal at AIC. Mary provides technical consulting services to clients including LexisNexis’ HPCC Systems team. The HPCC is an open source, massive parallel-processing computing platform that solves Big Data problems. 

Data Science DC hosted a packed house at the Artisphere on Monday evening, thanks to the efforts of organizers Harlan Harris, Sean Gonzalez, and several others who helped plan and coordinate the event. Michael Burke, Jr, Arlington County Business Development Manager, provided opening remarks and emphasized Arlington’s commitment to serving local innovators and entrepreneurs. Michael subsequently introduced Sanju Bansal, a former MicroStrategy founder and executive who presently serves as the CEO of an emerging, Arlington-based start-up, Hunch Analytics. Sanju energized the audience by providing concrete examples of data science’s applicability to business; this was no better illustrated than by the $930 million acquisition of Climate Corp. roughly 6 months ago.

Michael, Sanju, and the rest of the Data Science DC team helped set the stage for a phenomenal presentation put on by John Kaufhold, Managing Partner and Data Scientist at Deep Learning Analytics. John started his presentation by asking the audience for a show of hands on two items: 1) whether anyone was familiar with deep learning, and 2) of those who said yes to #1, whether they could explain what deep learning meant to a fellow data scientist. Of the roughly 240 attendees present, the majority of hands that answered favorably to question #1 dropped significantly upon John’s prompting of question #2.

I’ll be the first to admit that I was unable to raise my hand for either of John’s introductory questions. The fact I was at least a bit knowledgeable in the broader machine learning topic helped to somewhat put my mind at ease, thanks to prior experiences working with statistical machine translation, entity extraction, and entity resolution engines. That said, I still entered John’s talk fully prepared to brace myself for the ‘deep’ learning curve that lay ahead. Although I’m still trying to decompress from everything that was covered – it being less than a week since the event took place – I’d summarize key takeaways from the densely-packed, intellectually stimulating, 70+ minute session that ensued as follows:

  1. Machine learning’s dirty work: labelling and feature engineering. John introduced his topic by using examples from image and speech recognition to illustrate two mandatory (and often less-than-desirable) undertakings in machine learning: labelling and feature engineering. In the case specific to image recognition, say you wanted to determine whether a photo, ‘x’, contained an image of a cat, ‘y’ (i.e., p(y|x)). This would typically involve taking a sizable database of images and manually labelling which subset of those images were cats. The human-labeled images would then serve as a body of knowledge upon which features representative of those cats would be generated, as required by the feature engineering step in the machine learning process. John emphasized the laborious, expensive, and mundane nature of feature engineering, using his own experiences in medical imaging to prove his point.

    That said, various machine learning algorithms could use the fruits of the labelling and feature engineering labors to discern a cat within any photo, not just those cats previously observed by the system. Although there’s no getting around machine learning’s dirty work to achieve these results, the emergence of deep learning has helped to lessen it.

  2. Machine Learning’s ‘Deep’ Bench. I entered John’s presentation knowing a handful of machine learning algorithms but left realizing my knowledge had barely scratched the surface. Cornell University’s machine learning benchmarking tests can serve as a good reference point for determining which algorithm to use, provided the results are taken into account with the wider, ‘No Free Lunch Theorem’ consideration that even the ‘best’ algorithm has the potential to perform poorly on a subclass of problems.

    Given machine learning’s ‘deep’ bench, the neural network might have been easy to overlook just 10 years ago. Not only did it place 10th in Cornell’s 2004 benchmarking test, but John enlightened us to its fair share of limitations: inability to learn p(x), inefficiencies with more than 3 layers, overfitting, and relatively slow performance.

  3. The Restricted Boltzmann Machine’s (RBM’s) revival of the neural network. The year 2006 witnessed a breakthrough in machine learning, thanks to the efforts of an academic triumvirate consisting of Geoff Hinton, Yann LeCun, and Yoshua Bengio. I’m not going to even pretend like I understand the details, but will just say that their application of the Restricted Boltzmann Machine (RBM) to neural networks has played a major role in eradicating the neural network’s limitations outlined in #2 above. Take, for example, ‘inability to learn p(x)’. Going back to the cat example in #1, what this essentially states is that before the triumvirate’s discovery, the neural net was incapable of using an existing set of cat images to draw a new image of a cat. Figuratively speaking, not only can neural nets now draw cats, but they can do so with impressive time metrics thanks to the emergence of the GPU. Stanford, for example, was able to process 14 terabytes of images in just 3 hours through overlaying deep learning algorithms on top of a GPU-centric computer architecture. What’s even better? The fact that many implementations of the deep learning algorithm are openly available under the BSD licensing agreement.

  4. Deep learning’s astonishing results. Deep learning has experienced an explosive amount of success in a relatively small amount of time. Not only have several international image recognition contests been recently won by those who used deep learning, but technology powerhouses such as Google, Facebook, and Netflix are investing heavily in the algorithm’s adoption. For example, deep learning triumvirate member Geoff Hinton was hired by Google in 2013 to help the company make sense of their massive amounts of data and to optimize existing products that use machine learning techniques. Fellow deep learning triumvirate member Yann LeCun was hired by Facebook, also in 2013, to help integrate deep learning technologies into the company’s IT systems.

As for all the hype surrounding deep learning, John concluded his presentation by suggesting ‘cautious optimism in results, without reckless assertions about the future’. Although it would be careless to claim that deep learning has cured disease, for example, one thing most certainly is for sure: deep learning has inspired deep thinking throughout the DC metropolitan area.

As to where deep learning has left our furry feline friends, the attached YouTube video will further explain….

(created by an anonymous audience member following the presentation)

You can see John Kaufhold's slides from this event here.

Will big data bring a return of sampling statistics? And a review of Aaron Strauss's talk at DSDC

This guest post by Tommy Jones was originally published on Biased Estimates. Tommy is a statistician or data scientist -- depending on the context -- in Washington, DC. He is a graduate of Georgetown's MS program for mathematics and statistics. Follow him on Twitter @thos_jones.

Some Background

What is sampling statistics?

Sampling statistics concerns the planning, collection, and analysis of survey data. When most people take a statistics course, they are learning "model-based" statistics. (Model-based statistics is not the same as statistical modeling, stick with me here.) Model-based statistics uses a mathematical function to model the distribution of an infinitely-sized population to quantify uncertainty. Sampling statistics, however, uses a priori knowledge of the size of the target population to inform quantifying uncertainty. The big lesson I learned after taking survey sampling is that if you assume the correct model, then the two statistical philosophies agree. But if your assumed model is wrong, the two approaches give different results. (And one approach has fewer assumptions, bee tee dubs.)
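
To make the role of a known population size concrete, here is a minimal Python sketch (not from the talk or the original post) of the finite population correction for a simple random sample drawn without replacement. The function name and the numbers are illustrative assumptions; the only point is that the design-based standard error shrinks as the sample covers more of a finite population, while the infinite-population formula ignores the population size entirely.

```python
import math


def srs_standard_error(sample_sd, n, population_size=None):
    """Standard error of a sample mean under simple random sampling.

    With population_size=None this is the usual infinite-population
    formula sd / sqrt(n). With a known population size N it applies the
    finite population correction sqrt(1 - n / N).
    """
    se = sample_sd / math.sqrt(n)
    if population_size is None:
        return se
    if n > population_size:
        raise ValueError("sample cannot be larger than the population")
    return se * math.sqrt(1.0 - n / population_size)


# Hypothetical survey: 400 respondents, sample standard deviation of 10.
print(srs_standard_error(10.0, 400))          # population treated as infinite
print(srs_standard_error(10.0, 400, 1000))    # population of 1,000 people
print(srs_standard_error(10.0, 400, 400))     # a full census has no sampling error
```

When the population is huge relative to the sample, the correction is close to 1 and the two answers agree, which is one small instance of the two philosophies coinciding.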
Sampling statistics also has a big bag of other tricks, too many to do justice here. But it provides frameworks for handling missing or biased data, combining data on subpopulations whose sample proportions differ from their proportions of the population, how to sample when subpopulations have very different statistical characteristics, etc.
As I write this, it is entirely possible to earn a PhD in statistics and not take a single course in sampling or survey statistics. Many federal agencies hire statisticians and then send them immediately back to school to places like UMD's Joint Program in Survey Methodology. (The federal government conducts a LOT of surveys.)
I can't claim to be certain, but I think that sampling statistics became esoteric for two reasons. First, surveys (and data collection in general) have traditionally been expensive. Until recently, there weren't many organizations except for the government that had the budget to conduct surveys properly and regularly. (Obviously, there are exceptions.) Second, model-based statistics tend to work well and have broad applicability. You can do a lot with a laptop, a .csv file, and the right education. My guess is that these two factors have meant that the vast majority of statisticians and statistician-like researchers have become consumers of data sets, rather than producers. In an age of "big data" this seems to be changing, however.

Much ado about response rates

Response rates for surveys have been dropping for years, causing frustration among statisticians and skepticism from the public. Having a lower response rate doesn't just mean your confidence intervals get wider. Given the nature of many surveys, it's possible (if not likely) that the probability a person responds to the survey may be related to one or a combination of relevant variables. If unaddressed, such non-response can damage an analysis. Addressing the problem drives up the cost of a survey, however.
Consider measuring unemployment. A person is considered unemployed if they don't have a job and they are looking for one. Somebody who loses their job may be less likely to respond to the unemployment survey for a variety of reasons. They may be embarrassed, they may move back home, they may have lost their house! But if the government sends a survey or interviewer and doesn't hear back, how will it know if the respondent is employed, unemployed (and looking), or off the job market completely? So, they have to find out. Time spent tracking a respondent down is expensive!
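
A back-of-the-envelope sketch in Python shows how much damage this kind of non-response can do. Every number here is made up for illustration (the population size, the true rate, and both response probabilities), and the correction step assumes the response probabilities are known, when in practice they would have to be estimated.

```python
def naive_and_weighted_rates(population, true_rate, p_resp_unemployed, p_resp_employed):
    """Compare a naive unemployment estimate with an inverse-probability-weighted one.

    Works with expected respondent counts rather than random draws, so the
    result is deterministic.
    """
    unemployed = population * true_rate
    employed = population - unemployed

    # Expected number of people in each group who actually answer the survey.
    resp_unemployed = unemployed * p_resp_unemployed
    resp_employed = employed * p_resp_employed

    # Naive estimate: treat respondents as if they were the whole population.
    naive = resp_unemployed / (resp_unemployed + resp_employed)

    # Weighted estimate: each respondent stands in for 1 / p people like them.
    w_unemployed = resp_unemployed / p_resp_unemployed
    w_employed = resp_employed / p_resp_employed
    weighted = w_unemployed / (w_unemployed + w_employed)
    return naive, weighted


# Hypothetical: 8% true unemployment, and the unemployed respond half as often.
naive, weighted = naive_and_weighted_rates(100_000, 0.08, 0.3, 0.6)
print(f"naive: {naive:.3f}, weighted: {weighted:.3f}")
```

Under these assumed response rates the naive estimate comes out at roughly half the true rate, and no amount of extra sample size fixes it.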
So, if you are collecting data that requires a response, you must consider who isn't responding and why. Many people anecdotally chalk this effect up to survey fatigue. Aren't we all tired of being bombarded by websites and emails asking us for "just a couple minutes" of our time? (Businesses that send a satisfaction survey every time a customer contacts customer service take note; you may be your own worst data-collection enemy.)

In Practice: Political Polling in 2012 and Beyond

In context of the above, Aaron Strauss's February 25th talk at DSDC was enlightening. Aaron's presentation was billed as covering "two things that people in [Washington D.C.] absolutely love. One of those things is political campaigns. The other thing is using data to estimate causal effects in subgroups of controlled experiments!" Woooooo! Controlled experiments! Causal effects! Subgroup analysis! Be still, my beating heart.
Aaron earned a PhD in political science from Princeton and has been involved in three of the last four presidential campaigns designing surveys, analyzing collected data, and providing actionable insights for the Democratic party. His blog is here. (For the record, I am strictly non-partisan and do not endorse anyone's politics though I will get in knife fights over statistical practices.)

In an hour-long presentation, Aaron laid a foundation for sampling and polling in the 21st century, revealing how political campaigns and businesses track our data and analyze it, and what the future of surveying may be. The most profound insight I got was seeing how the traditional practices of sampling statistics are being blended with 21st century data collection methods, through apps and social media. Whether these changes will address the decline in response rates or only temporarily offset it remains to be seen.

Some highlights:

  • The number of households that have only wireless telephone service is reaching parity with the number having land line phone service. When considering only households with children (excluding older people with grown children and young adults without children) the number sits at 45 percent.
  • Offering small savings on wireless bills may incentivize the taking of flash polls through smart phones.
  • Reducing the marginal cost of surveys allows political pollsters to design randomized controlled trials, to evaluate the efficacy of different campaign messages on voting outcomes. (As with all things statistics, there are tradeoffs and confounding variables with such approaches.)
  • Pollsters would love to get access to all of your Facebook data.

Sampling Statistics and "Big Data"

Today, businesses and other organizations are tracking people at unprecedented levels. One rationale for big data being a "revolution" is that for the first time organizations have access to the full population of interest. For example, Amazon can track the purchasing history of 100% of its customers.
I would challenge the above argument, but won't outright disagree with it. Your current customer base may or may not be your full population of interest. You may, for example, be interested in people who don't purchase your product. You may wish to analyze a sample of your market to figure out who isn't purchasing from you and why. You may have access to some data on the whole population, but you may not have all the variables you want.
More importantly, sampling statistics has tools that may allow organizations to design tracking schemes to gather the data most relevant to their questions of interest. To quote R.A. Fisher, "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: He may be able to say what the experiment died of." The world (especially the social-science world) is not static; priorities and people's behavior are sure to change.
Data fusion, the process of pulling together data from heterogeneous sources into one analysis, is not a survey. But these sources may represent observations and variables in proportions or frequencies differing from the target population. Combining data from these sources with a simple merge may result in biased analyses. Sampling statistics has methods of using sample weights to combine strata of a stratified sample where some strata may be over- or under-sampled (and there are reasons to do this intentionally).
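
As a rough illustration of why a simple merge can mislead, here is a small Python sketch comparing a pooled mean with a stratum-weighted mean. The strata, population shares, and values are invented for the example, and the function names are my own; this is the textbook stratified estimator under the assumption that each stratum's share of the population is known.

```python
def pooled_mean(strata):
    """Mean of all observations lumped together, ignoring how they were sampled."""
    values = [v for _, sample in strata for v in sample]
    return sum(values) / len(values)


def stratified_mean(strata):
    """Weight each stratum's sample mean by its known share of the population."""
    total_share = sum(share for share, _ in strata)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(share * (sum(sample) / len(sample)) for share, sample in strata)


# Hypothetical data: (population share, observed values).
# The small second stratum is deliberately over-sampled.
strata = [
    (0.9, [10.0, 12.0]),
    (0.1, [30.0, 32.0, 28.0, 30.0]),
]

print(pooled_mean(strata))      # dominated by the over-sampled stratum
print(stratified_mean(strata))  # re-weighted to match the population
```

The pooled number is pulled toward the over-sampled group, while the weighted number reflects the population mix; the same logic applies when fusing sources that cover different slices of a population at different rates.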

I am not proposing that sampling statistics will become the new hottest thing. But I would not be surprised if sampling moves from the esoteric fringes to being a core course in many or most statistics graduate programs in the coming decades. (And we know it may take over a hundred years for something to become the new hotness anyway.)

The professor who taught the sampling statistics course that I took a few years ago is the chief of the Statistical Research Division at the U.S. Census Bureau. When I last saw him at an alumni/prospective student mixer for Georgetown's math/stat program in 2013, he was wearing a button that said "ask me about big data." In a time when some think that statistics is the old school discipline only relevant for small data, seeing this button on a man whose field is considered so "old school" that even most statisticians have moved on made me chuckle. But it also made me think; things may be coming full circle for sampling statistics.

Links for further reading

A statistician's role in big data (my source for the R.A. Fisher quote, above)

Ensemble Learning Reading List

Tuesday's Data Science DC Meetup features GMU graduate student Jay Hyer's introduction of Ensemble Learning, a core set of Machine Learning techniques. Here are Jay's suggestions for readings and resources related to the topic. Attend the Meetup, and follow Jay on Twitter at @aDataHead! Also note that all images contain Amazon Affiliate links and will result in DC2 getting a small percentage of the proceeds should you purchase the book. Thanks for the support!

L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, Boca Raton, FL, 1984.

This book does not cover ensemble methods, but is the book that introduced classification and regression trees (CART), which is the basis of Random Forests. Classification trees are also the basis of the AdaBoost algorithm. CART methods are an important tool for a data scientist to have in their skill set.

L. Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.

This is the article that started it all.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd ed. Springer, New York, NY, 2009.

This book is light on application and heavy on theory. Nevertheless, chapters 10, 15 & 16 give very thorough coverage to boosting, Random Forests and ensemble learning, respectively. A free PDF version of the book is available on Tibshirani’s website.

G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, New York, NY, 2013.

As the name and co-authors imply, this is an introductory version of the previous book in this list. Chapter 8 covers bagging, Random Forests and boosting.

Y. Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

This is the article that introduced the AdaBoost algorithm.

G. Seni, and J. Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool Publishers, USA, 2010.

This is a good book with great illustrations and graphs. There is a lot of R code too!

Z.H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 2012.

This is an excellent book that covers ensemble learning from A-Z and is well suited for anyone from an eager beginner to a critical expert.

SynGlyphX: Hello and Thank You DC2!

The following is a sponsored post brought to you by one of the supporters of two of Data Community's five meetups.

Hello and Thank You DC2!

This week was my, and my company’s, introduction to Data Community DC (DC2).  We could not have asked for a more welcoming reception.  We attended and sponsored both Tuesday’s DVDC event on Data Journalism and Thursday’s DSDC event on GeoSpatial Data Analysis.  They were both pretty exciting, and timely, events for us.

As I mentioned, I’m new to DC2 and new to the “data as a science” community.  Don’t get me wrong, while I’m new to DC2 I’ve been awash in data my entire career.  I started as a young consultant reconciling discrepancies in the databases of a very early Client-Server implementation.  Basically, I had to make sure that all the big department store orders on the server were in sync with the home delivery client application.  It took a lot of manual reconciling, which ultimately led to me writing code to semi-automatically reconcile the two databases.  Eventually (I think) they solved the technical issues that led to the Client-Server databases being out of sync.

More recently, I was working for a company with a growing professional services organization.  The company typically hired new employees after a contract was signed, but the new professional services work involved short project durations.  If we waited to hire, the project would be over before someone started.  We developed a probability-adjusted portfolio analysis approach to compare the supply of available resources (which is always changing as people finish projects, get extended, or leave the organization) with demand (which is always changing as well).  That enabled us to determine a range of positions and skillsets to hire for in a defined timeframe.

In both instances, it was data science that drove effective decision making.  Sure, you can apply some “gut” to any decision, but having some data science behind you makes the case much stronger.

So, I was fascinated to listen to the journalists discuss how they are applying data analysis to help:  1) support existing story lines; and 2) develop new story lines.  Nathan’s presentation on analyzing AIS data was interesting (and a bit timely, as we had just gotten a verbal win with a client to do similar, though not identical, work).

I know the power of data to solve complex business, operational, and other problems.  With our new company, SynGlyphX, we are focused on helping people both visualize and interact with their data.  We live in a world with sight and three dimensions.  We believe that by visualizing the data (unstructured, filtered, analyzed, any kind of data), we can help people leverage the power of the brain to identify patterns, spot trends, and detect anomalies.  We joined DC2 to get to know folks in the community, generate some awareness for our company, and to get your feedback on what we are doing.  Thank you all for welcoming us and our company, SynGlyphX, to the community.  We appreciated everyone’s interest in the demonstrations of our interactive visualization technology.  Our website traffic was up significantly last week, so I am hoping this is a sign that you were interested in learning more about us.  Additionally, I have heard from a number of you since the events, and welcome hearing from more.

Here’s my call to action: I encourage you to tweet us your answer to the following question:  “Why do you find it helpful to visually interact with your data?”

See you at upcoming events.

Mark Sloan

About the Author:

As CEO of SynGlyphX, Mark brings over two decades of experience.  Mark began his career at Accenture, co-founded the global consulting firm RTM Consulting, and served as Vice President and General Manager of Convergys’ Consulting and Professional Services Group.

Mark has a M.B.A. from The Wharton School of the University of Pennsylvania, and a B.S. in Civil Engineering from the University of Notre Dame. He is a frequent speaker at industry events and has served as an Advisory Board Member for the Technology Professional Services Association (now Technology Services Industry Association (TSIA)).

November Data Science DC Event Review: Identifying Smugglers: Local Outlier Detection in Big Geospatial Data

This is a guest post from Data Science DC Member and quantitative political scientist David J. Elkind. At the November Data Science DC Meetup, Nathan Danneman, Emory University PhD and analytics engineer at Data Tactics, presented an approach to detecting unusual units within a geospatial data set. For me, the most enjoyable feature of Dr. Danneman’s talk was his engaging presentation. I suspect that other data consultants have also spent quite some time reading statistical articles and lost quite a few hours attempting to trace back the authors’ incoherent prose. Nathan approached his talk in a way that placed a minimal quantitative demand on the audience, instead focusing on the three essential components of his analysis: his analytical task, the outline of his approach, and the presentation of his findings. I’ll address each of these in turn.

Analytical Task

Nathan was presented with the problem of locating maritime vessels in the Strait of Hormuz engaged in smuggling activities: sanctions against Iran have made it very difficult for Iran to engage in international commerce, so improving detection of smugglers crossing the Strait from Iran to Qatar and the United Arab Emirates would improve the effectiveness of the sanctions regime and increase pressure on the regime. (I’ve written about issues related to Iranian sanctions for CSIS’s Project on Nuclear Issues Blog.)

Having collected publicly accessible satellite positioning data of maritime vessels, Nathan had four fields for each craft at several time intervals within some period: speed, heading, latitude and longitude.

But what do smugglers look like? Unfortunately, Nathan’s data set did not itself include any examples of watercraft which had been unambiguously identified by, e.g., the US Navy, as smugglers, so he could not rely on historical examples of smuggling as a starting point for his analysis. Instead, he had to puzzle out how to leverage information about a craft’s spatial location.

I’ve encountered a few applied quantitative researchers who, when faced with a lack of historical examples, would be entirely stymied in their progress, declaring the problem too hard. Instead of throwing up his hands, Dr. Danneman dug into the topic of maritime smuggling and found that many smuggling scenarios involve ship-to-ship transfers of contraband which take place outside of ordinary shipping lanes. This qualitatively-informed understanding transforms the project from mere speculation about what smugglers might look like into the problem of discovering maritime vessels which deviate too far from ordinary traffic patterns.

Importantly, framing the research in this way entirely premises the validity of inferences on the notion that unusual ships are smugglers and smugglers are unusual ships. But in reality, there are many reasons that ships might not conform to ordinary traffic patterns – for example, pleasure craft and fishing boats might have irregular movement patterns that don’t coincide with shipping lanes, and so look similar to the hypothesized smugglers.

Outline of Approach

The basic approach can be split into three sections: partitioning the strait into many grids, generating fake boats to compare against the real boats, and then training a logistic regression to use the four data fields (speed, heading, latitude and longitude) to differentiate the real boats from the fake ones.

Partitioning the strait into grids helps emphasize the local character of ships’ movements in that region. For example, a grid square partially containing a shipping channel will have many ships located in the channel and on a heading taking them along that channel. Fake boats generated with a bivariate uniform distribution over the grid square will tend not to fall in the path of the ordinary traffic channel, just like the hypothesized behavior of smugglers. The same goes for the uniformly-distributed timestamps and otherwise randomly-assigned boat attributes for the comparison sets: these will all tend to stand apart from ordinary traffic. Therefore, training a model to differentiate between these two classes of behaviors will advance the goal of differentiating between smugglers and ordinary traffic.

Dr. Danneman described this procedure as unsupervised-as-supervised learning – a novel term for me, so forgive me if I’m loose with the particulars – but in this case it refers to the notion that there are two classes of data points, one i.i.d. from some unknown density and another simulated via Monte Carlo methods from some known density. Pooling both samples gives one a mixture of the two densities; the problem then becomes one of comparing the relative densities of the two classes of data points – that is, this problem is actually a restatement of the problem of logistic regression! Additional details can be found in Elements of Statistical Learning (2nd edition, section 14.2.4, p. 495).
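
For readers who, like me, find this easier to see in code, here is a toy sketch of the idea in plain Python. It is emphatically not Dr. Danneman's code: the grid square, the simulated "lane", the single off-lane boat, the quadratic position features, and the hand-rolled gradient descent are all assumptions I made to keep the example self-contained, and speed and heading are left out entirely.

```python
import math
import random


def features(x, y):
    """Intercept plus quadratic features of a position in a grid square scaled to [-1, 1]^2."""
    return [1.0, x, y, x * x, y * y]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x, y):
    """Fitted probability that a position belongs to the 'real' class."""
    return sigmoid(sum(wi * fi for wi, fi in zip(w, features(x, y))))


def fit_logistic(rows, labels, lr=1.0, steps=500):
    """Logistic regression fit by plain batch gradient descent on the mean log loss."""
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, label in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row))) - label
            for j, xj in enumerate(row):
                grad[j] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


rng = random.Random(0)

# Hypothetical "real" boats: spread along an east-west lane near y = 0,
# plus one boat sitting far outside the lane.
real = [(rng.uniform(-1, 1), rng.gauss(0, 0.1)) for _ in range(200)]
real.append((0.2, 0.9))

# One fake boat per real boat, uniform over the whole grid square.
fake = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in real]

rows = [features(x, y) for x, y in real + fake]
labels = [1.0] * len(real) + [0.0] * len(fake)
w = fit_logistic(rows, labels)

# Real boats that the model thinks look fake are the candidate outliers.
p_real = [predict(w, x, y) for x, y in real]
most_unusual = min(range(len(real)), key=p_real.__getitem__)
print(real[most_unusual], p_real[most_unusual])
```

The classifier never sees a labeled smuggler; it only learns where real traffic is denser than uniform noise, and the real boats with the lowest fitted probability are the ones flagged for a closer look.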

Presentation of Findings

After fitting the model, we can examine which of the real boats the model rated as having low odds of being real – that is, boats which looked so similar to the randomly-generated boats that the model had difficulty differentiating the two. These are the boats that we might call “outliers,” and, given the premise that ship-to-ship smuggling likely takes place aboard boats with unusual behavior, are more likely to be engaged in smuggling.

I will repeat here a slight criticism that I noted elsewhere and point out that the model output cannot be interpreted as a true probability, contrary to the results displayed in slide 39. In this research design, Dr. Danneman did not randomly sample from the population of all shipping traffic in the Strait of Hormuz to assemble a collection of smuggling craft and ordinary traffic in proportion roughly equal to their occurrence in nature. Rather, he generated one fake boat for each real boat. This is a case-control research design, so the intercept term of the logistic regression model is fixed to reflect the ratio of positive cases to negative cases in the data set. All of the terms in the model, including the intercept, are still maximum likelihood estimates, and all of the non-intercept terms are perfectly valid for comparing the odds of an observation being in one class or another. But to establish probabilities, one would have to adjust the intercept term using knowledge of the overall ratio of positives to negatives in the population.
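
To show what such an intercept adjustment looks like, here is a short Python sketch of the standard prior correction for case-control samples. This is my own illustration rather than anything from the talk, and the 1% population fraction in the example is a made-up number; in practice that fraction is exactly the piece of outside knowledge one would need.

```python
import math


def prior_corrected_probability(model_logit, sample_pos_fraction, population_pos_fraction):
    """Shift a case-control logit so it reflects the population class balance.

    model_logit is the linear predictor from a logistic regression fit on a
    sample in which positives make up sample_pos_fraction of the rows.
    The slopes are left alone; only the intercept is shifted.
    """
    sample_log_odds = math.log(sample_pos_fraction / (1.0 - sample_pos_fraction))
    population_log_odds = math.log(population_pos_fraction / (1.0 - population_pos_fraction))
    corrected = model_logit - sample_log_odds + population_log_odds
    return 1.0 / (1.0 + math.exp(-corrected))


# Hypothetical: the model says 50/50 on a balanced training set, but
# positives are assumed to be only 1% of the population.
print(prior_corrected_probability(0.0, 0.5, 0.01))
```

A boat the balanced model calls a coin flip ends up with a probability equal to the assumed population rate, which is why the raw model output is better read as a ranking than as a probability.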

In the question-and-answer session, some in the audience pushed back against the limited data set, noting that one could improve the results by incorporating other information specific to each of the ships (its flag, its shipping line, the type of craft, or other pieces of information). First, I believe that all applications would leverage this information – were it available – and model it appropriately; however, as befits a pedagogical talk on geospatial outlier detection, this talk focused on leveraging geospatial data for outlier detection.

Second, it should be intuitive that including more information in a model might improve the results: the more we know about the boats, the more we can differentiate between them. Collecting more data is, perhaps, the lowest-hanging fruit of model improvement. I think it’s more worthwhile to note that Nathan’s highly parsimonious model achieved very clean separation between fake and real boats despite the limited amount of information collected for each boat-time unit.

The presentation and code may be found on Nathan Danneman's web site. The audio for the presentation is also available.