
Announcing Discussion Lists! First up: Deep Learning

Data Community DC is pleased to announce a new service for the area data community: topic-specific discussion lists! In this way we hope to extend the successes of our Meetups and workshops by providing a way for groups of local people with similar interests to maintain contact and have ongoing discussions. Our first discussion list will be on the topic of Deep Learning. Below is a guest post from John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup. A while back, there was this blog post about Deep Learning. At the end, we asked readers about their interest in hands-on Deep Learning tutorials.

ELEVEN

The results are in, and the survey went to 11. As in all data science, context matters--and this eleven is decidedly less inspiring than Nigel Tufnel's eleven. That said, ten of the eleven respondents wanted a hands-on Deep Learning tutorial, and eight said they would register even if it required hardware approval or enrollment in a hardware tutorial. But interest in practical, hands-on Deep Learning workshops appears to be highly nonuniform: one respondent said they'd drive hundreds of miles to attend these workshops, yet of the 3,000+ data scientists in DC's data and analytics community--presumably far more local--only eleven responded at all.

In short, the survey was a bust.

So it’s still not clear what the area data community wants out of Deep Learning, if anything, but since April I’ve gotten plenty of questions from plenty of people about Deep Learning on everything from hardware to parameter tuning, so I know there’s more interest than what we got back on the survey. Since a lot of these questions are probably shared, a discussion list might help us figure out how we can best help the most members get started in Deep Learning.

So how about a Deep Learning discussion list? If you’re a local and want to talk about Deep Learning, sign up here:

https://groups.google.com/a/datacommunitydc.org/d/forum/deeplearning

For the record, this discussion list was Harlan's original suggestion. If you're looking to take away any rules of thumb here, a simple one is "just agree with whatever Harlan says." Tommy Jones and I will run this discussion list for now. To be clear, this list caters to the specific Deep Learning interests of data enthusiasts in the DC area. For a bigger community, there's always deeplearning.net, the Deep Learning Google+ page, and the individual mailing lists and git repos for specific Deep Learning codebases, like Caffe, pylearn2, and Torch7.

In the meantime, I was happy to see some Deep Learning interest from Christo Kirov at DC NLP's Open Mic night. And NLP data scientists need not watch Deep Learning developments from the sidelines anymore; some recent motivating results in the NLP space have been summarized in a tutorial by Richard Socher. I'm not qualified to say whether these are the kind of historic breakthroughs we've recently seen in speech recognition and object recognition, but it's worth taking a look at what's happening out there.

Natural Language Processing in Python and R

This is a guest post by Charlie Greenbacker and Tommy Jones.

Data comes in many forms. As a data scientist, you might be comfortable working with large amounts of structured data nicely organized in a database or other tabular format, but what do you do if a customer drops 10,000 unstructured text documents in your lap and asks you to analyze them?

Some estimates claim unstructured data accounts for more than 90 percent of the digital universe, much of it in the form of text. Digital publishing, social media, and other forms of electronic communication all contribute to the deluge of text data from which you might seek to derive insights and extract value. Fortunately, many tools and techniques have been developed to facilitate large-scale text analytics. Operating at the intersection of computer science, artificial intelligence, and computational linguistics, Natural Language Processing (NLP) focuses on algorithmically understanding human language.

Interested in getting started with Natural Language Processing but don't know where to begin? On July 9th, a joint meetup co-hosted by Statistical Programming DC, Data Wranglers DC, and DC NLP will feature two introductory talks on the nuts & bolts of working with NLP in Python and R.

The Python programming language is increasingly popular in the data science community for a variety of reasons, including its ease of use and the plethora of open source software libraries available for scientific computing & data analysis. Packages like SciPy, NumPy, Scikit-learn, Pandas, NetworkX, and others help Python developers perform everything from linear algebra and dimensionality reduction, to clustering data and analyzing multigraphs.
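
As a small illustration of that ecosystem (my own sketch, not material from the meetup), dimensionality reduction and clustering with NumPy and scikit-learn can be this compact:

```python
# Minimal sketch: PCA + k-means with NumPy and scikit-learn.
# Illustrative only -- the data here is random noise.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(200, 50)                      # 200 observations, 50 features
X_2d = PCA(n_components=2).fit_transform(X)      # reduce to 2 dimensions
labels = KMeans(n_clusters=3).fit_predict(X_2d)  # cluster into 3 groups
print(labels[:10])
```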

Back in the dark ages (about 10+ years ago), folks working in NLP usually maintained an assortment of homemade utility programs designed to handle many of the common tasks involved with NLP. Despite our best intentions, most of this code was lousy, brittle, and poorly documented -- hardly a good foundation upon which to build your masterpiece. Over the past several years, however, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. NLTK enables researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.

If you're already familiar with Python, the NLTK library will equip you with many powerful tools for working with text data. The O'Reilly book Natural Language Processing with Python written by Steven Bird, Ewan Klein, and Edward Loper offers an excellent overview of using NLTK for text analytics. Topics include processing raw text, tagging words, document classification, information extraction, and much more. Best of all, the entire contents of this NLTK book are freely available online under a Creative Commons license.

The Python portion of this joint meetup event will cover a handful of the NLP building blocks provided by NLTK, including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. These components will then be assembled to build a very basic document summarization program.
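
To give a flavor of what such a summarizer can look like, here is a rough, frequency-based sketch using NLTK. The function and variable names are mine, not necessarily what the talk will use, and the approach is deliberately naive:

```python
# Naive frequency-based summarizer with NLTK (illustrative sketch only).
# Requires the 'punkt' and 'stopwords' data: nltk.download('punkt'), nltk.download('stopwords')
import nltk
from nltk.corpus import stopwords

def summarize(text, num_sentences=3):
    sentences = nltk.sent_tokenize(text)
    stop = set(stopwords.words('english'))
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    freq = nltk.FreqDist(w for w in words if w not in stop)

    # Score each sentence by the summed frequency of its words
    def score(sentence):
        return sum(freq[w.lower()] for w in nltk.word_tokenize(sentence))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Return the selected sentences in their original order
    return ' '.join(s for s in sentences if s in top)
```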

Additional NLP resources in Python

- Natural Language Toolkit for Python (NLTK): http://www.nltk.org/
- Natural Language Processing with Python (book): http://oreilly.com/catalog/9780596516499/ (free online version: http://www.nltk.org/book/)
- Python Text Processing with NLTK 2.0 Cookbook (book): http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book
- Python wrapper for the Stanford CoreNLP Java library: https://pypi.python.org/pypi/corenlp
- guess_language (Python library for language identification): https://bitbucket.org/spirit/guess_language
- MITIE (new C/C++-based NER library from MIT with a Python API): https://github.com/mit-nlp/MITIE
- gensim (topic modeling library for Python): http://radimrehurek.com/gensim/

R is a programming language popular in statistics and machine learning research. R has several advantages in the ML/stat domains. R is optimized for vector operations. This simplifies programming since your code is very close to the math that you're trying to execute. R also has a huge community behind it; packages exist for just about any application you can think of. R has a close relationship with C, C++, and Fortran and there are R packages to execute Java and Python code, increasing its flexibility. Finally, the folks at CRAN are zealous about version control and compatibility, making installing R and subsequent packages a smooth experience.

However, R does have some sharp edges that become obvious when working with any non-trivially-sized linguistic data. R holds all data in your active workspace in RAM. If you are running R on a 32-bit system, you have a 4 GB limit to the RAM R can access. There are two implications of this: NLP data need to be stored in memory-efficient objects (more on that later) and (regrettably) there is still a hard limit on how much data you can work on at one time. There are packages, such as `bigmemory`, that are working to address this, but they are outside the scope of this presentation. You also need to write efficient code; the size of NLP data will punish you for inefficiencies.
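
A quick back-of-the-envelope calculation shows why memory-efficient objects matter so much. The corpus sizes below are made up for illustration, and the arithmetic is shown in Python only because it is handy as a calculator:

```python
# Why sparse document-term matrices matter (made-up corpus sizes).
docs, terms, bytes_per_cell = 50000, 100000, 8           # dense double-precision counts
dense_gb = docs * terms * bytes_per_cell / 1e9
print("Dense document-term matrix: %.0f GB" % dense_gb)  # ~40 GB, far past a 4 GB limit

nonzero_per_doc, bytes_per_entry = 100, 16                # rough cost of a sparse triplet entry
sparse_mb = docs * nonzero_per_doc * bytes_per_entry / 1e6
print("Sparse storage:             %.0f MB" % sparse_mb)  # ~80 MB
```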

When, then, should you choose R? Every person and every problem is unique, but I can offer a few suggestions:

  1. You are doing statistics/ML research and not developing software.
  2. (Similar to 1.) You are a quantitative generalist (and probably good in R already) and NLP is just another feather in your cap.

Sometimes being a data scientist is about developing and tweaking your own algorithms. Sometimes being a data scientist is about taking others' algorithms, plugging in your data, and moving on to other areas of the problem. If you are doing more of the former, R is a solid choice. If you are doing more of the latter, R isn't too bad, though I've found that my own code often runs faster than some of the pre-packaged code. Your individual mileage may vary.

The second presentation at this meetup will cover the basics of reading documents into R and creating a document term matrix, then demonstrating some basic document summarization, keyword extraction, and document clustering techniques.

Seats are filling up quickly, so RSVP here now: http://www.meetup.com/stats-prog-dc/events/177772322/

Event Recap: DC Energy and Data Summit

This is a guest post by Majid al-Dosari, a master’s student in Computational Science at George Mason University. I recently attended the first DC Energy and Data Summit organized by Potential Energy DC and co-hosted by the American Association for the Advancement of Science’s Fellowship Big Data Affinity Group. I was excited to be at a conference where two important issues of modern society meet: energy and (big) data!

There was a keynote and plenary panel. In addition, there were three breakout sessions where participants brainstormed improvements to building energy efficiency, the grid, and transportation. Many of the issues raised at the conference could be either big data or energy issues (separately). However, I’m only going to highlight points raised that deal with both energy and data.

In the keynote, Joel Gurin (NYU Governance Lab, Director of OpenData500) emphasized the benefits of open government data (which can include unexpected use cases). In the energy field, this includes data about electric power consumption, solar irradiance, and public transport. He mentioned that the private sector also has a role in publishing and adding value to existing data.

Then, in the plenary panel, Lucy Nowel (Department of Energy) brought up the costs associated with the management, transport, and analysis of big data. These costs can be measured in terms of time and energy. You can ask this question: At what point does it “cost” less to transport some amount of data physically (via a SneakerNet) than it does through some computer network?
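
As a back-of-the-envelope illustration of that question (the numbers below are my own assumptions, not figures from the talk):

```python
# SneakerNet vs. network transfer, with assumed numbers.
data_tb = 50          # terabytes to move
network_gbps = 1.0    # effective sustained throughput, gigabits per second
drive_hours = 4.0     # time to physically drive the disks across town

network_hours = (data_tb * 8 * 1000) / network_gbps / 3600
print("Network transfer: %.0f hours" % network_hours)  # ~111 hours
print("SneakerNet:       %.0f hours" % drive_hours)    # 4 hours
```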

After the panel, I attended the breakout session dealing with energy efficiency of homes and businesses. The former is the domain of Opower represented by Asher Burns-Burg, while the latter is the domain of Aquicore represented by Logan Soya. It is of interest to compare the general strategy of both companies here. Opower uses psychological methods to encourage households to reduce consumption. On the other hand, Aquicore uses business metrics to show how building managers can save money. But both are data-enabled.

Asher claims that Opower is just scratching the surface with what is possible with the use of data. He also talked about how personalization can be used to deliver more effective messages to consumers. Meanwhile, Aquicore has challenges associated with working with existing (old) metering technology in order to obtain more fine-grained data on building energy use.

In the concluding remarks, I became aware of discussions at the other breakout sessions. The most notable to me was a concern raised by the transportation session: The rebound effect can offset any gain in efficiency by an increase in consumption. Also, the grid breakout session suggested that there should be a centralized “data mart” and a way to be able to easily navigate the regulations of the energy industry.

While DC is not Houston, the unique environment of policy, entrepreneurship, and analytical talent give DC the potential to innovate in this area. Credit goes to Potential Energy DC for creating a supportive environment.

Event Recap: DSDC June Meetup

This is a guest post by Alex Evanczuk, a software engineer at FiscalNote. Hello DC2!  My name is Alex Evanczuk, and I recently joined a government data startup right here in the nation's capital that goes by the name of FiscalNote. Our mission is to make government data easily accessible, transparent, and understandable for everyone. We are a passionate group of individuals and are actively looking for other like-minded people who want to see things change. If this is you, and particularly if you are a software developer (front-end, with experience in Ruby on Rails), please reach out to me at alex@fiscalnote.com and I can put you in touch with the right people.


The topics covered by the presenters at June’s Data Science DC Meetup were varied and interesting. Subjects included spatial forecasting in uncertain environments, cell phone surveys in Africa (GeoPoll), causal inference models for improving the lives and prospects of Children and Youth (Child Trends), and several others.

I noticed a number of fascinating trends about the presentations I saw. The first was the simple and unadulterated love of numbers and their relationships to one another. Each presenter proudly explained the mathematical underpinnings of the models and assumptions used in their research, and most had slides that contained nothing more than a single formula or graph. In my brief time in academia, I've noticed that to most statisticians and mathematicians, numbers are their poetry, and this rang true at the event as well.

To most statisticians and mathematicians, numbers are their poetry.

The second was something that is perhaps well known to data researchers, but perhaps not so much to others, and that was that the advantages and influences of data science can extend into any industry. From business, to social work, to education, to healthcare, data science can find a way to improve our understanding of any field.


More important than the numbers, however, is the fact that behind every data point, integer, and graph is a human being. The human beings behind our data inspire us to use numbers, and a deep understanding of them, to develop rigorous solutions to real-world problems. The researchers presented data that told us how we might better understand emotional sentiment in developing countries, or make decisions on cancer treatments, or help children reach their boundless potential. For me, this is what data science is all about--how the appreciation of mathematics can help us improve the lives of human beings.

Missed the Meetup? You can review the audio files from the event here and access the slide deck here.

Event Recap: Tandem NSI Deal Day (Part 2)

This is the second part of a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup. Tandem NSI is a public-private partnership between Arlington Economic Development and Amplifier Ventures. According to the TNSI website, the partnership is intended to foster a vibrant technology ecosystem that combines entrepreneurs, university researchers and students, national security program managers and the supporting business community. I attended the Tandem NSI Deal Day on May 7; this post is a summary of a few discussions relevant to DC2.

In part one, I discussed the pros and cons of starting a tech business in the DC region; in this post, I'll discuss the specific barriers to entry of which entrepreneurs focused on obtaining federal contracts should be aware when operating in our region, as well as ideas for how interested members of our community can get involved.

Barriers to innovation and entrepreneurship for federal contractors

One of the first talks of the day came from SpaceX's Deputy General Counsel, David Harris. It captured in one slide an issue all small technology companies operating in the federal space face: the Federal Acquisition Regulation (FAR). Specifically, David simply counted the number of clauses in different types of contracts, including standard Collaborative Research And Development Agreements, Contract Service Level Agreement Property Licenses, SpaceX's Form LSA, and a commercial-off-the-shelf procurement contract. Each of these contracts generally runs 12 to 27 clauses. As a bottom line, he compared these to the number of clauses in a traditional FAR fixed-price contract with one cost-plus Contract Line Item Number: more than 200. In discussion, there was even a suggestion that the federal government might want to reexamine how it does business with smaller technology companies, to encourage innovators to spend time innovating rather than parsing legalese. The tacit message was that the FAR may go too far. Add to the FAR the requirements of the Defense Contract Audit Agency and sometimes months-long contracting delays, and you have created a heavy legal and accounting burden on innovators.

Peggy Styer of Blackbird also told a story about how commitment to mission and successful execution for the government can sometimes narrow the potential market for a business. A paraphrase of Peggy's story: It's good to be focused on mission, but there can be strategic conflict between commercial and government success. As an example, when they came under fire in theatre, special ops forces were once expected to carry a heavy tracking device the size of a car battery and run for their lives into the desert where a rescue team could later find and retrieve them. Blackbird miniaturized a tracking device with the same functionality, which made soldiers on foot faster and more mobile, improving survivability. The US government loved the device. But they loved it so much they asked Blackbird to sell to the US government exclusively (and not to commercialize it for competitors). This can put innovators for the government in a difficult position with a smaller market than they might have expected in the broader commercial space.

Dan Doney, Chief Innovation Officer at the Defense Intelligence Agency, described a precedent “culture” of the “man on the moon” success that is in many ways still a blueprint for how research is conducted in the federal government. Putting a man on the moon was a project of a scale and complexity only our coordinated US government could manage in the 1960s. To accomplish the mission, the government collected requirements, matched them with contractors, and systematically filled them all. That was a tremendous success. However, almost 50 years later, a slavish focus on requirements may be the problem, Dan argued. He described “so much hunger” among our local innovative entrepreneurs to solve mission-critical problems that, to exploit it, the government needs to eliminate the “friction” from the system. Eliminating that “friction,” he argued, has been shown to get enormous results faster and cheaper than traditional contracting models. He continued: “our innovation problems are communication problems,” pointing out that Broad Agency Announcements (BAAs)--how the US government often announces project needs--are terrible abstractions of the problems to be solved. The overwhelming jumble of legalese that has nothing to do with the technical work was also discussed as a barrier for technical minds: just finding the technical nugget a BAA is really asking for is an exhausting search across all the FedBizOpps announcements.

There was also a brief discussion of how contracts can become inflexible handcuffs that focus contractors on “hitting their numbers” on the tasks a PM originally thought they should solve at the time of contracting, even when, over the course of a program, it becomes clear the contractor should now be solving other, more relevant problems. In essence, contractors are asked to ask and answer relevant research questions, and research is executed through contracts, but those contracts often become counterproductively inflexible for asking and answering research questions.

What can DC2 do?

  1. I only recognized three DC2 participants at this event. With a bigger presence, we could be a more active and relevant part of the discussion on how to incentivize government to make better use of its innovative entrepreneurial resources here in the DMV.
  2. Deal Day provided a forum to hear from both successful entrepreneurs and the government side. These panels documented some strategies for how some performers successfully navigated those opportunities for their businesses. What Deal Day didn’t offer was a chance to hear from small innovative startups on what their particular needs are. Perhaps DC2 could conduct a survey of its members to inform future Tandem NSI discussions.

Event Recap: Tandem NSI Deal Day (Part 1)

This is a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup. Tandem NSI is a public-private partnership between Arlington Economic Development and Amplifier Ventures. According to the TNSI website, the partnership is intended to foster a vibrant technology ecosystem that combines entrepreneurs, university researchers and students, national security program managers and the supporting business community. I attended the Tandem NSI Deal Day on May 7; this post is a summary of a few discussions relevant to DC2.

The format of Deal Day was a collection of speakers and panel discussions from both successful entrepreneurs and government representatives from the Arlington area, including:

  • Introductions by Arlington County Board Chairperson Jay Fisette and Arlington's US House Representative, Jim Moran;
  • Current trends in mergers and acquisitions and business acquisitions for national security product startups;
  • “How to Hack the System,” a discussion with successful national security product entrepreneurs;
  • “Free Money,” in which national security agency program managers told us where they need research done by small business and how you can commercialize what you learn; and
  • “What’s on the Edge,” in which national security program managers told us where they have cutting edge opportunities for entrepreneurs that are on the edge of today’s tech, and will be the basis of tomorrow’s great startups.

There were two DC2-relevant themes from the day that I’ve distilled: the pros and cons of starting a tech business in the DC region, and the specific barriers to entry of which entrepreneurs focusing on obtaining federal contracts should be aware when operating in our region. This post will focus on the first theme; the second will be discussed in Part 2 of the recap, later this week.

Startups in the DC Metropolitan Statistical Area vs. “The Valley”

A lot of discussion focused on starting up a tech company here in the DC MSA (which includes Washington, DC; Calvert, Charles, Frederick, Montgomery and Prince George’s counties in MD; and Arlington, Fairfax, Loudoun, Prince William, and Stafford counties as well as the cities of Alexandria, Fairfax, Falls Church, Manassas and Manassas Park in VA) versus the Valley. Most of the panelists and speakers had experience starting companies in both places, and there were pros and cons to both. Here's a brief summary in no particular order.

DC MSA Startup Pros

  • Youth! According to Jay Fisette, Arlington has the highest percentage of 25-34 year olds in America.
  • Education. Money magazine called Arlington the most educated city in America.
  • Capital. The concentration of many high-end government research sponsors--the National Science Foundation, Defense Advanced Research Projects Agency, Intelligence Advanced Research Projects Agency, the Office of Naval Research, etc.--can provide early-stage, non-dilutive research investment.
  • Localized impact. Entrepreneurial aims are often US-centric, rather than global.
  • A mission-focused talent pool.
  • A high concentration of American citizens and cleared personnel.
  • Local government support. As an example, initiatives like ConnectArlington provide more secure broadband for Arlington companies.

DC MSA Startup Cons

  • Localized impact. Entrepreneurial aims are often US-centric, rather than global. (Yes, this appears on both lists!)
  • Heavy regulations. Federal Acquisition Regulations (FAR) and Defense Contract Audit Agency accounting requirements can complicate the already difficult task of starting a business.
  • Bureaucracy. It’s DC. It’s a fact.
  • Extremely complex government organization with significant personnel turnover.
  • Less experienced “product managers.”

Silicon Valley Startup Pros

  • Venture capitalists and big corporations are “throwing money at you” in the tech space.
  • Plenty of entrepreneurial breadth.
  • Plenty of talent in productization.
  • Plenty of experience in commercial projects.
  • Very liquid and competitive labor market--which is great for individual employees.
  • Aims are often global, rather than US-centric.
  • Compensation is unconstrained by government regulation.
  • Great local higher education infrastructure: Berkeley, UCSF, the National Labs, Stanford...

Silicon Valley Startup Cons

  • Very liquid and competitive labor market--which means building a loyal, talented team can be a struggle.
  • VCs and big corporation investments are unsustainably frothy.
  • Less talent in or exposure to federal contracting.
  • A smaller pool of American citizens and cleared personnel.

Check back later this week to find out what TNSI Deal Day panelists had to say about stumbling blocks to obtaining federal contracts!

Win Free eCopies of Social Media Mining with R

This is a sponsored post by Richard Heimann. Rich is Chief Data Scientist at L-3 NSS and recently published Social Media Mining with R (Packt Publishing, 2014) with co-author Nathan Danneman, also a Data Scientist at L-3 NSS Data Tactics. Nathan has been featured at recent Data Science DC and DC NLP meetups. Nathan Danneman and Richard Heimann have teamed up with DC2 to organize a giveaway of their new book, Social Media Mining with R.

Over the next two weeks, five lucky winners will each receive a digital copy of the book. Please keep reading to find out how you can be one of the winners and to learn more about Social Media Mining with R.

Overview: Social Media Mining with R

Social Media Mining with R is a concise, hands-on guide with several practical examples of social media data mining and a detailed treatise on inference and social science research that will help you in mining data in the real world.

Whether you are an undergraduate who wishes to get hands-on experience working with social data from the Web, a practitioner wishing to expand your competencies and learn unsupervised sentiment analysis, or you are simply interested in social data analysis, this book will prove to be an essential asset. No previous experience with R or statistics is required, though having knowledge of both will enrich your experience. Readers will learn the following:

  • Learn the basics of R and all the data types
  • Explore the vast expanse of social science research
  • Discover more about data potential, the pitfalls, and inferential gotchas
  • Gain an insight into the concepts of supervised and unsupervised learning
  • Familiarize yourself with visualization and some cognitive pitfalls
  • Delve into exploratory data analysis
  • Understand the minute details of sentiment analysis

How to Enter?

All you need to do is share your favorite effort in social media mining--or, more broadly, in text analysis and natural language processing--in the comments section of this blog. This can be some analytical output, a seminal white paper, or an interesting commercial or open source package! In this way, there are no losers, as we will all learn.

The first five commenters will win a free copy of the eBook. (DC2 board members and staff are not eligible to win.) Share your public social media accounts (about.me, Twitter, LinkedIn, etc.) in your comment, or email media@datacommunitydc.org after posting.

Where are the Deep Learning Courses?

This is a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup.

Why aren't there more Deep Learning talks, tutorials, or workshops in DC2?

It's been about two months since my Deep Learning talk at Artisphere for DC2. Again, thanks to the organizers (especially Harlan Harris and Sean Gonzalez) and the sponsors (especially Arlington Economic Development). We had a great turnout and a lot of good questions that night. Both right after the talk and at other Meetups since, I've been encouraged by the tidal wave of interest from teaching organizations and prospective students alike.

First some preemptive answers to the “FAQ” downstream of the talk:

  • Mary Galvin wrote a blog review of this event.
  • Yes, the slides are available.
  • Yes, corresponding audio is also available (thanks Geoff Moes).
  • A recently "reconstructed" talk combining the slides and audio is also now available!
  • Where else can I learn more about Deep Learning as a data scientist? (This may be a request to teach, a question about how to do something in Deep Learning, a question about theory, or a request to do an internship. They're all basically the same thing.)

It's this last question that's the focus of this blog post. Lots of people have asked, and there are some answers out there already, but if people in the DC MSA are really interested, there could be more. At the end of this post is a survey—if you want more Deep Learning, let DC2 know what you want and together we'll figure out what we can make happen.

There actually was a class...

Aaron Schumacher and Tommy Shen invited me to come talk in April for General Assemb.ly's Data Science course. I did teach one Deep Learning module for them. That module was a slightly longer version of the talk I gave at Artisphere combined with one abbreviated “hands on” module on unsupervised feature learning based on Stanford's tutorial. It didn't help that the tutorial was written in Octave and the class had mostly been using Python up to that point. Though feedback was generally positive for the Deep Learning module, some students wondered if they could get a little more hands on and focus on specifics. And I empathize with them. I've spent real money on Deep Learning tutorials that I thought could have been much more useful if they were more hands on.

Though I've appreciated all the invitations to teach courses, workshops, or lectures, except for the General Assemb.ly course, I've turned down all the invitations to teach something more on Deep Learning. This is not because the data science community here in DC is already expert in Deep Learning or because it's not worth teaching. Quite the opposite. I've not committed to teach more Deep Learning mostly because of these three reasons:

  1. There are already significant Deep Learning Tutorial resources out there,
  2. There are significant front end investments that neophytes need to make for any workshop or tutorial to be valuable to both the class and instructor and,
  3. I haven't found a teaching model in the DC MSA that convinces me teaching a “traditional” class in the formal sense is a better investment of time than instruction through project-based learning on research work contracted through my company.

Resources to learn Deep Learning

There are already many freely available resources to learn the theory of Deep Learning, and it's made even more accessible by many of the very lucid authors who participate in this community. My talk was cherry-picked from a number of these materials and news stories. Here are some representative links that can connect you to much of the mainstream literature and discussion in Deep Learning:

  • The tutorials link on the DeepLearning.net page
  • NYU's Deep Learning course material
  • Yann LeCun's overview of Deep Learning with Marc'Aurelio Ranzato
  • Geoff Hinton's Coursera course on Neural Networks
  • A book on Deep Learning from the Microsoft Speech Group
  • A reading list from Carnegie Mellon with student notes on many of the papers
  • A Google+ page on Deep Learning

This is the first reason I don't think it's all that valuable for DC to have more of its own Deep Learning “academic” tutorials. And by “academic” I mean tutorials that don't end with students leaving the class successfully implementing systems that learn representations to do amazing things with those learned features. I'm happy to give tutorials in that “academic” direction or shape them based on my own biases, but I doubt I'd improve on what's already out there. I've been doing machine learning for 15 years, so I start with some background to deeply appreciate Deep Learning, but I've only been doing Deep Learning for two years now. And my expertise is self-taught. And I never did a post-doc with Geoff Hinton, Yann LeCun or Yoshua Bengio. I'm still learning, myself.

The investments to go from 0 to Deep Learning

It's a joy to teach motivated students who come equipped with all the prerequisites for really mastering a subject. That said, teaching a less equipped, uninvested and/or unmotivated studentry is often an exercise in joint suffering for both students and instructor.

I believe the requests to have a Deep Learning course, tutorial, workshop or another talk are all well intentioned... Except for Sean Gonzalez—it creeps me out how much he wants a workshop. But I think most of this eager interest in tutorials overlooks just how much preparation a student needs to get a good return on their time and tuition. And if they're not getting a good return, what's the point? The last thing I want to do is give the DC2 community a tutorial on “the Past” of neural nets. Here are what I consider some practical prerequisites for folks to really get something out of a hands-on tutorial:

  • An understanding of machine learning, including
    • optimization and stochastic gradient descent (a minimal sketch follows this list)
    • hyperparameter tuning
    • bagging
    • at least a passing understanding of neural nets
  • A pretty good grasp of Python, including
    • a working knowledge of how to configure different packages
    • some appreciation for Theano (warts and all)
    • a good understanding of data preparation
  • Some recent CUDA-capable NVIDIA GPU hardware* configured for your machine
    • CUDA drivers
    • NVIDIA's CUDA examples compiled

*hardware isn't necessarily a prerequisite, but I don't know how you can get an understanding of any more than toy problems on a CPU
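
To make the first prerequisite concrete, here is a minimal sketch of stochastic gradient descent for logistic regression in plain NumPy. It's a toy of my own, not code from any of the resources above and certainly not a Deep Learning system, but it's the kind of thing a student should be comfortable writing before a hands-on workshop:

```python
# Toy stochastic gradient descent for logistic regression in NumPy.
# Illustrates the "optimization and SGD" prerequisite only; real Deep Learning
# work would use a GPU-backed library such as Theano.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)                                        # synthetic features
true_w = rng.randn(20)
y = (X.dot(true_w) + 0.1 * rng.randn(1000) > 0).astype(float)  # synthetic labels

w = np.zeros(20)
learning_rate = 0.1
for epoch in range(10):
    for i in rng.permutation(len(X)):             # one example at a time, in random order
        p = 1.0 / (1.0 + np.exp(-X[i].dot(w)))    # sigmoid prediction
        w -= learning_rate * (p - y[i]) * X[i]    # gradient of the log loss for this example

preds = 1.0 / (1.0 + np.exp(-X.dot(w))) > 0.5
print("training accuracy: %.3f" % (preds == y).mean())
```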

Resources like the ones above are great for getting a student up to speed on the “academic” issues of understanding deep learning, but that only scratches the surface. Once students know what can be done, if they’re anything like me, they want to be able to do it. And at that point, students need a pretty deep understanding of not just the theory, but of both hardware and software to really make some contributions in Deep Learning. Or even apply it to their problem.

Starting with the hardware, let's say, for sake of argument, that you work for the government or are for some other arbitrary reason forced to buy Dell hardware. You begin your journey justifying the $4000 purchase for a machine that might be semi-functional as a Deep Learning platform when there's a $2500 guideline in your department. Individual Dell workstations are like Deep Learning kryptonite, so even if someone in the n layers of approval bureaucracy somehow approved it, it's still the beginning of a frustrating story with an unhappy ending. Or let's say you build your own machine. Now add “building a machine” for a minimum of about $1500 to the prerequisites. But to really get a return in the sweet spot of those components, you probably want to spend at least $2500. Now the prerequisites include a dollar investment in addition to talent and tuition! Or let’s say you’re just going to build out your three-year-old machine you have for the new capability. Oh, you only have a 500W power supply? Lucky you! You’re going shopping! Oh, your machine has an ATI graphics card. I’m sure it’s just a little bit of glue code to repurpose CUDA calls to OpenCL calls for that hardware. Let's say you actually have an NVIDIA card (at least as recent as a GTX 580) and wanted to develop in virtual machines, so you need PCI pass-through to reach the CUDA cores. Lucky you! You have some more reading to do! Pray DenverCoder9's made a summary post in the past 11 years.

“But I run everything in the cloud on EC2,” you say! It's $0.65/hour for G2 instances. And those are the cheap GPU instances. Back of the envelope, it took a week of churning through 1.2 million training images with CUDA convnets (optimized for speed) to produce a breakthrough result. At $0.65/hour, you get maybe 20 or 30 tries doing that before it would have made more sense to have built your own machine. This isn't a crazy way to learn, but any psychological disincentive to experimentation, even $0.65/hour, seems like an unnecessary distraction. I also can't endorse the idea of “dabbling” in Deep Learning; it seems akin to “dabbling” in having children—you either make the commitment or you don't.
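
Working that envelope out explicitly, using the $0.65/hour rate and the week-long training run quoted above and the roughly $2500 build cost mentioned earlier:

```python
# EC2 vs. build-your-own, using the figures quoted in this post.
g2_rate = 0.65                 # dollars per hour for a G2 instance
run_hours = 7 * 24             # roughly one week of training per experiment
build_cost = 2500.0            # rough cost of a capable home-built machine

cost_per_run = g2_rate * run_hours            # ~$109 per experiment
breakeven_runs = build_cost / cost_per_run    # ~23 experiments
print("Cost per run:  $%.0f" % cost_per_run)
print("Break-even at: %.0f runs" % breakeven_runs)
```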

At this point, I’m not aware of an “import deeplearning” package in Python that can then fit a nine layer sparse autoencoder with invisible CUDA calls to your GPU on 10 million images at the ipython command line. Though people are trying. That's an extreme example, but in general, you need a flexible, stable codebase to even experiment at a useful scale—and that's really what we data scientists should be doing. Toys are fine and all, but if scale up means a qualitatively different solution, why learn the toy? And that means getting acquainted with the pros and cons of various codebases out there. Or writing your own, which... Good luck!

DC Metro-area teaching models

I start from the premise that no good teacher in the history of teaching has ever been rewarded appropriately with pay for their contributions and most teaching rewards are personal. I accept that premise. And this is all I really ever expect from teaching. I do, however, believe teaching is becoming even less attractive to good teachers every year at every stage of lifelong learning. Traditional post-secondary instructional models are clearly collapsing. Brick and mortar university degrees often trap graduates in debt at the same time the universities have already outsourced their actual teaching mission to low-cost adjunct staff and diverted funds to marketing curricula rather than teaching them. For-profit institutions are even worse. Compensation for a career in public education has never been particularly attractive, but still there have always been teachers who love to teach, are good at it, and do it anyway. However, new narrow metric-based approaches that hold teachers responsible for the students they're dealt rather than the quality of their teaching can be demoralizing for even the most self-possessed teachers. These developments threaten to reduce that pool of quality teachers to a sparse band of marginalized die-hards.

But enough of my view of “teaching” the way most people typically blindly suggest I do it. The formal and informal teaching options in the DC MSA mirror these broader developments. I run a company with active contracts and however much I might love teaching and would like to see a well-trained crop of deep learning experts in the region, the investment doesn't add up. So I continue to mentor colleagues and partners through contracted research projects.

I don't know all the models for teaching and haven't spent a lot of time understanding them, but none seem to make sense to me in terms of time invested to teach students—partly because many of them really can't get at the hardware part of the list of prerequisites above. This is my vague understanding of compensation models generally available in the online space*:

  • Udemy – produce and own a "digital asset" of the course content and sell tuition and advertising as a MOOC. I have no experience with Udemy, but some people seemed happy to have made $20,000 in a month. Thanks to Valerie at Feastie for suggesting this option.
  • Statistics.com – Typically a few thousand for four sessions that Statistics.com then sells; I believe this must be a “work for hire” copyright model for the digital asset that Statistics.com buys from the instructor. I assume it's something akin to commissioned art, that once you create, you no longer own. [Editor’s note: Statistics.com is a sponsor of Data Science DC. The arrangement that John describes is similar to our understanding too.]
  • Myngle – Sell lots of online lessons for typically less than a 30% share.

And this is my understanding of compensation models locally available in the DC MSA*:

  • General Assemb.ly – Between 15-20% of tuition (where tuition may be $4000/student for a semester class).
  • District Data Labs Workshop – Splits total workshop tuition or profit 50% with the instructor—which may be the best deal I've heard, but 50% is a lot to pay for advertising and logistics. [Editor's note: These are the workshops that Data Community DC runs with our partner DDL.]
  • Give a lecture – typically a one time lecture with a modest honorarium ($100s) that may include travel. I've given these kinds of lectures at GMU and Marymount.
  • Adjunct at a local university – This is often a very labor- and commute-intensive investment and pays no better (with no benefits) than a few thousand dollars. Georgetown will pay about $200 per contact hour with students. Assuming there are three hours of out of classroom commitment for every hour in class, this probably ends up somewhere in the $50 per hour range. All this said, this was the suggestion of a respected entrepreneur in the DC region.
  • Tenure-track position at a local university – As an Assistant Professor, you will typically have to forego being anything but a glorified post-doc until your tenure review. And good luck convincing this crowd they need you enough to hire you with tenure.

*These are what I understand to be the approximate options and if you got a worse or better deal, please understand I might be wrong about these specific figures. I'm not wrong, though, that none of these are “market rate” for an experienced data scientist in the DC MSA.

Currently, all of my teaching happens through hands-on internships and project-based learning at my company, where I know the students (i.e. my colleagues, coworkers, subcontractors and partners) are motivated and I know they have sufficient resources to succeed (including hardware). When I “teach,” I typically do it for free, and I try hard to avoid organizations that create asymmetrical relationships with their instructors or sell instructor time as their primary “product” at a steep discount to the instructor compensation. Though polemic, Mike Selik summarized the same issue of cut rate data science in "The End of Kaggle." I'd love to hear of a good model where students could really get the three practical prerequisites for Deep Learning and how I could help make that happen here in DC2 short of making “teaching” my primary vocation. If there's a viable model for that out there, please let me know. If you still think you'd like to learn more about Deep Learning through DC2, please help us understand what you'd want out of it and whether you'd be able to bring your own hardware.

[wufoo username="datacommunitydc" formhash="m11ujb9d0m66byv" autoresize="true" height="1073" header="show" ssl="true"]

A Rush of Ideas: Kalev Leetaru at Data Science DC

This review of the April Data Science DC Meetup was written by Ross Mohan. Ross is a solutions architect for Five 9 Group.

Perhaps you’ve heard the phrase lately “software is eating the world”. Well, to be successful at that, it’s going to have to do at least as good a job of eating the world’s data as do the systems of Kalev Leetaru, Georgetown/Yahoo! fellow.

Kalev Leetaru, lead investigator on GDELT and other tools, defines “world class” work — certainly in the sense of size and scope of data. The goal of GDELT and related systems is to stream global news and social media in as near realtime as possible through multiple processing steps, to arrive at reliable tone (sentiment) mining and differential conflict detection — and to do so globally. It is a grand goal.

Kalev Leetaru’s talk covered several broad areas: the history of data and communication; data quality and “gotcha” issues in data sourcing and curation; the geography of Twitter; processing architecture, toolkits, and considerations; and observations on data formatting. In each he had a fresh perspective or a novel idea, born of the requirement to handle enormous quantities of ‘noisy’ or ‘dirty’ data.

Perspectives

Leetaru observed that “the map is not the territory” in the sense that actual voting, resource, or policy boundaries as measured by various data sources may not match assigned national boundaries. He flagged this as a question of “spatial error bars” for maps.

Distinguishing global data science from well-established HPC-like pursuits (such as computational chemistry), Kalev Leetaru observed that we make our own bespoke toolkits, and that there is no single “magic toolkit” for Big Data, so we should be prepared and willing to spend time putting our toolchain together.

After talking a bit about the historical evolution and growth of data, Kalev Leetaru asked a few perspective-changing questions (some clearly relevant to intelligence agency needs): How do you find all protests? How do you locate all law books? Some of the more interesting data curation tools and resources he mentioned — and a lot more — can be found by the interested reader in The Oxford Guide to Library Research by Thomas Mann.

GDELT (covered further below) labels parse trees with error rates, and reaches beyond the “WHAT” of simple news media to tell us the WHY, and ‘how reliable.’ One GDELT output product among many is the Daily Global Conflict Report, which covers world leader emotional state and differential change in conflict, not absolute markers.

One recurring theme was finding ways to define and support “truth.” Kalev Leetaru decried one current trend in Big Data, the so-called “Apple Effect”: making luscious pictures from data, with more focus on appearance than actual ground truth. One example he cited was a conclusion from a recent report on Syria, which -- blithely based on geotagged English-language tweets and Facebook postings -- cast a skewed light on Syria’s rebels (Bzzzzzt!).

Twitter

Leetaru provided one answer on “how to ‘ground truth’ data” by asking “how accurate are geotagged tweets?” Such tweets are, after all, only 3% of the total. But he reliably used those tweets. How? By correlating location to electric power availability (r = 0.89). He also talked about how to handle emoticons, irony, sarcasm, and other affective language, cautioning analysts to think beyond blindly plugging data into pictures.

Kalev Leetaru talked engagingly about the geography of Twitter, encouraging us to do more RTFD (D = data) than RTFM. Cut your own way through the forest: the valid maps have not been made yet, so be prepared to make your own. Some of the challenges he cited were how to break up typical #hashtagswithnowhitespace and put them back into sentences, and how to build — and maintain — sentiment/tone dictionaries; expect, therefore, to spend the vast majority of time on innovative projects in human tuning of the algorithms and understanding of the data, and then iterating the machine. Refreshingly “hands on.”

Scale and Tech Architecture

Kalev Leetaru turned to discuss the scale of the data, which is now easily in the petabytes-per-day range. There is no longer any question that automation must be used and that serious machinery will be involved. Our job is to get that automation machinery doing the right thing, and if we do so, we can measure the ‘heartbeat of society.’

For a book images project (60 million images across hundreds of years) he mentioned a number of tools and file systems (but neither Gluster nor CEPH, disappointingly to this reviewer!) and delved deeply and masterfully into the question of how to clean and manage the very dirty data of “closed captioning” found in news reports. To full-text geocode and analyze half a million hours of news (from the Internet Archive), we need fast language detection and captioning error assessment. What makes this task horrifically difficult is that POS tagging “fails catastrophically on closed captioning,” and that closed captioning is far worse in quality than Optical Character Recognition. The standard Stanford NL Understanding toolkit is very “fragile” in this domain, one reason being that news media has an extremely high density of location references, forcing the analyst to use context to disambiguate.

He covered his GDELT (Global Database of Events, Language, and Tone), covering human/societal behavior and beliefs at scale around the world. A system of half a billion plus georeferenced rows, 58 columns wide, comprising 100,000 sources such as broadcast, print, and online media back to 1979, it relies on both human translation and Google Translate, and will soon be extended across languages and back to the 1800s. Further, he’s incorporating 21 billion words of academic literature into this model (a first!) and expects availability in Summer 2014. (Sources include JSTOR, DTIC, CIA, CVORE, CiteSeerX, and IA.)

GDELT’s architecture, which relies heavily on the Google Cloud and BigQuery, can stream at 100,000 input observations/second. This reviewer wanted to ask him about update and delete needs and speeds, but the stream is designed to optimize ingest and query. GDELT tools were myriad, but Perl was frequently mentioned (for text processing).

Kalev Leetaru shared some post-GDELT-construction takeaways: “it’s not all English” and “watch out for full Unicode compliance” in your toolset, lest your lovely data processing stack SEGFAULT halfway through a load. Store data in whatever is easy to maintain and fast. Modularity is good, but performance can be an issue; watch out for XML, which bogs down processing on highly nested data--use it for interchange more than anything. Sharing seems “nice,” but “you can’t share a graph,” and “RAM disk is your friend,” even more so than SSD, FusionIO, or fast SANs.

The talk, like this blog post, ran over allotted space and time, but the talk was well worth the effort spent understanding it.

Keep it Simple: How to Build a Successful Business Intelligence/Data Warehouse Architecture

Is your data warehouse architecture starting to look like a Rube Goldberg machine held together with duct tape instead of the elegant solution enabling data driven decision making that you envisioned? If your organization is anything like the ones I’ve worked in, then I suspect it might. Many businesses say they recognize that data is an asset, but when it comes to implementing solutions, the focus on providing business value is quickly lost as technical complexities pile up.


How can you recognize if your data warehouse is getting too complicated?

Does it have multiple layers that capture the same data in just a slightly different way? An organization I worked with determined that they needed 4 database layers (staging, long term staging, enterprise data warehouse, and data marts) with significant amounts of duplication. The duplication resulted from each layer not having a clear purpose, but even with more clarity on purpose, this architecture makes adding, changing and maintaining data harder at every turn.

Are you using technologies just because you have used them in the past? Or because you thought they would be cool to try out? An organization I worked with implemented a fantastically simple data warehouse star schema (http://en.wikipedia.org/wiki/Star_schema) with well under 500 GB of data. Unfortunately, they decided to complicate the data warehouse by adding a semantic layer to support a BI tool and an OLAP cube (which in some ways was a second semantic layer to support BI tools). There is nothing wrong with semantic layers or OLAP cubes. In fact, there are many valid reasons to use them. But if you do not have such a valid reason, they become just another piece of the data architecture that requires maintenance.

Has someone asked for data that “should” be easy to get, but instead will take weeks of dedicated effort to pull together? I frequently encounter requests that sound simple, but the number of underlying systems involved and the lack of consistent data integration practices expand the scope exponentially.

Before I bring up too many bad memories of technical complexities taking over a BI/DW project, I want to get into what to do to avoid making things overcomplicated. The most important thing is to find a way that works for your organization to stay focused on business value.

If you find yourself thinking…

“The data is in staging, but we need to transform it into the operational data store, enterprise data warehouse and update 5 data marts before anyone can access the data.”

or

“I am going to try [new technology] because I want to learn more about it.”

or

“I keep having to pull together the customer data and it takes 2 weeks just to get an approved list of all customers.”

Stop, drop and roll--oh wait, you’re not technically on fire, so just stopping should do. Take some time to consider how to reset so that the focus is on providing business value. You might try using an approach such as the 5 Whys, which was developed for root cause analysis by Sakichi Toyoda for Toyota. It forces reflection on a specific problem and helps you drill down into the specific cause. Why not try it out to see if you can find the root cause of complexity in a BI/DW project? It might just help you reduce or eliminate complexities when there is no good reason for the complexity in the first place.

Another suggestion is to identify areas of complexity from a technical perspective, but don’t stop there. The crucial next step is to determine how the complex technical environment impacts business users. For example, a technical team identifies two complex ETL processes for loading sales and HR data. Both include one-off logic and processes that make it difficult to discern what is going on, so it takes hours to troubleshoot issues that arise. In addition, the performance of both ETL processes has significantly degraded. The business users don’t really care about all that, but they have been complaining more and more about delays in getting the latest sales figures. When you connect the growing complexity to the delays in getting important data, the business users can productively contribute to a discussion on priority and business value. In this case, sales data would take clear precedence over HR data. Both can be added to a backlog, along with any other areas of complexity identified, and addressed in priority order.

Neither of these is a quick fix, but even slowly chipping away at overly complex areas will yield immediate benefits. Each simplification makes understanding, maintaining and extending the existing system easier.

Bio

Sara Handel, Co-founder of DC Business Intelligentsia, Business Intelligence Lead at Excella Consulting (www.excella.com) - I love working on projects that transform data into clear, meaningful information that can help business leaders shape their strategies and make better decisions. My systems engineering education coupled with my expertise in Business Intelligence (BI) allows me to help clients find ways to maximize their systems' data capabilities from requirements through long term maintenance. I co-founded the DC Business Intelligentsia meetup to foster a community of people interested in contributing to the better use of data.