# Building Data Apps with Python on August 23rd

Data Community DC and District Data Labs are excited to be hosting another Building Data Apps with Python workshop on August 23rd.  For more info and to sign up, go to http://bit.ly/V4used.  There's even an early bird discount if you register before the end of this month!

## Overview

Data products are usually software applications that derive their value from data by leveraging the data science pipeline and generate data through their operation. They aren't apps with data, nor are they one-time analyses that produce insights - they are operational and interactive. The rise of these types of applications has directly contributed to the rise of the data scientist and the idea that data scientists are professionals "who are better at statistics than any software engineer and better at software engineering than any statistician."

These applications have been largely built with Python. Python is flexible enough to develop extremely quickly on many different types of servers and has a rich tradition in web applications. Python contributes to every stage of the data science pipeline including real time ingestion and the production of APIs, and it is powerful enough to perform machine learning computations. In this class we’ll produce a data product with Python, leveraging every stage of the data science pipeline to produce a book recommender.

## What You Will Learn

Python is one of the most popular programming languages for data analysis. Therefore, it is important to have a basic working knowledge of the language in order to access more complex topics in data science and natural language processing. The purpose of this one-day course is to introduce the development process in Python using a project-based, hands-on approach. In particular, you will learn how to structure a data product using every stage of the data science pipeline, including ingesting data from the web, wrangling data into a structured database, computing a non-negative matrix factorization with Python, and producing a web-based report.

## Course Outline

The workshop will cover the following topics:

• Basic project structure of a Python application

• virtualenv & virtualenvwrapper

• Managing requirements outside the stdlib

• Creating a testing framework with nose

• Ingesting data with requests.py

• Wrangling data into SQLite Databases using SQLAlchemy

• Building a recommender system with Python

• Computing a matrix factorization with NumPy

• Storing computational models using pickles

• Reporting data with JSON

• Data visualization with Jinja2

After this course you should understand how to build a data product using Python and will have built a recommender system that implements the entire data science pipeline.
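To give a feel for the factorization step in the outline above, a non-negative matrix factorization can be computed in a few lines of NumPy using multiplicative updates, and the fitted factors pickled for later reporting. The toy ratings matrix and the Lee–Seung update rule shown here are illustrative choices, not the workshop's actual materials:

```python
import pickle
import numpy as np

def nmf(R, k=2, iters=200, eps=1e-9):
    """Factor a non-negative ratings matrix R (readers x books) into W @ H
    using multiplicative updates (Lee & Seung)."""
    n, m = R.shape
    rng = np.random.default_rng(42)
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(iters):
        # eps guards against division by zero; updates keep W, H non-negative
        H *= (W.T @ R) / (W.T @ W @ H + eps)
        W *= (R @ H.T) / (W @ H @ H.T + eps)
    return W, H

# A toy ratings matrix: 4 readers x 3 books, zeros are unrated
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])

W, H = nmf(R, k=2)
predictions = W @ H  # reconstructed scores fill in the unrated entries

# Persist the fitted factors, as the outline's pickle step suggests
blob = pickle.dumps((W, H))
```

The reconstructed matrix `W @ H` scores every reader/book pair, which is what a recommender reports back to the user; the pickled blob is what a report or API layer would load later.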

## Instructor: Benjamin Bengfort

Benjamin is an experienced Data Scientist and Python developer who has worked in military, industry, and academia for the past eight years. He is currently pursuing his PhD in Computer Science at the University of Maryland, College Park, doing research in Metacognition and Active Logic. He is also a Data Scientist at Cobrain Company in Bethesda, MD, where he builds data products including recommender systems and classifier models. He holds a Master's degree from North Dakota State University, where he taught undergraduate Computer Science courses. He is also adjunct faculty at Georgetown University, where he teaches Data Science and Analytics.

# High-Performance Computing in R Workshop

Data Community DC and District Data Labs are excited to be hosting a High-Performance Computing with R workshop on June 21st, 2014, taught by Yale professor and R package author Jay Emerson. If you're interested in learning about high-performance computing, including concepts such as memory management, algorithmic efficiency, parallel programming, handling larger-than-RAM matrices, and using shared memory, this is an awesome way to learn!

To reserve a spot, go to http://bit.ly/ddlhpcr.

## Overview

This intermediate-level masterclass will introduce you to topics in high-performance computing with R. We will begin by examining a range of related topics including memory management and algorithmic efficiency. Next, we will quickly explore the new parallel package (containing snow and multicore). We will then concentrate on the elegant framework for parallel programming offered by the foreach package and the associated parallel backends. The R package management system, including the C/C++ interface and use of the Rcpp package, will be covered. We will conclude with basic examples of handling larger-than-RAM numeric matrices and use of shared memory. Hands-on exercises will be used throughout.

## What will I learn?

Different people approach statistical computing with R in different ways. It can be helpful to work on real data problems and learn something about R “on the fly” while trying to solve a problem. But it is also useful to have a more organized, formal presentation without the distraction of a complicated applied problem. This course offers four distinct modules which adopt both approaches and offer some overlap across the modules, helping to reinforce the key concepts. This is an active-learning class where attendees will benefit from working along with the instructor. Roughly, the modules include:

1. An intensive review of the core language syntax and data structures for working with and exploring data: functions; conditionals; arguments; loops; subsetting; manipulating and cleaning data; efficiency considerations and best practices, including loops and vector operations, memory overhead, and optimizing performance.

2. Motivating parallel programming with an eye on programming efficiency: a case study. Processing, manipulating, and conducting a basic analysis of 100-200 MB of raw microarray data provides an excellent challenge on standard laptops. It is large enough to be mildly annoying, yet small enough that we can make progress and see the benefits of programming efficiency and parallel programming.

3. Topics in high-performance computing with R, including packages parallel and foreach. Hands-on examples will help reinforce key concepts and techniques.

4. Authoring R packages, including an introduction to the C/C++ interface and the use of Rcpp for high-performance computing. Participants will build a toy package including calls to C/C++ functions.

## Is this class right for me?

This class will be a good fit for you if you are comfortable working in R and are familiar with R's core data structures (vectors, matrices, lists, and data frames). You are comfortable with for loops and preferably aware of R's apply-family of functions. Ideally you will have written a few functions on your own. You have some experience working with R, but are ready to take it to the next level. Or, you may have considerable experience with other programming languages but are interested in quickly getting up to speed in the areas covered by this masterclass.

## After this workshop, what will I be able to do?

You will be in a better position to code efficiently with R, perhaps avoiding the need, in some cases, to resort to C/C++ or parallel programming. But you will be able to implement so-called embarrassingly parallel algorithms in R when the need arises, and you'll be ready to exploit R's C/C++ interface in several ways. You'll be in a position to author your own R package that can include C/C++ code.

All participants will receive electronic copies of all slides, data sets, exercises, and R scripts used in the course.

## What do I need to bring?

You will need your laptop with the latest version of R. I recommend use of the RStudio IDE, but it is not necessary. A few add-on packages will be used in the workshop, including Rcpp and foreach. As a complement to foreach you should also install doMC (Linux or MacOS only) and doSNOW (all platforms). If you want to work along with the C/C++ interface segment, some extra preparation will be required. Rcpp and use of the C/C++ interface require compilers and extra tools; the folks at RStudio have a nice page that summarizes the requirements. Please note that these requirements may not be trivial (particularly on Windows) and need to be completed prior to the workshop if you intend to compile C/C++ code and use Rcpp during the workshop.

## Instructor

John W. Emerson (Jay) is Director of Graduate Studies in the Department of Statistics at Yale University. He teaches a range of graduate and undergraduate courses as well as workshops, tutorials, and short courses at all levels around the world. His interests are in computational statistics and graphics, and his applied work ranges from topics in sports statistics to bioinformatics, environmental statistics, and Big Data challenges.

He is the author of several R packages including bcp (for Bayesian change point analysis), bigmemory and sister packages (towards a scalable solution for statistical computing with massive data), and gpairs (for generalized pairs plots). His teaching style is engaging and his workshops are active, hands-on learning experiences.

You can reserve your spot by going to http://bit.ly/ddlhpcr.

# Elements of an Analytics "Education"

This is a guest post by Wen Phan, who will be completing a Master of Science in Business at George Washington University (GWU) School of Business. Wen is the recipient of the GWU Business Analytics Award for Excellence and Chair of the Business Analytics Symposium, a full-day symposium on business analytics on Friday, May 30th -- all are invited to attend. Follow Wen on Twitter @wenphan.

We have read the infamous McKinsey report. There is the estimated 140,000- to 190,000-person shortage of deep analytic talent by 2018, and an even bigger need - 1.5 million professionals - for those who can manage and consume analytical content. Justin Timberlake brought sexy back in 2006, but it'll be the data scientist who brings sexy to the 21st century. While data scientists are arguably the poster child of this most recent data hype, savvy data professionals are really required across many levels and functions of an organization. Consequently, a number of new and specialized advanced degree programs in data and analytics have emerged over the past several years – many of which are not housed in the traditional analytical departments, such as statistics, computer science, or math. These programs are becoming increasingly competitive and graduates of these programs are skilled and in demand. For many just completing their undergraduate degrees or with just a few years of experience, these data degrees have become a viable option for developing skills and connections for a burgeoning industry. For others with several years of experience in adjacent fields, such as myself, such educational opportunities provide a way to help with career transitions and advancement.

I came back to school after having worked for a little over a decade. My undergraduate degree is in electrical engineering and at one point in my career, I worked on some of the most advanced microchips in the world. But I also have experience in operations, software engineering, product management, and marketing. Through it all, I have learned about the art and science of designing and delivering technology and products from ground zero - both from technical and business perspectives. My decision to leave a comfortable, well-paid job to return to school was made in order to leverage my technical and business experience in new ways and gain new skills and experiences to increase my ability to make an impact in organizations.

There are many opinions regarding what is important in an analytics education and just as many options for pursuing it, each with their own merits. Given that, I do believe there are a few competencies that should be developed no matter what educational path one takes, whether it is graduate school, MOOCs, or self-learning. What I offer here are some personal thoughts on these considerations based on my own background, previous professional experiences, and recent educational endeavor with analytics and, more broadly, using technology and problem solving to advance organizational goals.

## Not just stats.

For many, analytics is about statistics and a data degree is just slightly different from a statistics one. There is no doubt that statistics plays a major role in analytics, but it is still just one of the technical skills. If you are a serious direct handler of data of any kind, it will be obvious that programming chops are almost a must. For more customized and sophisticated processing, even substantial computer science knowledge – data structures, algorithms, and design patterns – will be required. Of course, even this idea has been pretty mainstream and is nicely captured by Drew Conway's Data Science Venn Diagram.

Other areas not as obvious to data competency are those of data storage theory and implementation (e.g. relational databases and data warehouses), operations research, and decision analysis. The computer science and statistics portions really focus on the sexy predictive modeling aspects of data. That said, knowing how to effectively collect and store data upstream is tremendously valuable. After all, it is often the case that data extends beyond just one analysis or model. Data begets more data (e.g. data gravity). Many of the underlying statistical methods, such as maximum likelihood estimation (MLE), neural networks, and support vector machines, rely on principles and techniques of operations research. Further, operations research, also called optimization, offers a prescriptive perspective on analytics. Last, it is obvious that analytics can help identify trends, understand customers, and forecast the future. However, in and of themselves those activities do not add any value; it is the decisions and resulting actions taken on those activities that deliver value. But, often, these decisions must be made in the face of substantial uncertainty and risk - hence the importance of critical decision analysis.

The level of expertise required in various technical domains must align with your professional goals, but a basic knowledge of the above should allow you adequate fluency across analytics activities.

## Applied.

I consider analytics an applied degree similar to how engineering is an applied degree. Engineering applies math and science to solve problems. Analytics is similar in this way. One importance of applied fields is that they are where the rubber of theory needs to meet the road of reality. Data is not always normally distributed. In fact, data is not always valid or even consistent. Formal education offers rigor in developing strong foundational knowledge and skills. However, just as important are the skills to deal with reality. It is no myth that 80% of analytics is just about pre-processing the data; I call it dealing with reality. It is important to understand the theory behind the models, and frankly, it's pretty fun to indulge in the intricacies of machine learning and convex optimization. In the end though, those things have been made relatively straightforward to implement with computers. What hasn't (yet) been nicely encapsulated in computer software is the judgment and skill required to handle the ugliness of real-world data. You know what else is reality? Teammates, communication, and project management constraints.

All this is to say that so much of an analytics education includes other areas that are not the theory, and I would argue that the success of many analytics endeavors is limited not by theoretical knowledge, but rather by the practicalities of implementation, whether with data, machines, or people. My personal recommendation to aspiring or budding data geeks is to cut your teeth as much as possible in dealing with reality. Do projects. As many of them as possible. With real data. And real stakeholders. And, for those of you manager types, give it a try; it'll give you the empathy and perspective to effectively work with the hardcore data scientists and manage the analytics process.

## Working with complexity and ambiguity.

The funny thing about data is that you have problems both when you have too little and too much of it. With too little data, you are often making inferences and assessing the confidence of those inferences. With too much data, you are trying not to get confused. In the best case scenarios, your objectives in mining the data are straightforward and crystal clear. However, that is often not the case and exploration is required. Navigating this process of exploration and value discovery can be complex and ambiguous. There are the questions of “where do I start?” and “how far do I go?” This really speaks to the art of working with data. You pick up best practices along the way and develop some of your own. Initial exploration tactics may be as simple as profiling all attributes and computing correlations among a few of them, seeing if anything looks promising or sticks. This process is further complicated with “big data”, where computational time is non-negligible and introduces feedback delays into any kind of exploratory data analysis.
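That simple opening tactic, profiling every attribute and scanning pairwise correlations for anything that sticks, takes only a few lines of NumPy. The synthetic dataset here is an assumption for illustration; real exploration would start from your own table:

```python
import numpy as np

# Hypothetical dataset: 200 rows, 4 numeric attributes
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([
    x,                                        # attribute 0
    2 * x + rng.normal(scale=0.1, size=200),  # strongly tied to attribute 0
    rng.normal(size=200),                     # independent noise
    rng.uniform(size=200),                    # independent noise
])

# Profile each attribute: min, max, mean, std
for i, col in enumerate(data.T):
    print(f"attr{i}: min={col.min():.2f} max={col.max():.2f} "
          f"mean={col.mean():.2f} std={col.std():.2f}")

# Pairwise correlations: any off-diagonal entry far from zero
# is a candidate relationship worth a closer look
corr = np.corrcoef(data, rowvar=False)
strong = np.argwhere(np.triu(np.abs(corr) > 0.8, k=1))
print("strongly correlated attribute pairs:", strong.tolist())
```

On this toy data the scan surfaces the engineered pair (attributes 0 and 1) and nothing else, which is exactly the "does anything stick?" signal the paragraph describes.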

You can search the web for all kinds of advice on skills to develop for a data career. The few tidbits I include above are just my perspectives on some of the higher order bits in developing solid data skills. Advanced degree programs offer compelling environments to build these skills and gain exposure in an efficient way, including a professional network, resources, and opportunities. However, they are not the only way. As with all professional endeavors, one needs to assess his or her goals, background, and situation to ultimately determine the educational path that makes sense.


# Where are the Deep Learning Courses?

This is a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC.

## Why aren't there more Deep Learning talks, tutorials, or workshops in DC2?

It's been about two months since my Deep Learning talk at Artisphere for DC2. Again, thanks to the organizers (especially Harlan Harris and Sean Gonzalez) and the sponsors (especially Arlington Economic Development). We had a great turnout and a lot of good questions that night. Since the talk and at other Meetups since, I've been encouraged by the tidal wave of interest from teaching organizations and prospective students alike.

First, some preemptive answers to the “FAQ” downstream of the talk:

• Mary Galvin wrote a blog review of this event.
• Yes, the slides are available.
• Yes, corresponding audio is also available (thanks Geoff Moes).
• A recently "reconstructed" talk combining the slides and audio is also now available!
• Where else can I learn more about Deep Learning as a data scientist? (This may be a request to teach, a question about how to do something in Deep Learning, a question about theory, or a request to do an internship. They're all basically the same thing.)

It's this last question that's the focus of this blog post. Lots of people have asked and there are some answers out there already, but if people in the DC MSA are really interested, there could be more. At the end of this post is a survey—if you want more Deep Learning, let DC2 know what you want and together we'll figure out what we can make happen.

## There actually was a class...

Aaron Schumacher and Tommy Shen invited me to come talk in April for General Assemb.ly's Data Science course. I did teach one Deep Learning module for them. That module was a slightly longer version of the talk I gave at Artisphere combined with one abbreviated “hands on” module on unsupervised feature learning based on Stanford's tutorial. It didn't help that the tutorial was written in Octave and the class had mostly been using Python up to that point. Though feedback was generally positive for the Deep Learning module, some students wondered if they could get a little more hands on and focus on specifics. And I empathize with them. I've spent real money on Deep Learning tutorials that I thought could have been much more useful if they were more hands on.

Though I've appreciated all the invitations to teach courses, workshops, or lectures, except for the General Assemb.ly course, I've turned down all the invitations to teach something more on Deep Learning. This is not because the data science community here in DC is already expert in Deep Learning or because it's not worth teaching. Quite the opposite. I've not committed to teach more Deep Learning mostly because of these three reasons:

1. There are already significant Deep Learning Tutorial resources out there,
2. There are significant front end investments that neophytes need to make for any workshop or tutorial to be valuable to both the class and instructor and,
3. I haven't found a teaching model in the DC MSA that convinces me teaching a “traditional” class in the formal sense is a better investment of time than instruction through project-based learning on research work contracted through my company.

## Resources to learn Deep Learning

There are already many freely available resources to learn the theory of Deep Learning, and it's made even more accessible by many of the very lucid authors who participate in this community. My talk was cherry-picked from a number of these materials and news stories. Here are some representative links that can connect you to much of the mainstream literature and discussion in Deep Learning:

• The tutorials link on the DeepLearning.net page
• NYU's Deep Learning course material
• Yann LeCun's overview of Deep Learning with Marc'Aurelio Ranzato
• Geoff Hinton's Coursera course on Neural Networks
• A book on Deep Learning from the Microsoft Speech Group
• A reading list from Carnegie Mellon with student notes on many of the papers
• A Google+ page on Deep Learning

This is the first reason I don't think it's all that valuable for DC to have more of its own Deep Learning “academic” tutorials. And by “academic” I mean tutorials that don't end with students leaving the class successfully implementing systems that learn representations to do amazing things with those learned features. I'm happy to give tutorials in that “academic” direction or shape them based on my own biases, but I doubt I'd improve on what's already out there. I've been doing machine learning for 15 years, so I start with some background to deeply appreciate Deep Learning, but I've only been doing Deep Learning for two years now. And my expertise is self-taught. And I never did a post-doc with Geoff Hinton, Yann LeCun or Yoshua Bengio. I'm still learning, myself.

## The investments to go from 0 to Deep Learning

It's a joy to teach motivated students who come equipped with all the prerequisites for really mastering a subject. That said, teaching a less equipped, uninvested, and/or unmotivated student body is often an exercise in joint suffering for both students and instructor.

I believe the requests to have a Deep Learning course, tutorial, workshop, or another talk are all well-intentioned... Except for Sean Gonzalez—it creeps me out how much he wants a workshop. But I think most of this eager interest in tutorials overlooks just how much preparation a student needs to get a good return on their time and tuition. And if they're not getting a good return, what's the point? The last thing I want to do is give the DC2 community a tutorial on “the Past” of neural nets. Here are what I consider some practical prerequisites for folks to really get something out of a hands-on tutorial:

• An understanding of machine learning, including
  • optimization and stochastic gradient descent
  • hyperparameter tuning
  • bagging
  • at least a passing understanding of neural nets
• A pretty good grasp of Python, including
  • a working knowledge of how to configure different packages
  • some appreciation for Theano (warts and all)
  • a good understanding of data preparation
• Some recent CUDA-capable NVIDIA GPU hardware* configured for your machine, including
  • CUDA drivers
  • NVIDIA's CUDA examples compiled

*hardware isn't necessarily a prerequisite, but I don't know how you can get an understanding of any more than toy problems on a CPU
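To make the first prerequisite concrete, stochastic gradient descent is only a few lines of NumPy: pick one example at random, step against its gradient, repeat. The synthetic data, learning rate, and step count below are illustrative assumptions, not material from any particular course:

```python
import numpy as np

# Minimal stochastic gradient descent for least-squares regression:
# one randomly chosen example per update, the same loop that scales
# up to neural network training.
rng = np.random.default_rng(1)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(500, 2))
y = X @ true_w + rng.normal(scale=0.1, size=500)

w = np.zeros(2)
lr = 0.05  # the hyperparameter you would tune
for step in range(5000):
    i = rng.integers(len(X))            # sample one training example
    grad = (X[i] @ w - y[i]) * X[i]     # gradient of 0.5 * (x.w - y)^2
    w -= lr * grad

print(w)  # should land close to true_w
```

Swapping the squared-error gradient for a backpropagated one is, conceptually, all that separates this loop from training a deep network; the hyperparameter tuning bullet above is about choosing `lr` and the step count well.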

Resources like the ones above are great for getting a student up to speed on the “academic” issues of understanding deep learning, but that only scratches the surface. Once students know what can be done, if they’re anything like me, they want to be able to do it. And at that point, students need a pretty deep understanding of not just the theory, but of both hardware and software to really make some contributions in Deep Learning. Or even apply it to their problem.

Starting with the hardware, let's say, for sake of argument, that you work for the government or are for some other arbitrary reason forced to buy Dell hardware. You begin your journey justifying the $4000 purchase for a machine that might be semi-functional as a Deep Learning platform when there's a $2500 guideline in your department. Individual Dell workstations are like Deep Learning kryptonite, so even if someone in the n layers of approval bureaucracy somehow approved it, it's still the beginning of a frustrating story with an unhappy ending. Or let's say you build your own machine. Now add “building a machine” for a minimum of about $1500 to the prerequisites. But to really get a return in the sweet spot of those components, you probably want to spend at least $2500. Now the prerequisites include a dollar investment in addition to talent and tuition! Or let's say you're just going to build out the three-year-old machine you have for the new capability. Oh, you only have a 500W power supply? Lucky you! You're going shopping! Oh, your machine has an ATI graphics card. I'm sure it's just a little bit of glue code to repurpose CUDA calls to OpenCL calls for that hardware. Let's say you actually have an NVIDIA card (at least as recent as a GTX 580) and wanted to develop in virtual machines, so you need PCI pass-through to reach the CUDA cores. Lucky you! You have some more reading to do! Pray DenverCoder9's made a summary post in the past 11 years.

“But I run everything in the cloud on EC2,” you say! It's $0.65/hour for G2 instances. And those are the cheap GPU instances. Back of the envelope, it took a week of churning through 1.2 million training images with CUDA convnets (optimized for speed) to produce a breakthrough result. At $0.65/hour, you get maybe 20 or 30 tries doing that before it would have made more sense to have built your own machine. This isn't a crazy way to learn, but any psychological disincentive to experimentation, even $0.65/hour, seems like an unnecessary distraction. I also can't endorse the idea of “dabbling” in Deep Learning; it seems akin to “dabbling” in having children—you either make the commitment or you don't.

At this point, I'm not aware of an “import deeplearning” package in Python that can then fit a nine-layer sparse autoencoder with invisible CUDA calls to your GPU on 10 million images at the IPython command line. Though people are trying. That's an extreme example, but in general, you need a flexible, stable codebase to even experiment at a useful scale—and that's really what we data scientists should be doing. Toys are fine and all, but if scaling up means a qualitatively different solution, why learn the toy? And that means getting acquainted with the pros and cons of various codebases out there. Or writing your own, which... Good luck!

## DC Metro-area teaching models

I start from the premise that no good teacher in the history of teaching has ever been rewarded appropriately with pay for their contributions and most teaching rewards are personal. I accept that premise. And this is all I really ever expect from teaching. I do, however, believe teaching is becoming even less attractive to good teachers every year at every stage of lifelong learning. Traditional post-secondary instructional models are clearly collapsing.
Brick-and-mortar university degrees often trap graduates in debt at the same time the universities have already outsourced their actual teaching mission to low-cost adjunct staff and diverted funds to marketing curricula rather than teaching them. For-profit institutions are even worse. Compensation for a career in public education has never been particularly attractive, but still there have always been teachers who love to teach, are good at it, and do it anyway. However, new narrow metric-based approaches that hold teachers responsible for the students they're dealt rather than the quality of their teaching can be demoralizing for even the most self-possessed teachers. These developments threaten to reduce that pool of quality teachers to a sparse band of marginalized die-hards. But enough of my view of “teaching” the way most people typically blindly suggest I do it.

The formal and informal teaching options in the DC MSA mirror these broader developments. I run a company with active contracts, and however much I might love teaching and would like to see a well-trained crop of deep learning experts in the region, the investment doesn't add up. So I continue to mentor colleagues and partners through contracted research projects. I don't know all the models for teaching and haven't spent a lot of time understanding them, but none seem to make sense to me in terms of time invested to teach students—partly because many of them really can't get at the hardware part of the list of prerequisites above.

This is my vague understanding of compensation models generally available in the online space*:

• Udemy – produce and own a "digital asset" of the course content and sell tuition and advertising as a MOOC. I have no experience with Udemy, but some people seemed happy to have made $20,000 in a month. Thanks to Valerie at Feastie for suggesting this option.
• Statistics.com – Typically a few thousand for four sessions that Statistics.com then sells; I believe this must be a “work for hire” copyright model for the digital asset that Statistics.com buys from the instructor. I assume it's something akin to commissioned art, that once you create, you no longer own. [Editor’s note: Statistics.com is a sponsor of Data Science DC. The arrangement that John describes is similar to our understanding too.]
• Myngle – Sell lots of online lessons for typically less than a 30% share.

And this is my understanding of compensation models locally available in the DC MSA*:

• General Assemb.ly – Between 15-20% of tuition (where tuition may be $4000/student for a semester class).
• District Data Labs Workshop – Splits total workshop tuition or profit 50% with the instructor—which may be the best deal I've heard, but 50% is a lot to pay for advertising and logistics. [Editor's note: These are the workshops that Data Community DC runs with our partner DDL.]
• Give a lecture – typically a one-time lecture with a modest honorarium ($100s) that may include travel. I've given these kinds of lectures at GMU and Marymount.
• Adjunct at a local university – This is often a very labor- and commute-intensive investment and pays no better (with no benefits) than a few thousand dollars. Georgetown will pay about $200 per contact hour with students. Assuming there are three hours of out-of-classroom commitment for every hour in class, this probably ends up somewhere in the $50 per hour range. All this said, this was the suggestion of a respected entrepreneur in the DC region.
• Tenure-track position at a local university – As an Assistant Professor, you will typically have to forego being anything but a glorified post-doc until your tenure review. And good luck convincing this crowd they need you enough to hire you with tenure.

*These are what I understand to be the approximate options and if you got a worse or better deal, please understand I might be wrong about these specific figures. I'm not wrong, though, that none of these are “market rate” for an experienced data scientist in the DC MSA.

Currently, all of my teaching happens through hands-on internships and project-based learning at my company, where I know the students (i.e. my colleagues, coworkers, subcontractors, and partners) are motivated and I know they have sufficient resources to succeed (including hardware). When I “teach,” I typically do it for free, and I try hard to avoid organizations that create asymmetrical relationships with their instructors or sell instructor time as their primary “product” while steeply discounting the instructor's compensation. Though polemical, Mike Selik summarized the same issue of cut-rate data science in "The End of Kaggle." I'd love to hear of a good model where students could really get the three practical prerequisites for Deep Learning and how I could help make that happen here in DC2, short of making “teaching” my primary vocation. If there's a viable model for that out there, please let me know. If you still think you'd like to learn more about Deep Learning through DC2, please help us understand what you'd want out of it and whether you'd be able to bring your own hardware.

# District Data Labs Project Incubator Program

A crucial part of learning data science is applying the skills you learn to real world projects. Working on interesting projects keeps you motivated to continue learning and helps you sharpen your skills. Working in teams helps you learn from the different experiences of others and generate new ideas about learning avenues you can pursue in the future. That's why District Data Labs is starting a virtual incubator program for data science projects!

The incubator program is:

• Free (no cost to you)
• Part-time (you can work on projects in your spare time)
• Virtual (you don't need to be located in the DC area)

The first class of the incubator is scheduled to run from May through October 2014.  This is a great way to learn by working on a project with other people and even potentially sharing in the rewards of a project that ends up being commercially viable.

For more info and to apply, you can go to http://bit.ly/1dqp11k.

Applications end soon, so get yours in today!

# The Evolution of Big Data Platforms and People

This is a guest post by Paco Nathan. Paco is an O’Reilly author, an Apache Spark open source evangelist with Databricks, and an advisor for Zettacap, Amplify Partners, and The Data Guild. Google lives in his family’s backyard. Paco spoke at Data Science DC in 2012.

A kind of “middleware” for Big Data has been evolving since the mid–2000s. Abstraction layers help make it simpler to write apps in frameworks such as Hadoop. Beyond the relatively simple issue of programming convenience, there are much more complex factors in play. Several open source frameworks have emerged that build on the notion of workflow, exemplifying highly sophisticated features. My recent talk Data Workflows for Machine Learning considers several OSS frameworks in that context, developing a kind of “scorecard” to help assess best-of-breed features. Hopefully it can help your decisions about which frameworks suit your use case needs.

By definition, a workflow encompasses both the automation that we’re leveraging (e.g., machine learning apps running on clusters) as well as people and process. In terms of automation, some larger players have departed from “conventional wisdom” for their clusters and ML apps. For example, while the rest of the industry embraced virtualization, Google avoided that path by using cgroups for isolation. Twitter sponsored a similar open source approach, Apache Mesos, which was credited with helping resolve their “Fail Whale” issues prior to their IPO. As other large firms adopt this strategy, the implication is that VMs may have run out of steam. Certainly, single-digit utilization rates at data centers (the current industry norm) will not scale to handle IoT data rates: energy companies could not handle that surge, let alone the enormous cap-ex implied. I'll be presenting on Datacenter Computing with Apache Mesos next Tuesday at the Big Data DC Meetup, held at AddThis. We’ll discuss the Mesos approach of mixed workloads for better elasticity, higher utilization rates, and lower latency.

On the people side, a very different set of issues looms ahead. Industry is retooling on a massive scale. It’s not about buying a whole new set of expensive tools for Big Data. Rather it’s about retooling how people in general think about computable problems. One vital component may well be the lack of advanced math in the hands of business leaders. Seriously, we still frame requirements for college math in Cold War terms: years of calculus were intended to filter out the best Mechanical Engineering candidates, who could then help build the best ICBMs. However, in business today the leadership needs to understand how to contend with enormous data rates and meanwhile deploy high-ROI apps at scale: how and when to leverage graph queries, sparse matrices, convex optimization, Bayesian statistics – topics that are generally obscured beyond the “killing fields” of calculus.

A new book by Allen Day and me in development at O’Reilly called “Just Enough Math” introduces advanced math for business people, especially to learn how to leverage open source frameworks for Big Data – much of which comes from organizations that leverage sophisticated math, e.g., Twitter. Each morsel of math is considered in the context of concrete business use cases, lots of illustrations, and historical background – along with brief code examples in Python that one can cut & paste.

This next week in the DC area I will be teaching a full-day workshop that includes material from all of the above:

Machine Learning for Managers
Tue, Apr 15, 8:30am–4:30pm (Eastern)
MicroTek, 1101 Vermont Ave NW #700, Washington, DC 20005

That workshop provides an introduction to ML – something quite different than popular MOOCs or vendor training – with emphasis placed as much on the “soft skills” as on the math and coding. We’ll also have a drinkup in the area, to gather informally and discuss related topics in more detail:

Drinks and Data Science
Wed, Apr 16, 6:30–9:00pm (Eastern)
location TBD

Looking forward to meeting you there!

# Energy Education Data Jam

This is a guest post by Austin Brown, co-organizer of the Data Jam, and senior analyst with the National Renewable Energy Laboratory. DC2 urges you to check this out. (And if you participate, please let us know how it goes!)

The Office of Energy Efficiency and Renewable Energy (EERE) at the U.S. Department of Energy (DOE) is hosting an “Energy Education Data Jam,” which will take place on Thursday, March 27, 2014, from 9am to 4pm, in Washington, D.C. This is an event that could really benefit from the participation of some more talented developers, data folks, and designers.

The event features presentations from a great set of experts and innovators: Aneesh Chopra, former White House CTO; Dr. Ed Dieterle, Gates Foundation; Dr. Jan DeWaters, Clarkson University; Dr. Cindy Moss, Discovery Education; and Diane Tucker, Wilson Center.

In the growing ecosystem of energy-related data jams and hackathons, this one will be distinct in that it is targeted toward improving the general understanding of the basics of energy in the U.S., which we have identified as a key obstacle to sensible long-term progress in energy.  We hope that what emerges from this data jam will be applicable to learners of any age – from preschool to adult learners.

EERE is working to amplify our approach to help improve energy understanding, knowledge, and decision-making. To address the measured gap in America's energy literacy, we plan to unite energy experts with the software, visualization, and development communities. This single-day event will bring developers and topic experts together with the goal of creating innovative products and partnerships to directly address energy literacy going forward.

The goal of the data jam is to catalyze development of tools, visualizations, and activities to improve energy literacy by bringing together:

• Developers and designers who understand the problems presented by the energy literacy gap, and have a desire to bring about change
• Educators with knowledge of how students learn, how energy is taught, and ideas about how we can bridge the energy literacy gap
• Energy experts with a high-level understanding of the energy economy and who are capable of deconstructing complicated energy data
• Energy foundations and nonprofits committed to clean energy and an understanding that education can be the first step towards a clean energy economy

No prior experience in energy education is required – just an innovative mindset and a readiness to try to change the thinking on spreading the word about energy.

If you have any questions or would like to RSVP, please send an email to energyliteracy@ee.doe.gov. You can also RSVP through Eventbrite. This event will strive for participation from a number of different backgrounds and expertise and, as such, space will be limited.  We ask that you kindly respond as soon as possible. Lunch will be provided.

# Building Data Apps with Python

Data Community DC and District Data Labs are excited to be offering a Building Data Apps with Python workshop on April 19th, 2014. Python is one of the most popular programming languages for data analysis.  Therefore, it is important to have a basic working knowledge of the language in order to access more complex topics in data science and natural language processing.  The purpose of this one-day course is to introduce the development process in Python using a project-based, hands-on approach.

This course is focused on Python development in a data context for those who aren’t familiar with Python. Other courses like Python Data Analysis focus on data analytics with Python, not on Python development itself.

The main workshop will run from 11am - 6pm with an hour break for lunch around 1pm.  For those who are new to programming, there will be an optional introductory session from 9am - 11am aimed at getting you comfortable enough with Python development to follow along in the main session.

Introductory Session: Python for New Programmers (9am - 11am)

The morning session will teach the fundamentals of Python to those who are new to programming.  Learners will be grouped with a TA to ensure their success in the second session. The goal of this session is to ensure that students can demonstrate basic concepts in a classroom environment through successful completion of hands-on exercises. This beginning session will cover the following basic topics and exercises:

Topics:

• Variables
• Expressions
• Conditionality
• Loops
• Executing Programs
• Object Oriented Programming
• Functions
• Classes

Exercises:

• Write a function to determine if input is even or odd
• Read data from a file
• Count the words/lines in a file
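To give a flavor of what those exercises involve, here is a minimal sketch (the sample file and its contents are invented for illustration):

```python
def is_even(n):
    """Return True if n is even, False if it is odd."""
    return n % 2 == 0


def count_words_and_lines(path):
    """Read a text file and return a (word_count, line_count) tuple."""
    with open(path) as f:
        lines = f.readlines()
    word_count = sum(len(line.split()) for line in lines)
    return word_count, len(lines)


# Create a small sample file so the sketch runs on its own
with open("sample.txt", "w") as f:
    f.write("hello world\npython is fun\n")

print(is_even(10))                          # True
print(count_words_and_lines("sample.txt"))  # (5, 2)
```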

At the end of this session, students should be familiar enough with programming concepts in Python to be able to follow along in the second session. They will have acquired a learning cohort in their classmates and instructors to help them learn Python more thoroughly in the future, and they will have observed Python development in action.

Main Session: Building a Python Application (11am - 6pm)

The afternoon session will focus on Python application development for those who already know how to program and are familiar with Python. In particular, we’ll build a data application from beginning to end in a workshop fashion. This course would be a prerequisite for all other DDL courses offered that use Python.

The following topics will be covered:

• Basic project structure
• virtualenv & virtualenvwrapper
• Building requirements outside the stdlib
• Testing with nose
• Ingesting data with requests
• Munging data into SQLite Databases
• Some simple computations in Python
• Reporting data with JSON
• Data visualization with Jinja2 and Highcharts

We will build a Python application using the data science workflow: using Python to ingest, munge, compute, report, and even visualize. This is a basic, standard workflow that is repeatable and paves the way for more advanced courses using numerical and statistical packages in Python like Pandas and NumPy. In particular, we’ll fetch data from Data.gov, transform it and store it in a SQLite database, then do some simple computation. Then we will use Python to push our analyses out in JSON format and provide a simple reporting technique with Jinja2 and charting using Highcharts.
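That ingest-munge-compute-report workflow can be sketched in a few lines. This is only an illustration, not the workshop's actual code: the inline records stand in for the Data.gov dataset, and the fetch step is shown only as a comment.

```python
import json
import sqlite3

from jinja2 import Template

# Ingest: in the workshop this step fetches a dataset from Data.gov with requests,
# e.g. records = requests.get(<some Data.gov endpoint>).json()
# Here a tiny inline dataset stands in so the sketch runs on its own.
records = [{"category": "fruit", "value": 3}, {"category": "veg", "value": 5}]

# Munge: load the records into a SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (category TEXT, value INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(r["category"], r["value"]) for r in records])

# Compute: a simple aggregate
(total,) = conn.execute("SELECT SUM(value) FROM items").fetchone()

# Report: push the analysis out as JSON ...
report = json.dumps({"total": total})

# ... and render a simple HTML report with a Jinja2 template
# (the workshop's template would embed a Highcharts chart instead)
html = Template("<p>Total value: {{ total }}</p>").render(total=total)

print(report)
print(html)
```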

For more information and to reserve a spot, go to http://bit.ly/1m0y5ws.

Hope to see you there!

# Will big data bring a return of sampling statistics? And a review of Aaron Strauss's talk at DSDC

This guest post by Tommy Jones was originally published on Biased Estimates. Tommy is a statistician or data scientist -- depending on the context -- in Washington, DC. He is a graduate of Georgetown's MS program for mathematics and statistics. Follow him on Twitter @thos_jones.

### Some Background

#### What is sampling statistics?

Sampling statistics concerns the planning, collection, and analysis of survey data. When most people take a statistics course, they are learning "model-based" statistics. (Model-based statistics is not the same as statistical modeling, stick with me here.) Model-based statistics uses a mathematical function to model the distribution of an infinitely-sized population to quantify uncertainty. Sampling statistics, however, uses a priori knowledge of the size of the target population when quantifying uncertainty. The big lesson I learned after taking survey sampling is that if you assume the correct model, then the two statistical philosophies agree. But if your assumed model is wrong, the two approaches give different results. (And one approach has fewer assumptions, bee tee dubs.)
Sampling statistics also has a big bag of other tricks, too many to do justice here. But it provides frameworks for handling missing or biased data, combining data on subpopulations whose sample proportions differ from their proportions of the population, how to sample when subpopulations have very different statistical characteristics, etc.
As I write this, it is entirely possible to earn a PhD in statistics and not take a single course in sampling or survey statistics. Many federal agencies hire statisticians and then send them immediately back to school to places like UMD's Joint Program in Survey Methodology. (The federal government conducts a LOT of surveys.)
I can't claim to be certain, but I think that sampling statistics became esoteric for two reasons. First, surveys (and data collection in general) have traditionally been expensive. Until recently, there weren't many organizations except for the government that had the budget to conduct surveys properly and regularly. (Obviously, there are exceptions.) Second, model-based statistics tend to work well and have broad applicability. You can do a lot with a laptop, a .csv file, and the right education. My guess is that these two factors have meant that the vast majority of statisticians and statistician-like researchers have become consumers of data sets, rather than producers. In an age of "big data" this seems to be changing, however.
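A tiny numerical illustration of the design-based idea described above, using the finite population correction from sampling theory (all numbers are invented):

```python
import math

# Invented survey: a sample of n = 400 from a known population of N = 2000
n, N = 400, 2000
sample_sd = 10.0

# The model-based standard error treats the population as effectively infinite
se_model = sample_sd / math.sqrt(n)

# The design-based standard error multiplies by the finite population correction,
# sqrt((N - n) / (N - 1)), which shrinks toward zero as n approaches N
fpc = math.sqrt((N - n) / (N - 1))
se_design = se_model * fpc

print(round(se_model, 3), round(se_design, 3))
```

Knowing that the sample covers a fifth of the population lets the design-based approach report a meaningfully smaller uncertainty than the infinite-population formula.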

Response rates for surveys have been dropping for years, causing frustration among statisticians and skepticism from the public. Having a lower response rate doesn't just mean your confidence intervals get wider. Given the nature of many surveys, it's possible (if not likely) that the probability a person responds to the survey may be related to one or a combination of relevant variables. If unaddressed, such non-response can damage an analysis. Addressing the problem drives up the cost of a survey, however.
Consider measuring unemployment. A person is considered unemployed if they don't have a job and they are looking for one. Somebody who loses their job may be less likely to respond to the unemployment survey for a variety of reasons. They may be embarrassed, they may move back home, they may have lost their house! But if the government sends a survey or interviewer and doesn't hear back, how will it know if the respondent is employed, unemployed (and looking), or off the job market completely? So, they have to find out. Time spent tracking a respondent down is expensive!
So, if you are collecting data that requires a response, you must consider who isn't responding and why. Many people anecdotally chalk this effect up to survey fatigue. Aren't we all tired of being bombarded by websites and emails asking us for "just a couple minutes" of our time? (Businesses that send a satisfaction survey every time a customer contacts customer service take note; you may be your own worst data-collection enemy.)
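One standard design-based remedy for the non-response problem described above is to weight each respondent by the inverse of an estimated response probability. A toy sketch, with all numbers invented:

```python
# Invented population: 90 employed, 10 unemployed (a true 10% unemployment rate).
# Suppose employed people respond 90% of the time but unemployed people only 50%,
# so the raw respondent pool understates unemployment.
respondents = {
    "employed":   {"count": 81, "response_rate": 0.9},  # 90 * 0.9 responded
    "unemployed": {"count": 5,  "response_rate": 0.5},  # 10 * 0.5 responded
}

# Naive estimate from respondents alone
naive_rate = respondents["unemployed"]["count"] / sum(
    g["count"] for g in respondents.values()
)

# Inverse-propensity weighting: each respondent stands in for
# 1 / response_rate people like them
weighted = {k: g["count"] / g["response_rate"] for k, g in respondents.items()}
weighted_rate = weighted["unemployed"] / sum(weighted.values())

print(round(naive_rate, 3))     # understates the true rate
print(round(weighted_rate, 3))  # recovers the 10% rate
```

The catch, of course, is that the response rates themselves must be estimated, which is part of what makes addressing non-response expensive.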

### In Practice: Political Polling in 2012 and Beyond

In the context of the above, Aaron Strauss's February 25th talk at DSDC was enlightening. Aaron's presentation was billed as covering "two things that people in [Washington D.C.] absolutely love. One of those things is political campaigns. The other thing is using data to estimate causal effects in subgroups of controlled experiments!" Woooooo! Controlled experiments! Causal effects! Subgroup analysis! Be still, my beating heart.
Aaron earned a PhD in political science from Princeton and has been involved in three of the last four presidential campaigns designing surveys, analyzing collected data, and providing actionable insights for the Democratic party. His blog is here. (For the record, I am strictly non-partisan and do not endorse anyone's politics though I will get in knife fights over statistical practices.)

In an hour-long presentation, Aaron laid a foundation for sampling and polling in the 21st century, revealing how political campaigns and businesses track and analyze our data, and what the future of surveying may be. The most profound insight I got was to see how the traditional practices of sampling statistics are being blended with 21st-century data collection methods, through apps and social media. Whether these changes will address the decline in response rates or only temporarily offset it remains to be seen. Some highlights:

• The number of households that have only wireless telephone service is reaching parity with the number having land line phone service. When considering only households with children (excluding older people with grown children and young adults without children) the number sits at 45 percent.
• Offering small savings on wireless bills may incentivize the taking of flash polls through smart phones.
• Reducing the marginal cost of surveys allows political pollsters to design randomized controlled trials, to evaluate the efficacy of different campaign messages on voting outcomes. (As with all things statistics, there are tradeoffs and confounding variables with such approaches.)

### Sampling Statistics and "Big Data"

I am not proposing that sampling statistics will become the new hottest thing. But I would not be surprised if sampling courses move from the esoteric fringes to being a core course in many or most statistics graduate programs in the coming decades. (And we know it may take over a hundred years for something to become the new hotness anyway.)

The professor who taught the sampling statistics course I took a few years ago is the chief of the Statistical Research Division at the U.S. Census Bureau. When I last saw him, at an alumni/prospective student mixer for Georgetown's math/stat program in 2013, he was wearing a button that said "ask me about big data." In a time when some think that statistics is an old-school discipline relevant only for small data, seeing this button on a man whose field, even within statistics, is considered so "old school" that most statisticians have moved on made me chuckle. But it also made me think: things may be coming full circle for sampling statistics.

A statistician's role in big data (my source for the R.A. Fisher quote, above)

# Weekly Round-Up: Data Science Education, Statistics, Data Driven Organizations, and Data Stories

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data education to data-driven organizations. In this week's round-up:

• Universities Offer Courses in a Hot New Field: Data Science
• Data Science: The End of Statistics?
• How Do You Create a Data-Driven Organization?
• Tell Better Data Stories with Motion and Interactivity

# Universities Offer Courses in a Hot New Field: Data Science

This is a NY Times article about how more and more schools are now offering degrees in data science. The article explains that the demand for these skills has been growing rapidly in the last few years and that schools are adapting their curriculum to the demands of the market. The author provides quotes from faculty at several of the universities mentioned in the article and also some details about the content of some of the programs at these schools.

# Data Science: The End of Statistics?

This was an interesting blog post posing questions about why statistics is sometimes left out of the data science hype. The author takes a shot at briefly proposing answers, but at the end solicits answers from the readers. The comments section of this post is excellent and well worth reading, with several folks with a wide range of experience chiming in to help answer the questions and shed some more light on this topic.

# How Do You Create a Data-Driven Organization?

This is an excellent blog post about how to create a data driven organization. The author just switched jobs to a company where he needs to overhaul how data is collected, stored, analyzed, and reported; and in this post he walks the reader through his thoughts on doing that and the steps he is taking to get all this done. The process includes information gathering and learning about the business, training, infrastructure, metrics, and reporting mediums. Each of these parts has sub-sections with comments and considerations.

# Tell Better Data Stories with Motion and Interactivity

This is a Harvard Business Review article about using motion and interactivity as tools when visualizing data over time. The article has several videos embedded in it that serve as examples and help further explain how these tools can be effective. At the end, the author provides three valuable takeaways when putting visualizations together yourself.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.