Data Community DC is pleased to announce our State of Data Science Education event.

Its goal is to bring together educators, vocational programs and companies to discuss the state and future of data science.

With the rise of data science both schools and companies are rapidly building up their programs and departments, but their beliefs about what a data scientist is and what skills they should have vary considerably.

AuthorRobert Vesco

Last week, the Urban Institute hosted a discussion on the evolving landscape of data and the potential impact on social science. “Machine Learning in a Data-Driven World” covered a wide range of important issues around data science in academic research and their real policy applications. Above all else, one critical narrative emerged:

Data are changing, and we can use these data for social good, but only if we are willing to adapt to new tools and emerging methods.

AuthorGuest Author

This guest blog post on ROCs was spurred by a conversation in the Q&A at Data Science DC’s June 16th Meetup on “Predicting Topics and Sharing in Social Media”. John Kaufhold, managing partner of Deep Learning Analytics, asked Bill Rand, Assistant Professor of Marketing at University of Maryland, about ROCs and convex hulls. In the post, Dr. Kaufhold satirizes data science moments lost in Q&A, talks ROC curves, and discusses the value of error bars in visualizing data science results.

AuthorGuest Author

This guest post is written by David Masad. David is a PhD candidate at George Mason University’s Department of Computational Social Science, where he studies international conflict and cooperation using agent-based modeling, event data, and network analysis. You can follow him on Twitter at @badnetworker.

'SciPy' has a few different meanings. It is a particular Python package, which brings together fast, efficient implementations of many key functions and algorithms for scientific computation. It's also the label for the broader scientific Python stack, the set of libraries and tools that make Python an increasingly popular language for science and research. Finally, it's what everyone calls what's nominally the Scientific Computing with Python conference, which for the past few years has been held every summer in Austin, TX.

This year, it involved two days of intensive tutorials; three days of presentations, talks, and discussion sections; and two more days of informal coding 'sprints'. Though I've been using the scientific Python tools for several years now, this was my first time attending SciPy. I even got a little "1st SciPy" sticker to add to my conference badge. For about five days, I got to be a part of a great community, and experience more Python, brisket and Tex-Mex than I realized was possible.

Part of the fun of this particular conference was the opportunity to talk to researchers working in areas far afield from my own. For example, over lunch burritos I got to talk shop with someone working at a climate research firm, and discovered interesting overlaps between weather and political forecasting. An astronomy talk contained some great insights into building community consensus around a common software tool. And a talk that was officially about oceanography had some very important advice on choosing a color palette for data visualization. (If you followed the links above, you saw that all the SciPy talks are available online, thanks to Enthought's generous sponsorship -- it's not too late to see any SciPy talks that seem interesting to you.)

The SciPy attendees from the DC area were a good cross-section of the diverse scientific Python community in general. There were an epidemiologist and a geneticist, a few astronomers, a government researcher, a library technologist, and a couple of social scientists (myself included). (If there were any DC-area geo-scientists there, I didn't get a chance to meet them).

There was also plenty of material directly applicable to data science. Chris Wiggins, the chief data scientist at the New York Times, gave the first day's keynote, with plenty of good insight into bringing data science into a large, established organization. Chris Fonnesbeck gave an engaging talk on the importance of statistical rigor in data science. Quite a few of the presentations introduced tools that data scientists can install and use right now. These include Dask, for out-of-core computation; xray, an exciting new library for multidimensional data; and two talks on using Docker for reproducible  research. There was a whole track devoted to visualization, including a talk on VisPy, a GPU-accelerated visualization library, that gave the conference one of its big 'Wow!' moments. And the future of Jupyter (still better known as the IPython Notebook) was announced in a 5-minute lightning talk, between demos of bad-idea ways to call Assembly directly from Python and Notebook flight sim widgets (which Erik Tollerud immediately dubbed 'planes on a snake').

Not only did I get to learn a lot from other people's research and tools, I got to present my own. Jackie Kazil and I unveiled Mesa, an agent-based modeling framework we've been building with other contributors. The sprint schedule after the conference proper gave us a chance to work with new collaborators who discovered the package at our talk the day before. A couple extra heads, and a couple days of extra work, mean that Mesa came out of SciPy noticeably better than when it came in. Quite a few other tools came out of the sprints with some improvements, including ones at the core of the scientific Python stack. Getting to work beside (and, later, drink beer with) such experienced developers was an educational opportunity in itself.

SciPy isn't just for physical scientists or hardcore programmers. If you use Python to analyze data or build models, or think you might want to, you should absolutely consider SciPy next year. The tutorials at the beginning can help novices to experts learn something new, and the sprints provide an opportunity to gain experience putting that knowledge to work. In between, the conference provides a great opportunity to gain exposure to a great variety of Python tools, and the community that builds, maintains and uses them. And even if you can't attend -- the videos of the talks are always online.

AuthorGuest Author

Dennis D. McDonald, Ph.D. Dennis is an independent management consultant based in Alexandria, Virginia. His experience includes consulting company ownership and management, database publishing and data transformation projects, managing the consolidation of large systems, open data, statistical research, corporate IT strategy, and IT cost analysis. Dennis recently attended one of our Meetups, "Get Moving with Data - The US Department of Transportation and its Data," and was kind enough to write a guest post for the Data Community DC blog. This article is originally published on

AuthorGuest Author

Data Science DCData Innovation DC, and District Data Labs are hosting a Data Brunch and Project Pitchfest event on Saturday April 4th from 11am - 1pm at GWU's Funger HallJoin us!

Saturday? Yup, we are changing things up and swapping out pizza and empanadas for bagels and other brunchy foods.

Eight teams of DC data scientists have come together for a three month incubator to turn theory into practice on projects spanning healthcare, economics, the environment, and more. Learn from their experience implementing a Deep Learning network on commercially available hardware, on deploying a D3.js visualization web app using Heroku, or on building a desktop GUI with Python... plus much more! Enjoy brunch and drinks on us as we are taken from concept to production on eight data products, and then join the judges by voting for the winner!

AuthorTony Ojeda

On December 11th, Prof. Regina Nuzzo from Galludet University talked at Data Science DC, about Problems with the p-valueThe event was well-received. If you missed it, the slides and audio are available. Here we provide Dr. Nuzzo's references and links from the talk, which are on their own a great resource for those considering communication about statistical reliability. (Note that the five topics she covered used examples from highly-publicized studies of sexual behavior.)

AuthorHarlan Harris

This is a guest post by Vadim Y. Bichutskiy, a Lead Data Scientist at Echelon Insights, a Republican analytics firm. His background spans analytical/engineering positions in Silicon Valley, academia, and the US Government. He holds MS/BS Computer Science from University of California, Irvine, MS Statistics from California State University, East Bay, and is a PhD Candidate in Data Sciences at George Mason University. Follow him on Twitter @vybstat.

Recently I got a hold of Jared Lander's book R for Everyone. It is one of the best books on R that I have seen. I first started learning R in 2007 when I was a CS graduate student at UC Irvine. Bored with my research, I decided to venture into statistics and machine learning. I enrolled in several PhD-level statistics courses--the Statistics Department at UC Irvine is in the same school as the CS Dept.--where I was introduced to R. Coming from a C/C++/Java background, R was different, exciting, and powerful.

Learning R is challenging because documentation is scattered all over the place. There is no comprehensive book that covers many important use cases. To get the fundamentals, one has to look at multiple books as well as many online resources and tutorials. Jared has written an excellent book that covers the fundamentals (and more!). It is easy-to-understand, concise and well-written. The title "R for everyone" is accurate because, while it is great for R novices, it is also quite useful for experienced R hackers. It truly lives up to its title.

Chapters 1-4 cover the basics: installation, RStudio, the R package system, and basic language constructs. Chapter 5 discusses fundamental data structures: data frames, lists, matrices, and arrays. Importing data into R is covered in Chapter 6: reading data from CSV files, Excel spreadsheets, relational databases, and from other statistical packages such as SAS and SPSS. This chapter also illustrates saving objects to disk and scraping data from the Web. Statistical graphics is the subject of Chapter 7 including Hadley Wickham's irreplaceable ggplot2 package. Chapters 8-10 are about writing R functions, control structures, and loops. Altogether Chapters 1-10 cover lots of ground. But we're not even halfway through the book!

Chapters 11-12 introduce tools for data munging: base R's apply family of functions and aggregation, Hadley Wickham's packages plyr and reshape2, and various ways to do joins. A section on speeding up data frames with the indispensable data.table package is also included. Chapter 13 is all about working with string (character) data including regular expressions and Hadley Wickham's stringr package. Important probability distributions are the subject of Chapter 14. Chapter 15 discusses basic descriptive and inferential statistics including the t-test and the analysis of variance. Statistical modeling with linear and generalized linear models is the topic of Chapters 16-18. Topics here also include survival analysis, cross-validation, and the bootstrap. The last part of the book covers hugely important topics. Chapter 19 discusses regularization and shrinkage including Lasso and Ridge regression, their generalization the Elastic Net, and Bayesian shrinkage. Nonlinear and nonparametric methods are the focus of Chapter 20: nonlinear least squares, splines, generalized additive models, decision trees, and random forests. Chapter 21 covers time series analysis with autoregressive moving average (ARIMA), vector autoregressive (VAR), and generalized autoregressive conditional heteroskedasticity (GARCH) models. Clustering is the the topic of Chapter 22: K-means, partitioning around medoids (PAM), and hierarchical.

The final two chapters cover topics that are often omitted from other books and resources, making the book especially useful to seasoned programmers. Chapter 23 is about creating reproducible reports and slide shows with the Yihui Xie’s knitr package, LaTeX and Markdown. Developing R packages is the subject of Chapter 24.

A useful appendix on the R ecosystem puts icing on the cake with valuable resources including Meetups, conferences, Web sites and online documentation, other books, and folks to follow on Twitter.

Whether you are a beginner or an experienced R hacker looking to pick up new tricks, Jared's book will be good to have in your library. It covers a multitude of important topics, is concise and easy-to-read, and is as good as advertised.

Since our founding in 2012, Data Community DC has had a WordPress-based web site. We've had a great experience on WordPress in many ways, and have published posts that have seen tens of thousands of readers. But it's time to move on, and so Data Community DC now has a new web site, with a new look, hosted on Squarespace.

Here's what you need to know:

  • All of the old content, including those blog posts, is still here.
  • If you were subscribed to the blog via an RSS reader, this is important. You'll need to re-subscribe with our new RSS feed URL:
  • We no longer need "/blog" in the URL. You can still type those extra five characters, but you'll get redirected to shorter, easier-to-type URLs.
  • We have new big red buttons for new visitors, pointing them at the content that'll be most valuable for them, whether they want to Learn, Share, enhance their Career, or are representing Organizations.
  • We've moved the Data Events DC calendar to a new site, which should be a bit faster. And this is important: If you subscribed to that calendar so that it showed up in your iCal or Google Calendar or Outlook (which is amazingly convenient), you'll need to do so again! Just go to the Calendar page, click on the Subscribe button at the bottom, and follow the instructions to add the new calendar. Then delete the old calendar, and you'll see all the new events as they're added.

Got any suggestions? See a problem? Want to help out with our web presence? Please get in touch!

AuthorHarlan Harris