Dennis D. McDonald, Ph.D. Dennis is an independent management consultant based in Alexandria, Virginia. His experience includes consulting company ownership and management, database publishing and data transformation projects, managing the consolidation of large systems, open data, statistical research, corporate IT strategy, and IT cost analysis. Dennis recently attended one of our Meetups, "Get Moving with Data - The US Department of Transportation and its Data," and was kind enough to write a guest post for the Data Community DC blog. This article is originally published on

Author: Guest Author

Data Science DC, Data Innovation DC, and District Data Labs are hosting a Data Brunch and Project Pitchfest event on Saturday, April 4th, from 11am to 1pm at GWU's Funger Hall. Join us!

Saturday? Yup, we are changing things up and swapping out pizza and empanadas for bagels and other brunchy foods.

Eight teams of DC data scientists have come together for a three-month incubator to turn theory into practice on projects spanning healthcare, economics, the environment, and more. Learn from their experience implementing a deep learning network on commercially available hardware, deploying a D3.js visualization web app using Heroku, building a desktop GUI with Python, and much more! Enjoy brunch and drinks on us as we are taken from concept to production on eight data products, and then join the judges by voting for the winner!

Author: Tony Ojeda

On December 11th, Prof. Regina Nuzzo from Gallaudet University spoke at Data Science DC about Problems with the p-value. The event was well received. If you missed it, the slides and audio are available. Here we provide Dr. Nuzzo's references and links from the talk, which are on their own a great resource for those considering how to communicate about statistical reliability. (Note that the five topics she covered used examples from highly publicized studies of sexual behavior.)

Author: Harlan Harris

This is a guest post by Vadim Y. Bichutskiy, a Lead Data Scientist at Echelon Insights, a Republican analytics firm. His background spans analytical and engineering positions in Silicon Valley, academia, and the US Government. He holds an MS and a BS in Computer Science from the University of California, Irvine, and an MS in Statistics from California State University, East Bay, and is a PhD candidate in Data Sciences at George Mason University. Follow him on Twitter @vybstat.

Recently I got a hold of Jared Lander's book R for Everyone. It is one of the best books on R that I have seen. I first started learning R in 2007 when I was a CS graduate student at UC Irvine. Bored with my research, I decided to venture into statistics and machine learning. I enrolled in several PhD-level statistics courses--the Statistics Department at UC Irvine is in the same school as the CS Dept.--where I was introduced to R. Coming from a C/C++/Java background, R was different, exciting, and powerful.

Learning R is challenging because documentation is scattered all over the place. There is no comprehensive book that covers many important use cases. To get the fundamentals, one has to look at multiple books as well as many online resources and tutorials. Jared has written an excellent book that covers the fundamentals (and more!). It is easy to understand, concise, and well written. The title "R for Everyone" is accurate because, while the book is great for R novices, it is also quite useful for experienced R hackers.

Chapters 1-4 cover the basics: installation, RStudio, the R package system, and basic language constructs. Chapter 5 discusses fundamental data structures: data frames, lists, matrices, and arrays. Importing data into R is covered in Chapter 6: reading data from CSV files, Excel spreadsheets, relational databases, and from other statistical packages such as SAS and SPSS. This chapter also illustrates saving objects to disk and scraping data from the Web. Statistical graphics is the subject of Chapter 7 including Hadley Wickham's irreplaceable ggplot2 package. Chapters 8-10 are about writing R functions, control structures, and loops. Altogether Chapters 1-10 cover lots of ground. But we're not even halfway through the book!

Chapters 11-12 introduce tools for data munging: base R's apply family of functions and aggregation, Hadley Wickham's packages plyr and reshape2, and various ways to do joins. A section on speeding up data frames with the indispensable data.table package is also included. Chapter 13 is all about working with string (character) data, including regular expressions and Hadley Wickham's stringr package. Important probability distributions are the subject of Chapter 14. Chapter 15 discusses basic descriptive and inferential statistics, including the t-test and the analysis of variance. Statistical modeling with linear and generalized linear models is the topic of Chapters 16-18. Topics here also include survival analysis, cross-validation, and the bootstrap. The last part of the book covers hugely important topics. Chapter 19 discusses regularization and shrinkage, including Lasso and Ridge regression, their generalization the Elastic Net, and Bayesian shrinkage. Nonlinear and nonparametric methods are the focus of Chapter 20: nonlinear least squares, splines, generalized additive models, decision trees, and random forests. Chapter 21 covers time series analysis with autoregressive integrated moving average (ARIMA), vector autoregressive (VAR), and generalized autoregressive conditional heteroskedasticity (GARCH) models. Clustering is the topic of Chapter 22: K-means, partitioning around medoids (PAM), and hierarchical clustering.

The final two chapters cover topics that are often omitted from other books and resources, making the book especially useful to seasoned programmers. Chapter 23 is about creating reproducible reports and slide shows with Yihui Xie’s knitr package, LaTeX, and Markdown. Developing R packages is the subject of Chapter 24.

A useful appendix on the R ecosystem puts icing on the cake with valuable resources including Meetups, conferences, Web sites and online documentation, other books, and folks to follow on Twitter.

Whether you are a beginner or an experienced R hacker looking to pick up new tricks, Jared's book will be good to have in your library. It covers a multitude of important topics, is concise and easy to read, and is as good as advertised.

Since our founding in 2012, Data Community DC has had a WordPress-based web site. We've had a great experience on WordPress in many ways, and have published posts that have seen tens of thousands of readers. But it's time to move on, and so Data Community DC now has a new web site, with a new look, hosted on Squarespace.

Here's what you need to know:

  • All of the old content, including those blog posts, is still here.
  • If you were subscribed to the blog via an RSS reader, this is important. You'll need to re-subscribe with our new RSS feed URL:
  • We no longer need "/blog" in the URL. You can still type those extra five characters, but you'll get redirected to shorter, easier-to-type URLs.
  • We have new big red buttons for new visitors, pointing them at the content that'll be most valuable for them, whether they want to Learn, Share, enhance their Career, or are representing Organizations.
  • We've moved the Data Events DC calendar to a new site, which should be a bit faster. And this is important: If you subscribed to that calendar so that it showed up in your iCal or Google Calendar or Outlook (which is amazingly convenient), you'll need to do so again! Just go to the Calendar page, click on the Subscribe button at the bottom, and follow the instructions to add the new calendar. Then delete the old calendar, and you'll see all the new events as they're added.

Got any suggestions? See a problem? Want to help out with our web presence? Please get in touch!

Author: Harlan Harris

Applications are now open for the Spring 2015 session of the District Data Labs incubator program and research lab.

The Incubator is a structured 3-month project development program in which teams work on data projects together. Each team is assigned one project, and team members build a data product together over the course of the 3 months. Teams are small (3-4 people each) and are carefully assembled to contain a mix of quantitative and technical skills.

Author: Tony Ojeda

This is a guest post by Shannon Turner, a software developer and founder of Hear Me Code, which offers free, beginner-friendly coding classes for women in the DC area. In her spare time she creates projects like Shut That Down and serves as a mentor with Code for Progress.

More than 200 women attended the DC Fem Tech Tour de Code Kickoff party, held at Google on Thursday night. DC Fem Tech, a collective of more than 25 women-in-tech organizations, collaborates to run events and support the women in DC's tech community.