Eighty percent or more of the time spent on data science projects is spent acquiring data, cleaning it, and preparing it for analysis. That data can come from a variety of sources, including APIs or individual web pages. However, not all data is created equal. Once we have automated its acquisition, much of it requires lengthy cleaning and formatting before it can be used. In this course, you will learn how to obtain data via web scraping and APIs, how to clean and consolidate your data, and how to wrangle it into a database so that it is ready for analysis.
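As a rough illustration of the acquire, clean, and load steps described above, here is a minimal sketch using only the Python standard library. The JSON payload, the field names (`city`, `temp_f`), and the `readings` table are hypothetical stand-ins for whatever an API or scraped page would actually return; they are not taken from the course material.

```python
import json
import sqlite3

# Hypothetical payload standing in for an API response (assumed field names).
RAW = """[
  {"city": " Washington ", "temp_f": "71.5"},
  {"city": "baltimore", "temp_f": "n/a"},
  {"city": "Washington", "temp_f": "71.5"},
  {"city": "Richmond", "temp_f": "75"}
]"""


def clean(records):
    """Normalize city names, parse temperatures, drop bad or duplicate rows."""
    seen = set()
    cleaned = []
    for rec in records:
        city = str(rec.get("city", "")).strip().title()
        try:
            temp = float(rec.get("temp_f"))
        except (TypeError, ValueError):
            continue  # skip rows whose temperature cannot be parsed
        row = (city, temp)
        if not city or row in seen:
            continue  # skip empty names and exact duplicates
        seen.add(row)
        cleaned.append(row)
    return cleaned


def load(rows, conn):
    """Write the cleaned rows into a (hypothetical) readings table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (city TEXT, temp_f REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


rows = clean(json.loads(RAW))
conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
load(rows, conn)
stored = conn.execute(
    "SELECT city, temp_f FROM readings ORDER BY city"
).fetchall()
print(stored)
```

In a real project the `RAW` string would be replaced by a response fetched from an API or parsed out of a web page, and the cleaning rules would depend on the quirks of that source.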
Saturday? Yup, we are changing things up and swapping out pizza and empanadas for bagels and other brunchy foods.
Eight teams of DC data scientists have come together for a three-month incubator to turn theory into practice on projects spanning healthcare, economics, the environment, and more. Learn from their experience implementing a Deep Learning network on commercially available hardware, deploying a D3.js visualization web app using Heroku, or building a desktop GUI with Python... plus much more! Enjoy brunch and drinks on us as we are taken from concept to production on eight data products, and then join the judges by voting for the winner!
Data Community DC and District Data Labs are hosting a full-day Intro to Machine Learning with Python workshop on Saturday, February 28th. For more info and to sign up, go to http://bit.ly/1xQ9f4n. Register before February 13th for an early bird discount!
There have been three big changes at Data Community DC in recent days! Read more about changes to our Board and our line-up of Meetup groups.
Building off the success of the 2014 event, Health Datapalooza is once again opening its doors to app developers who would like to demo their app in front of an audience of more than 2,000 key healthcare industry executives, venture capitalists, providers and more!
On December 11th, Prof. Regina Nuzzo from Gallaudet University talked at Data Science DC about Problems with the p-value. The event was well-received. If you missed it, the slides and audio are available. Here we provide Dr. Nuzzo's references and links from the talk, which are on their own a great resource for those considering communication about statistical reliability. (Note that the five topics she covered used examples from highly publicized studies of sexual behavior.)
This is a guest post by Vadim Y. Bichutskiy, a Lead Data Scientist at Echelon Insights, a Republican analytics firm. His background spans analytical/engineering positions in Silicon Valley, academia, and the US Government. He holds an MS/BS in Computer Science from University of California, Irvine, and an MS in Statistics from California State University, East Bay, and is a PhD Candidate in Data Sciences at George Mason University. Follow him on Twitter @vybstat.
Recently I got a hold of Jared Lander's book R for Everyone. It is one of the best books on R that I have seen. I first started learning R in 2007 when I was a CS graduate student at UC Irvine. Bored with my research, I decided to venture into statistics and machine learning. I enrolled in several PhD-level statistics courses--the Statistics Department at UC Irvine is in the same school as the CS Dept.--where I was introduced to R. Coming from a C/C++/Java background, R was different, exciting, and powerful.
Learning R is challenging because documentation is scattered all over the place. There is no comprehensive book that covers many important use cases. To get the fundamentals, one has to look at multiple books as well as many online resources and tutorials. Jared has written an excellent book that covers the fundamentals (and more!). It is easy to understand, concise, and well-written. The title "R for Everyone" is accurate because, while the book is great for R novices, it is also quite useful for experienced R hackers.
Chapters 1-4 cover the basics: installation, RStudio, the R package system, and basic language constructs. Chapter 5 discusses fundamental data structures: data frames, lists, matrices, and arrays. Importing data into R is covered in Chapter 6: reading data from CSV files, Excel spreadsheets, relational databases, and from other statistical packages such as SAS and SPSS. This chapter also illustrates saving objects to disk and scraping data from the Web. Statistical graphics is the subject of Chapter 7 including Hadley Wickham's irreplaceable ggplot2 package. Chapters 8-10 are about writing R functions, control structures, and loops. Altogether Chapters 1-10 cover lots of ground. But we're not even halfway through the book!
Chapters 11-12 introduce tools for data munging: base R's apply family of functions and aggregation, Hadley Wickham's packages plyr and reshape2, and various ways to do joins. A section on speeding up data frames with the indispensable data.table package is also included. Chapter 13 is all about working with string (character) data including regular expressions and Hadley Wickham's stringr package. Important probability distributions are the subject of Chapter 14. Chapter 15 discusses basic descriptive and inferential statistics including the t-test and the analysis of variance. Statistical modeling with linear and generalized linear models is the topic of Chapters 16-18. Topics here also include survival analysis, cross-validation, and the bootstrap. The last part of the book covers hugely important topics. Chapter 19 discusses regularization and shrinkage including Lasso and Ridge regression, their generalization the Elastic Net, and Bayesian shrinkage. Nonlinear and nonparametric methods are the focus of Chapter 20: nonlinear least squares, splines, generalized additive models, decision trees, and random forests. Chapter 21 covers time series analysis with autoregressive integrated moving average (ARIMA), vector autoregressive (VAR), and generalized autoregressive conditional heteroskedasticity (GARCH) models. Clustering is the topic of Chapter 22: K-means, partitioning around medoids (PAM), and hierarchical clustering.
The final two chapters cover topics that are often omitted from other books and resources, making the book especially useful to seasoned programmers. Chapter 23 is about creating reproducible reports and slide shows with Yihui Xie’s knitr package, LaTeX and Markdown. Developing R packages is the subject of Chapter 24.
A useful appendix on the R ecosystem puts icing on the cake with valuable resources including Meetups, conferences, Web sites and online documentation, other books, and folks to follow on Twitter.
Whether you are a beginner or an experienced R hacker looking to pick up new tricks, Jared's book will be good to have in your library. It covers a multitude of important topics, is concise and easy-to-read, and is as good as advertised.
Since our founding in 2012, Data Community DC has had a WordPress-based web site. We've had a great experience on WordPress in many ways, and have published posts that have seen tens of thousands of readers. But it's time to move on, and so Data Community DC now has a new web site, with a new look, hosted on Squarespace.
Here's what you need to know:
- All of the old content, including those blog posts, is still here.
- If you were subscribed to the blog via an RSS reader, this is important. You'll need to re-subscribe with our new RSS feed URL: http://www.datacommunitydc.org/blog/?format=rss
- We no longer need "/blog" in the URL. You can still type those extra five characters, but you'll get redirected to shorter, easier-to-type URLs.
- We have new big red buttons for new visitors, pointing them at the content that'll be most valuable for them, whether they want to Learn, Share, enhance their Career, or are representing Organizations.
- We've moved the Data Events DC calendar to a new site, which should be a bit faster. And this is important: If you subscribed to that calendar so that it showed up in your iCal or Google Calendar or Outlook (which is amazingly convenient), you'll need to do so again! Just go to the Calendar page, click on the Subscribe button at the bottom, and follow the instructions to add the new calendar. Then delete the old calendar, and you'll see all the new events as they're added.
Got any suggestions? See a problem? Want to help out with our web presence? Please get in touch!
Applications are now open for the Spring 2015 session of the District Data Labs incubator program and research lab.
The Incubator is a structured 3-month project development program where teams of people work on data projects together. Each team is assigned one project and team members build a data product together over the course of the 3 months. Team sizes are small (3-4 people per team) and are carefully assembled to contain a mix of quantitative and technical skills.
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP November Meetup!
This month's event features an overview of Latent Dirichlet Allocation and probabilistic topic modeling.
Topic models are a family of models to estimate the distribution of abstract concepts (topics) that make up a collection of documents. Over the last several years, the popularity of topic modeling has swelled. One model, Latent Dirichlet Allocation (LDA), is especially popular.
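To make the idea of "a distribution of topics per document" concrete, the sketch below simulates the generative story that LDA assumes: each topic is a distribution over words, each document draws its own mixture of topics, and each word is produced by first picking a topic and then picking a word from it. This only generates synthetic documents; it does not fit a model to data. The toy vocabulary, the hyperparameter values, and all function names are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Pick an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_docs, doc_len, vocab, n_topics, alpha=0.5, beta=0.5, seed=0):
    """Simulate documents under LDA's assumed generative process."""
    rng = random.Random(seed)
    # Each topic is a probability distribution over the whole vocabulary.
    topics = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    mixtures, docs = [], []
    for _ in range(n_docs):
        # Each document gets its own mixture over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)      # choose a topic for this word
            w = sample_index(topics[z], rng)  # choose a word from that topic
            words.append(vocab[w])
        mixtures.append(theta)
        docs.append(words)
    return topics, mixtures, docs


# Toy vocabulary, assumed purely for illustration.
VOCAB = ["gene", "cell", "dna", "vote", "senate", "bill"]
topics, mixtures, docs = generate_corpus(
    n_docs=3, doc_len=8, vocab=VOCAB, n_topics=2
)
for theta, words in zip(mixtures, docs):
    print([round(p, 2) for p in theta], " ".join(words))
```

Fitting LDA runs this story in reverse: given only the documents, inference estimates the topics and per-document mixtures that most plausibly produced them.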
This is a guest post by Catherine Madden (@catmule), a lifelong doodler who realized a few years ago that doodling, sketching, and visual facilitation can be immensely useful in a professional environment. The post consists of her notes from the most recent Data Visualization DC Meetup. Catherine works as the lead designer for the Analytics Visualization Studio at Deloitte Consulting, designing user experiences and visual interfaces for visual analytics prototypes. She prefers Paper by Fifty Three and their Pencil stylus for digital note taking. (Click on the image to open full size.)
This is a guest post by Shannon Turner, a software developer and founder of Hear Me Code, offering free, beginner-friendly coding classes for women in the DC area. In her spare time she creates projects like Shut That Down and serves as a mentor with Code for Progress.
Over 200 women were in attendance for the DC Fem Tech Tour de Code Kickoff party held at Google Thursday night. DC Fem Tech, a collective of over 25 women in tech organizations, collaborates to run events and support the women in DC's tech community.
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP October Meetup!
This month features an introduction to the art of automated query parsing, and a discussion of a WaPo app that automates some of the tedium of fact checking.
Tony Maull is a Senior Director of Enterprise at DataRPM. He will discuss the differences between computational search and content search. His primary focus is how the computation can be relied upon when a natural language question can be asked any number of ways but still needs to drive a consistently accurate answer.
Four of DC2's board members have published a new book! Tony Ojeda, Sean Murphy, Benjamin Bengfort, and Abhijit Dasgupta are proud to announce the arrival of Practical Data Science Cookbook.
Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. Since the book is formatted to walk you through the projects with examples and explanations along the way, no prior programming experience is required.
Data Community DC and District Data Labs are hosting a full-day Social Network Analysis with Python workshop on Saturday, November 22nd. For more info and to sign up, go to http://bit.ly/1lWFlLx. Register before October 31st for an early bird discount!
Social networks are not new, even though websites like Facebook and Twitter might make you want to believe they are; and trust me, I’m not talking about Myspace! Social networks are extremely interesting models for human behavior, whose study dates back to the early twentieth century. However, because of those websites, data scientists have access to much more data than the anthropologists who studied the networks of tribes!
Data Community DC and District Data Labs are excited to be hosting a Fast Data Applications with Spark & Python workshop on November 8th. For more info and to sign up, go to http://bit.ly/Zhj0y1. There’s even an early bird discount if you register before October 17th!
Hadoop has made the world of Big Data possible by providing a framework for distributed computing on economical, commercial off-the-shelf hardware. Hadoop 2.0 implements a distributed file system, HDFS, and a computing framework, YARN, that allow distributed applications to easily harness the power of clustered computing on extremely large data sets. Over the past decade, the primary application framework has been MapReduce, a functional programming paradigm that lends itself extremely well to designing distributed applications but carries with it a lot of computational overhead.
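For readers who have not seen the MapReduce paradigm, the classic word-count example can be sketched in a single Python process. This is only an illustration of the map, shuffle, and reduce phases; it does not use Hadoop's API and nothing here is distributed. The function names are made up for this sketch.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: fold the values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run the three phases in sequence over a list of text records."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data on clusters"])
print(counts)
```

On a real cluster, the map and reduce calls run in parallel on different machines and the shuffle moves data across the network and through disk, which is a large part of the overhead mentioned above.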
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP September Meetup!
This month, we're joined by Kathy McCoy, Professor of Computer & Information Science and Linguistics at the University of Delaware. Kathy is also a consultant for the National Institute on Disability and Rehabilitation Research (NIDRR) at the U.S. Department of Education. Her research focuses on natural language generation and understanding, particularly for assistive technologies, and she'll be giving a presentation on Replicating Semantic Connections Made by Visual Readers for a Scanning System for Nonvisual Readers.
Data Community DC and District Data Labs are excited to be hosting a Natural Language Analysis with NLTK workshop on October 25th. For more info and to sign up, go to http://bit.ly/1pK0pFN. There’s even an early bird discount if you register before October 3rd!
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python and with the Natural Language Toolkit (NLTK).