Building on the success of the 2014 event, Health Datapalooza is once again opening its doors to app developers who would like to demo their app in front of an audience of more than 2,000 key healthcare industry executives, venture capitalists, providers and more!
On December 11th, Prof. Regina Nuzzo of Gallaudet University spoke at Data Science DC about Problems with the p-value. The event was well received. If you missed it, the slides and audio are available. Here we provide Dr. Nuzzo's references and links from the talk, which are on their own a great resource for anyone thinking about how to communicate statistical reliability. (Note that the five topics she covered used examples from highly publicized studies of sexual behavior.)
This is a guest post by Vadim Y. Bichutskiy, a Lead Data Scientist at Echelon Insights, a Republican analytics firm. His background spans analytical/engineering positions in Silicon Valley, academia, and the US Government. He holds an MS/BS in Computer Science from the University of California, Irvine, and an MS in Statistics from California State University, East Bay, and is a PhD candidate in Data Sciences at George Mason University. Follow him on Twitter @vybstat.
Recently I got hold of Jared Lander's book R for Everyone. It is one of the best books on R that I have seen. I first started learning R in 2007, when I was a CS graduate student at UC Irvine. Bored with my research, I decided to venture into statistics and machine learning. I enrolled in several PhD-level statistics courses--the Statistics Department at UC Irvine is in the same school as the CS Department--where I was introduced to R. Coming from a C/C++/Java background, I found R different, exciting, and powerful.
Learning R is challenging because its documentation is scattered all over the place. There is no single comprehensive book that covers all the important use cases; to get the fundamentals, one has to consult multiple books as well as many online resources and tutorials. Jared has written an excellent book that covers the fundamentals (and more!). It is concise, well written, and easy to understand. The title R for Everyone is accurate: while the book is great for R novices, it is also quite useful for experienced R hackers. It truly lives up to its title.
Chapters 1-4 cover the basics: installation, RStudio, the R package system, and basic language constructs. Chapter 5 discusses fundamental data structures: data frames, lists, matrices, and arrays. Importing data into R is covered in Chapter 6: reading data from CSV files, Excel spreadsheets, relational databases, and other statistical packages such as SAS and SPSS. This chapter also illustrates saving objects to disk and scraping data from the Web. Statistical graphics is the subject of Chapter 7, including Hadley Wickham's irreplaceable ggplot2 package. Chapters 8-10 are about writing R functions, control structures, and loops. Altogether, Chapters 1-10 cover a lot of ground. But we're not even halfway through the book!
Chapters 11-12 introduce tools for data munging: base R's apply family of functions and aggregation, Hadley Wickham's packages plyr and reshape2, and various ways to do joins. A section on speeding up data frames with the indispensable data.table package is also included. Chapter 13 is all about working with string (character) data, including regular expressions and Hadley Wickham's stringr package. Important probability distributions are the subject of Chapter 14. Chapter 15 discusses basic descriptive and inferential statistics, including the t-test and the analysis of variance. Statistical modeling with linear and generalized linear models is the topic of Chapters 16-18; topics here also include survival analysis, cross-validation, and the bootstrap. The last part of the book covers hugely important topics. Chapter 19 discusses regularization and shrinkage, including lasso and ridge regression, their generalization the Elastic Net, and Bayesian shrinkage. Nonlinear and nonparametric methods are the focus of Chapter 20: nonlinear least squares, splines, generalized additive models, decision trees, and random forests. Chapter 21 covers time series analysis with autoregressive integrated moving average (ARIMA), vector autoregressive (VAR), and generalized autoregressive conditional heteroskedasticity (GARCH) models. Clustering is the topic of Chapter 22: K-means, partitioning around medoids (PAM), and hierarchical clustering.
The final two chapters cover topics that are often omitted from other books and resources, making the book especially useful to seasoned programmers. Chapter 23 is about creating reproducible reports and slide shows with Yihui Xie's knitr package, LaTeX, and Markdown. Developing R packages is the subject of Chapter 24.
A useful appendix on the R ecosystem puts icing on the cake with valuable resources including Meetups, conferences, Web sites and online documentation, other books, and folks to follow on Twitter.
Whether you are a beginner or an experienced R hacker looking to pick up new tricks, Jared's book will be good to have in your library. It covers a multitude of important topics, is concise and easy-to-read, and is as good as advertised.
Since our founding in 2012, Data Community DC has had a WordPress-based web site. We've had a great experience on WordPress in many ways, and have published posts that have seen tens of thousands of readers. But it's time to move on, and so Data Community DC now has a new web site, with a new look, hosted on Squarespace.
Here's what you need to know:
- All of the old content, including those blog posts, is still here.
- If you were subscribed to the blog via an RSS reader, this is important. You'll need to re-subscribe with our new RSS feed URL: http://www.datacommunitydc.org/blog/?format=rss
- We no longer need "/blog" in the URL. You can still type those extra five characters, but you'll get redirected to shorter, easier-to-type URLs.
- We have new big red buttons for new visitors, pointing them at the content that'll be most valuable for them, whether they want to Learn, Share, enhance their Career, or are representing Organizations.
- We've moved the Data Events DC calendar to a new site, which should be a bit faster. And this is important: If you subscribed to that calendar so that it showed up in your iCal or Google Calendar or Outlook (which is amazingly convenient), you'll need to do so again! Just go to the Calendar page, click on the Subscribe button at the bottom, and follow the instructions to add the new calendar. Then delete the old calendar, and you'll see all the new events as they're added.
Got any suggestions? See a problem? Want to help out with our web presence? Please get in touch!
Applications are now open for the Spring 2015 session of the District Data Labs incubator program and research lab.
The Incubator is a structured 3-month project development program where teams of people work on data projects together. Each team is assigned one project and team members build a data product together over the course of the 3 months. Team sizes are small (3-4 people per team) and are carefully assembled to contain a mix of quantitative and technical skills.
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP November Meetup!
This month's event features an overview of Latent Dirichlet Allocation and probabilistic topic modeling.
Topic models are a family of models that estimate the distribution of abstract concepts (topics) making up a collection of documents. Over the last several years, the popularity of topic modeling has swelled. One model in particular, Latent Dirichlet Allocation (LDA), is especially popular.
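To make the idea concrete, here is a minimal collapsed Gibbs sampler for LDA in pure Python. This is a toy sketch for intuition only (the four-document corpus and the hyperparameters are invented, and production libraries typically use faster inference methods than this):

```python
import random
from collections import defaultdict

random.seed(42)

# Toy corpus with two latent themes (pets vs. finance), invented for illustration.
docs = [
    "dog cat dog puppy".split(),
    "cat kitten dog cat".split(),
    "stock market stock trade".split(),
    "market trade stock fund".split(),
]
vocab = sorted({w for d in docs for w in d})
K, V = 2, len(vocab)
alpha, beta = 0.1, 0.01  # Dirichlet priors on doc-topic and topic-word mixtures

# Randomly assign an initial topic to every token, and build count tables.
z = [[random.randrange(K) for _ in doc] for doc in docs]
n_dk = [[0] * K for _ in docs]               # topic counts per document
n_kw = [defaultdict(int) for _ in range(K)]  # word counts per topic
n_k = [0] * K                                # total tokens per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

# Collapsed Gibbs sampling: resample each token's topic given all other tokens.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + beta * V)
                       for t in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

for t in range(K):
    top = sorted(n_kw[t], key=n_kw[t].get, reverse=True)[:3]
    print(f"topic {t}: {top}")
```

On a corpus this small the sampler tends to pull the pet words and the finance words into separate topics, which is exactly the "distribution of abstract concepts" the talk will cover at scale.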
This is a guest post by Catherine Madden (@catmule), a lifelong doodler who realized a few years ago that doodling, sketching, and visual facilitation can be immensely useful in a professional environment. The post consists of her notes from the most recent Data Visualization DC Meetup. Catherine works as the lead designer for the Analytics Visualization Studio at Deloitte Consulting, designing user experiences and visual interfaces for visual analytics prototypes. She prefers Paper by Fifty Three and their Pencil stylus for digital note taking. (Click on the image to open full size.)
This is a guest post by Shannon Turner, a software developer and founder of Hear Me Code, offering free, beginner-friendly coding classes for women in the DC area. In her spare time she creates projects like Shut That Down and serves as a mentor with Code for Progress.
Over 200 women were in attendance at the DC Fem Tech Tour de Code Kickoff party held at Google Thursday night. DC Fem Tech, a collective of over 25 women-in-tech organizations, collaborates to run events and support the women in DC's tech community.
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP October Meetup!
This month features an introduction to the art of automated query parsing, and a discussion of a WaPo app that automates some of the tedium of fact checking.
Tony Maull is a Senior Director of Enterprise at DataRPM. He will discuss the differences between computational search and content search, focusing on how computation can be relied upon to drive a consistently accurate answer even when a natural-language question can be asked any number of ways.
Four of DC2's board members have published a new book! Tony Ojeda, Sean Murphy, Benjamin Bengfort, and Abhijit Dasgupta are proud to announce the arrival of Practical Data Science Cookbook.
Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. Since the book is formatted to walk you through the projects with examples and explanations along the way, no prior programming experience is required.
Data Community DC and District Data Labs are hosting a full-day Social Network Analysis with Python workshop on Saturday November 22nd. For more info and to sign up, go to http://bit.ly/1lWFlLx. Register before October 31st for an early bird discount!
Social networks are not new, even though websites like Facebook and Twitter might make you want to believe they are; and trust me, I'm not talking about Myspace! Social networks are extremely interesting models of human behavior, whose study dates back to the early twentieth century. However, because of those websites, data scientists have access to much more data than the anthropologists who studied the networks of tribes!
Data Community DC and District Data Labs are excited to be hosting a Fast Data Applications with Spark & Python workshop on November 8th. For more info and to sign up, go to http://bit.ly/Zhj0y1. There’s even an early bird discount if you register before October 17th!
Hadoop has made the world of Big Data possible by providing a framework for distributed computing on economical, commercial off-the-shelf hardware. Hadoop 2.0 implements a distributed file system, HDFS, and a computing framework, YARN, that allows distributed applications to easily harness the power of clustered computing on extremely large data sets. Over the past decade, the primary application framework has been MapReduce - a functional programming paradigm that lends itself extremely well to designing distributed applications, but carries with it a lot of computational overhead.
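The MapReduce paradigm is easy to sketch outside Hadoop: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase folds each group into a result. Here is a minimal in-memory word-count analogue in Python (the function names are illustrative, not Hadoop APIs, and a real cluster distributes each phase across machines):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: fold each group of values into a single count."""
    return key, sum(values)

lines = ["big data big clusters", "big data"]
emitted = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(emitted).items())
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

The computational overhead mentioned above comes largely from that shuffle step: on a cluster, every intermediate pair must be serialized, sorted, and moved across the network between the two phases.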
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP September Meetup!
This month, we're joined by Kathy McCoy, Professor of Computer & Information Science and Linguistics at the University of Delaware. Kathy is also a consultant for the National Institute on Disability and Rehabilitation Research (NIDRR) at the U.S. Department of Education. Her research focuses on natural language generation and understanding, particularly for assistive technologies, and she'll be giving a presentation on Replicating Semantic Connections Made by Visual Readers for a Scanning System for Nonvisual Readers.
Data Community DC and District Data Labs are excited to be hosting a Natural Language Analysis with NLTK workshop on October 25th. For more info and to sign up, go to http://bit.ly/1pK0pFN. There’s even an early bird discount if you register before October 3rd!
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python and with the Natural Language Toolkit (NLTK).
Harlan Harris is the President and a co-founder of Data Community DC, and is a long-time fan of DataKind. Last week, DataKind, the nonprofit that connects pro-bono data and tech folks with nonprofits in need of data help, announced the first regional chapters, in the UK, Bangalore, Dublin, Singapore, San Francisco, and best of all (we think!), Washington, DC!
Our (lucky number) 13th DIDC meetup took place at the spacious offices of Endgame in Clarendon, VA. Endgame very graciously provided incredible gourmet pizza (and beer) for all those who attended.
Beyond the excellent food and beverages, attendees were treated to four separate and compelling talks. For those of you who could not attend, below is a little information about the talks and speakers (including contact information), along with the slides!
This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary.
A front-page article over the weekend in the Wall Street Journal indicated that the number-one profession of interest to tech firms is the data scientist: someone with the analytic, computing, and domain skills to detect signals in data and use them to advantage. Although the terms are squishy, the push today is for "big data" and "predictive analytics" skills, which allow firms to leverage the deluge of data that is now accessible.
I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.
The two traditional aspects of the scientific method, namely theory and experimentation, have been enhanced with computation being added as a third leg. Sitting at the center of computation is simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.
The questions that I will address here are how Monte Carlo simulation and discrete-event simulation differ and how they fit into the general framework of predictive analytics.
First, how do Monte Carlo and discrete-event simulation differ? Monte Carlo simulation is appropriate when the passage of time does not play a significant role. Probability calculations involving problems associated with playing cards, dice, and coins, for example, can be solved by Monte Carlo.
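As a small worked example, the probability that two fair dice sum to at least 10 is exactly 6/36 ≈ 0.167 (the outcomes summing to 10, 11, or 12). A Monte Carlo estimate repeats the random experiment many times and reports the relative frequency; this sketch uses an arbitrary trial count and seed:

```python
import random

random.seed(1)
trials = 100_000

# Monte Carlo: simulate the experiment repeatedly and estimate the
# probability as the fraction of trials in which the event occurred.
hits = sum(random.randint(1, 6) + random.randint(1, 6) >= 10
           for _ in range(trials))
estimate = hits / trials
print(f"estimated P(sum >= 10) = {estimate:.4f}  (exact: {6/36:.4f})")
```

Note that time plays no role here: each trial is an independent draw, which is the hallmark of a Monte Carlo (as opposed to discrete-event) problem.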
Discrete-event simulation, on the other hand, has the passage of time as an integral part of the model. The classic application areas of discrete-event simulation are queueing, inventory, and reliability. As an illustration, a mathematical model for a queue with a single server might consist of (a) a probability distribution for the time between arrivals to the queue, (b) a probability distribution for the service time at the queue, and (c) an algorithm for placing entities in the queue (first-come-first-served is the usual default). Discrete-event simulation can be coded in any algorithmic language, although the coding is tedious. Because of the complexities of coding a discrete-event simulation, dozens of languages have been developed to ease implementation of a model.
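The single-server queue described above can be sketched in a few lines of plain Python. This version assumes exponential interarrival and service times and a first-come-first-served discipline; the rates are arbitrary choices for illustration:

```python
import random

random.seed(2)
arrival_rate, service_rate = 1.0, 1.25   # server utilization = 1.0 / 1.25 = 0.8
n_customers = 10_000

clock = 0.0           # arrival time of the current customer
prev_departure = 0.0  # time at which the server last became free
total_wait = 0.0

for _ in range(n_customers):
    # (a) time between arrivals: exponential with rate arrival_rate
    clock += random.expovariate(arrival_rate)
    # (c) FCFS: service starts when both the customer and the server are ready
    start_service = max(clock, prev_departure)
    total_wait += start_service - clock
    # (b) service time: exponential with rate service_rate
    prev_departure = start_service + random.expovariate(service_rate)

avg_wait = total_wait / n_customers
print(f"average wait in queue: {avg_wait:.2f}")
```

For these rates, queueing theory gives an expected wait in queue of λ/(μ(μ−λ)) = 1/(1.25 × 0.25) = 3.2, which makes a handy sanity check on the simulation output, though any single run will wander around that value.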
The field of predictive analytics leans heavily on tools from data mining in order to identify patterns and trends in a data set. Once an appropriate question has been posed, these patterns and trends in explanatory variables (often called covariates) are used to predict future behavior of variables of interest. There is both an art and a science in predictive analytics. The science side includes the standard tools of mathematics, computation, probability, and statistics. The art side consists mainly of making appropriate assumptions about the mathematical model constructed for predicting future outcomes. Simulation is used primarily for verification and validation of the mathematical models associated with a predictive analytics model. It can be used to determine whether the probabilistic models are reasonable and appropriate for a particular problem.
Two sources for further training in simulation are a workshop in Catonsville, Maryland on September 12-13, taught by Barry Lawson (University of Richmond) and me, and the Winter Simulation Conference (December 7-10, 2014) in Savannah.
Data Community DC is pleased to announce a new service to the area data community: topic-specific discussion lists! In this way we hope to extend the successes of our Meetups and workshops by providing a way for groups of local people with similar interests to maintain contact and have ongoing discussions. In a previous post, we announced the formation of the Deep Learning Discussion List for those interested in Deep Learning topics. The second topic-specific discussion group has just been created, a collaboration between Charlie Greenbacker (@greenbacker) of the DC-NLP Meetup Group and Ben Bengfort (@bbengfort) of DIDC, both specialists in Natural Language Processing and Computational Linguistics.
If you're interested in Natural Language Processing and want to be part of the discussion, sign up here:
This discussion group is intended for computational linguists, data scientists, software engineers, students, faculty, and anyone interested in the automatic processing of natural language by a computer! NLP has received a big boost in recent years thanks to modern machine learning techniques, and has made tasks like automatic classification of language as well as information extraction part of our everyday lives. The next phase of NLP involves machine understanding and translation, text summarization and generation, as well as semantic reasoning across texts. These topics are at the forefront of science and should be discussed in a community of brilliant people, which is why we have created this group! From current events to interesting topics to questions and answers, please use this group as a platform to engage with your fellow data scientists on the topic of language processing!
We hope to see you on the group soon!