On December 11th, Prof. Regina Nuzzo of Gallaudet University spoke at Data Science DC about problems with the p-value. The event was well received. If you missed it, the slides and audio are available. Here we provide Dr. Nuzzo's references and links from the talk, which are on their own a great resource for anyone thinking about how to communicate statistical reliability. (Note that the five topics she covered used examples from highly publicized studies of sexual behavior.)
Data Community DC and District Data Labs are hosting a full-day Analyzing Social Media Data with R workshop on Saturday January 24th. For more info and to sign up, go to http://bit.ly/1FJjFIz. Register before January 9th for an early bird discount!
This beginner-to-intermediate level course will introduce you to various topics in social media analysis. Attendees will begin by learning the social-scientific groundwork for understanding measurement and inference, and will then see, through concrete examples using R, how to explore, measure, visualize, and communicate various aspects of social media. The course will be taught using parts of the instructors’ book, Social Media Mining with R, but will include important additions such as topic modeling and social network analysis.
What You Will Learn
Conceptually, students will learn about the potentials and pitfalls surrounding the measurement of socially generated data.
Concretely, students will learn how to scrape, parse, and visualize textual and other classes of social data via both static and hands-on examples. The course will then move into the analysis of social data, including the areas of topic modeling, sentiment analysis, and social network analysis. Within social network analysis, students will learn the basics of social networks, how to calculate individual and network statistics using R, how to visualize networks in R, and advanced uses of social network analysis. Students will come away with hands-on experience, and R code that can be built out to serve their future needs.
The workshop will cover the following topics:
- Introduction to Social Science Data
- Principles of measurement and inference for social media data
- Mining Text Data: Twitter
- Basic text mining
- Topic modeling
- Sentiment analysis (dictionary and supervised learning)
- Sentiment analysis (unsupervised learning using IRT)
- Introduction to Social Network Analysis
- Individual and network measures
- Network Visualization Techniques
- Advanced uses for Network Analysis
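The dictionary approach to sentiment analysis listed above is simple enough to sketch in a few lines. The workshop itself uses R, but here is a minimal pure-Python illustration; the word lists are invented for the example, and real lexicons are far larger:

```python
# Dictionary-based sentiment: count positive and negative words and
# report the net score. The lexicons below are toy examples.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment_score(text):
    tokens = text.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return pos - neg

print(sentiment_score("great talk I love this"))    # positive score
print(sentiment_score("terrible venue bad pizza"))  # negative score
```

The supervised and unsupervised variants on the list replace the fixed word lists with weights learned from labeled or unlabeled data.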
After taking this course, students will be aware of the possibilities and limitations inherent in social media data. They will be able to obtain social data at medium scales and analyze it with R and its various add-on packages.
Nathan Danneman is Fellow and Data Scientist for L-3 Data Tactics, where he has worked on sentiment analysis, geospatial anomaly detection, and behavioral cyber analytics. He holds a PhD in political science, and has an extensive background in causal modeling and game theory. During his graduate work, Nathan taught introductory statistics courses, and served as an on-campus consultant for quantitative theses and dissertations. Nathan has created novel sentiment analysis applications, and is a co-author of Social Media Mining with R.
Richard Heimann is Fellow and Chief Data Scientist at L-3 NSS. He also is an EMC Certified Data Scientist with concentrations in spatial statistics, data mining, and pattern discovery. He is an Adjunct Faculty at the University of Maryland Baltimore County, where he teaches Spatial Analysis and Statistical Reasoning, and an Instructor at George Mason University, where he teaches Human Terrain Analysis. Mr. Heimann recently provided data science support to customers, including DARPA, Department of Homeland Security, US Army, Counter-IED Operational Integration Center, and the Pentagon. Richard is a co-author of Social Media Mining with R.
This is a guest post by Vadim Y. Bichutskiy, a Lead Data Scientist at Echelon Insights, a Republican analytics firm. His background spans analytical/engineering positions in Silicon Valley, academia, and the US Government. He holds MS/BS Computer Science from University of California, Irvine, MS Statistics from California State University, East Bay, and is a PhD Candidate in Data Sciences at George Mason University. Follow him on Twitter @vybstat.
Recently I got a hold of Jared Lander's book R for Everyone. It is one of the best books on R that I have seen. I first started learning R in 2007 when I was a CS graduate student at UC Irvine. Bored with my research, I decided to venture into statistics and machine learning. I enrolled in several PhD-level statistics courses--the Statistics Department at UC Irvine is in the same school as the CS Dept.--where I was introduced to R. Coming from a C/C++/Java background, R was different, exciting, and powerful.
Learning R is challenging because documentation is scattered all over the place. There is no comprehensive book that covers many important use cases. To get the fundamentals, one has to look at multiple books as well as many online resources and tutorials. Jared has written an excellent book that covers the fundamentals (and more!). It is easy to understand, concise, and well-written. The title "R for Everyone" is accurate: while the book is great for R novices, it is also quite useful for experienced R hackers. It truly lives up to its title.
Chapters 1-4 cover the basics: installation, RStudio, the R package system, and basic language constructs. Chapter 5 discusses fundamental data structures: data frames, lists, matrices, and arrays. Importing data into R is covered in Chapter 6: reading data from CSV files, Excel spreadsheets, relational databases, and from other statistical packages such as SAS and SPSS. This chapter also illustrates saving objects to disk and scraping data from the Web. Statistical graphics is the subject of Chapter 7, including Hadley Wickham's irreplaceable ggplot2 package. Chapters 8-10 are about writing R functions, control structures, and loops. Altogether Chapters 1-10 cover lots of ground. But we're not even halfway through the book!
Chapters 11-12 introduce tools for data munging: base R's apply family of functions and aggregation, Hadley Wickham's packages plyr and reshape2, and various ways to do joins. A section on speeding up data frames with the indispensable data.table package is also included. Chapter 13 is all about working with string (character) data, including regular expressions and Hadley Wickham's stringr package. Important probability distributions are the subject of Chapter 14. Chapter 15 discusses basic descriptive and inferential statistics, including the t-test and the analysis of variance. Statistical modeling with linear and generalized linear models is the topic of Chapters 16-18. Topics here also include survival analysis, cross-validation, and the bootstrap. The last part of the book covers hugely important topics. Chapter 19 discusses regularization and shrinkage, including Lasso and Ridge regression, their generalization the Elastic Net, and Bayesian shrinkage. Nonlinear and nonparametric methods are the focus of Chapter 20: nonlinear least squares, splines, generalized additive models, decision trees, and random forests. Chapter 21 covers time series analysis with autoregressive integrated moving average (ARIMA), vector autoregressive (VAR), and generalized autoregressive conditional heteroskedasticity (GARCH) models. Clustering is the topic of Chapter 22: K-means, partitioning around medoids (PAM), and hierarchical clustering.
The final two chapters cover topics that are often omitted from other books and resources, making the book especially useful to seasoned programmers. Chapter 23 is about creating reproducible reports and slide shows with the Yihui Xie’s knitr package, LaTeX and Markdown. Developing R packages is the subject of Chapter 24.
A useful appendix on the R ecosystem puts icing on the cake with valuable resources including Meetups, conferences, Web sites and online documentation, other books, and folks to follow on Twitter.
Whether you are a beginner or an experienced R hacker looking to pick up new tricks, Jared's book will be good to have in your library. It covers a multitude of important topics, is concise and easy-to-read, and is as good as advertised.
Since our founding in 2012, Data Community DC has had a WordPress-based web site. We've had a great experience on WordPress in many ways, and have published posts that have seen tens of thousands of readers. But it's time to move on, and so Data Community DC now has a new web site, with a new look, hosted on Squarespace.
Here's what you need to know:
- All of the old content, including those blog posts, is still here.
- If you were subscribed to the blog via an RSS reader, this is important. You'll need to re-subscribe with our new RSS feed URL: http://www.datacommunitydc.org/blog/?format=rss
- We no longer need "/blog" in the URL. You can still type those extra five characters, but you'll get redirected to shorter, easier-to-type URLs.
- We have new big red buttons for new visitors, pointing them at the content that'll be most valuable for them, whether they want to Learn, Share, enhance their Career, or are representing Organizations.
- We've moved the Data Events DC calendar to a new site, which should be a bit faster. And this is important: If you subscribed to that calendar so that it showed up in your iCal or Google Calendar or Outlook (which is amazingly convenient), you'll need to do so again! Just go to the Calendar page, click on the Subscribe button at the bottom, and follow the instructions to add the new calendar. Then delete the old calendar, and you'll see all the new events as they're added.
Got any suggestions? See a problem? Want to help out with our web presence? Please get in touch!
Applications are now open for the Spring 2015 session of the District Data Labs incubator program and research lab.
The Incubator is a structured 3-month project development program where teams of people work on data projects together. Each team is assigned one project and team members build a data product together over the course of the 3 months. Team sizes are small (3-4 people per team) and are carefully assembled to contain a mix of quantitative and technical skills.
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP November Meetup!
This month's event features an overview of Latent Dirichlet Allocation and probabilistic topic modeling.
Topic models are a family of models to estimate the distribution of abstract concepts (topics) that make up a collection of documents. Over the last several years, the popularity of topic modeling has swelled. One model, Latent Dirichlet Allocation (LDA), is especially popular.
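LDA's generative story is easy to sketch: every document is a mixture of topics, and every topic is a distribution over words. Here is a toy illustration in Python of that story running forward; the topics, words, and probabilities are invented for the example, and real LDA infers them from a corpus rather than assuming them:

```python
import random

# LDA's generative story in miniature: for each word slot, draw a topic
# from the document's topic mixture, then draw a word from that topic's
# word distribution. (All distributions below are made up.)
topics = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "politics": {"vote": 0.4, "law": 0.4, "senate": 0.2},
}

def generate_document(doc_topic_weights, n_words, rng):
    topic_names = list(doc_topic_weights)
    weights = [doc_topic_weights[t] for t in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=weights)[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=word_dist.values())[0]
        words.append(word)
    return words

rng = random.Random(42)
doc = generate_document({"sports": 0.7, "politics": 0.3}, 10, rng)
print(doc)  # mostly sports words, some politics words
```

Inference inverts this process: given only the documents, LDA estimates which topic mixtures and word distributions most plausibly generated them.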
This is a guest post by Catherine Madden (@catmule), a lifelong doodler who realized a few years ago that doodling, sketching, and visual facilitation can be immensely useful in a professional environment. The post consists of her notes from the most recent Data Visualization DC Meetup. Catherine works as the lead designer for the Analytics Visualization Studio at Deloitte Consulting, designing user experiences and visual interfaces for visual analytics prototypes. She prefers Paper by Fifty Three and their Pencil stylus for digital note taking. (Click on the image to open full size.)
This is a guest post by Shannon Turner, a software developer and founder of Hear Me Code, offering free, beginner-friendly coding classes for women in the DC area. In her spare time she creates projects like Shut That Down and serves as a mentor with Code for Progress.
Over 200 women were in attendance for the DC Fem Tech Tour de Code Kickoff party held at Google Thursday night. DC Fem Tech, a collective of over 25 women in tech organizations, collaborates to run events and support the women in DC's tech community.
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP October Meetup!
This month features an introduction to the art of automated query parsing, and a discussion of a WaPo app that automates some of the tedium of fact checking.
Tony Maull is a Senior Director of Enterprise at DataRPM. He will discuss the differences between computational search and content search, focusing on how computational search can be relied upon to drive a consistently accurate answer even when a natural language question can be asked any number of ways.
Four of DC2's board members have published a new book! Tony Ojeda, Sean Murphy, Benjamin Bengfort, and Abhijit Dasgupta are proud to announce the arrival of Practical Data Science Cookbook.
Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. Since the book is formatted to walk you through the projects with examples and explanations along the way, no prior programming experience is required.
Data Community DC and District Data Labs are hosting a full-day Social Network Analysis with Python workshop on Saturday November 22nd. For more info and to sign up, go to http://bit.ly/1lWFlLx. Register before October 31st for an early bird discount!
Social networks are not new, even though websites like Facebook and Twitter might make you want to believe they are; and trust me, I’m not talking about Myspace! Social networks are extremely interesting models for human behavior, whose study dates back to the early twentieth century. However, because of those websites, data scientists have access to much more data than the anthropologists who studied the networks of tribes!
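One of the simplest measures covered in social network analysis is degree centrality: the fraction of other nodes a node is directly tied to. It needs no special library, as this pure-Python sketch with made-up friendship data shows:

```python
from collections import defaultdict

# Degree centrality from a raw edge list: for each node, the number of
# neighbors divided by the number of other nodes. (Toy friendship data.)
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]

def degree_centrality(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

print(degree_centrality(edges))
# cat is tied to all three others, so its centrality is 1.0
```

The workshop goes well beyond this, but richer measures like betweenness and eigenvector centrality follow the same pattern of turning raw ties into per-node statistics.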
Data Community DC and District Data Labs are excited to be hosting a Fast Data Applications with Spark & Python workshop on November 8th. For more info and to sign up, go to http://bit.ly/Zhj0y1. There’s even an early bird discount if you register before October 17th!
Hadoop has made the world of Big Data possible by providing a framework for distributed computing on economical, commercial off-the-shelf hardware. Hadoop 2.0 implements a distributed file system, HDFS, and a cluster resource management framework, YARN, that allows distributed applications to easily harness the power of clustered computing on extremely large data sets. Over the past decade, the primary application framework has been MapReduce - a functional programming paradigm that lends itself extremely well to designing distributed applications, but carries with it a lot of computational overhead.
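The MapReduce paradigm itself fits in a few lines of Python: map each record to key-value pairs, group ("shuffle") the pairs by key, then reduce each group independently. This toy word count only illustrates the idea; Hadoop's value is running the same pattern across a cluster:

```python
from collections import defaultdict
from itertools import chain

# Map: turn each input line into (word, 1) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffle: group all values by key, as the framework would across nodes.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine each key's values independently (here, by summing).
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "data data"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, lines))))
print(counts)  # {'big': 2, 'data': 3, 'cluster': 1}
```

Because each map call and each reduce call is independent, the framework can distribute them freely, which is exactly the property Spark exploits while avoiding much of MapReduce's per-step overhead.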
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP September Meetup!
This month, we're joined by Kathy McCoy, Professor of Computer & Information Science and Linguistics at the University of Delaware. Kathy is also a consultant for the National Institute on Disability and Rehabilitation Research (NIDRR) at the U.S. Department of Education. Her research focuses on natural language generation and understanding, particularly for assistive technologies, and she'll be giving a presentation on Replicating Semantic Connections Made by Visual Readers for a Scanning System for Nonvisual Readers.
Data Community DC and District Data Labs are excited to be hosting a Natural Language Analysis with NLTK workshop on October 25th. For more info and to sign up, go to http://bit.ly/1pK0pFN. There’s even an early bird discount if you register before October 3rd!
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python and with the Natural Language Toolkit (NLTK).
Harlan Harris is the President and a co-founder of Data Community DC, and is a long-time fan of DataKind. Last week, DataKind, the nonprofit that connects pro-bono data and tech folks with nonprofits in need of data help, announced the first regional chapters, in the UK, Bangalore, Dublin, Singapore, San Francisco, and best of all (we think!), Washington, DC!
Our (lucky number) 13th DIDC meetup took place at the spacious offices of Endgame in Clarendon, VA. Endgame very graciously provided incredible gourmet pizza (and beer) for all those who attended.
Beyond such excellent beverages and food, attendees were treated to four separate and compelling talks. For those of you who could not attend, a little information about the talks and speakers is below (as well as contact information) and the slides!
This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary.
A front-page article over the weekend in the Wall Street Journal indicated that the number one profession of interest to tech firms is the data scientist: someone who combines analytic skills, computing skills, and domain skills to detect signals in data and use them to advantage. Although the terms are squishy, the push today is for "big data" skills and "predictive analytics" skills, which allow firms to leverage the deluge of data that is now accessible.
I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.
The two traditional aspects of the scientific method, namely theory and experimentation, have been enhanced with computation being added as a third leg. Sitting at the center of computation is simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.
The questions that I will address here are how Monte Carlo simulation and discrete-event simulation differ and how they fit into the general framework of predictive analytics.
First, how do Monte Carlo and discrete-event simulation differ? Monte Carlo simulation is appropriate when the passage of time does not play a significant role. Probability calculations involving problems associated with playing cards, dice, and coins, for example, can be solved by Monte Carlo.
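A dice problem makes the Monte Carlo idea concrete: repeat the random experiment many times and take the observed frequency as the probability estimate. This short Python sketch estimates the chance that two dice sum to seven (the exact answer is 1/6):

```python
import random

# Monte Carlo estimate of P(two fair dice sum to 7).
# Each trial rolls two dice; the estimate is the fraction of hits.
def estimate_seven(trials, rng):
    hits = sum(1 for _ in range(trials)
               if rng.randint(1, 6) + rng.randint(1, 6) == 7)
    return hits / trials

rng = random.Random(2014)
print(estimate_seven(100_000, rng))  # close to 1/6 ≈ 0.1667
```

Note that time plays no role here: the trials are independent and could be run in any order, which is the hallmark of a Monte Carlo problem.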
Discrete-event simulation, on the other hand, has the passage of time as an integral part of the model. The classic application areas in which discrete-event simulation has been applied are queuing, inventory, and reliability. As an illustration, a mathematical model for a queue with a single server might consist of (a) a probability distribution for the time between arrivals to the queue, (b) a probability distribution for the service time at the queue, and (c) an algorithm for placing entities in the queue (first-come, first-served is the usual default). Discrete-event simulation can be coded into any algorithmic language, although the coding is tedious. Because of the complexities of coding a discrete-event simulation, dozens of languages have been developed to ease implementation of a model.
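The single-server queue just described can be simulated in a general-purpose language in a few lines. This Python sketch uses exponential interarrival and service times with a first-come, first-served discipline; it is a minimal illustration of the idea, not a substitute for a specialized simulation language:

```python
import random

# Single-server queue: time advances from customer to customer.
# Each customer arrives, waits if the server is busy, then is served.
def mm1_average_wait(n_customers, arrival_rate, service_rate, rng):
    arrival = 0.0         # clock time of the current arrival
    server_free_at = 0.0  # clock time when the server next becomes idle
    total_wait = 0.0
    for _ in range(n_customers):
        arrival += rng.expovariate(arrival_rate)
        start = max(arrival, server_free_at)  # wait if the server is busy
        total_wait += start - arrival
        server_free_at = start + rng.expovariate(service_rate)
    return total_wait / n_customers

rng = random.Random(7)
# With arrival rate 1 and service rate 2, queueing theory gives a
# mean wait in queue of 0.5, so the estimate should land nearby.
print(mm1_average_wait(50_000, 1.0, 2.0, rng))
```

Unlike the dice example, the clock variables are essential here: each customer's wait depends on when earlier customers arrived and finished, which is exactly what makes this discrete-event rather than Monte Carlo.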
The field of predictive analytics leans heavily on the tools from data mining in order to identify patterns and trends in a data set. Once an appropriate question has been posed, these patterns and trends in explanatory variables (often called covariates) are used to predict future behavior of variables of interest. There is both an art and a science in predictive analytics. The science side includes the standard tools of mathematics, computation, probability, and statistics. The art side consists mainly of making appropriate assumptions about the mathematical model constructed for predicting future outcomes. Simulation is used primarily for verification and validation of the mathematical models associated with a predictive analytics model. It can be used to determine whether the probabilistic models are reasonable and appropriate for a particular problem.
Two sources for further training in simulation are a workshop in Catonsville, Maryland on September 12-13, taught by Barry Lawson (University of Richmond) and me, and the Winter Simulation Conference (December 7-10, 2014) in Savannah.
Data Community DC is pleased to announce a new service to the area data community: topic-specific discussion lists! In this way we hope to extend the successes of our Meetups and workshops by providing a way for groups of local people with similar interests to maintain contact and have ongoing discussions. In a previous post, we announced the formation of the Deep Learning Discussion List for those interested in Deep Learning topics. The second topic-specific discussion group has just been created, a collaboration between Charlie Greenbacker (@greenbacker) and the DC-NLP Meetup Group and Ben Bengfort (@bbengfort) and DIDC - both specialists in Natural Language Processing and Computational Linguistics.
If you're interested in Natural Language Processing and want to be part of the discussion, sign up here:
This discussion group is intended for computational linguists, data scientists, software engineers, students, faculty, and anyone interested in the automatic processing of natural language by a computer! NLP has received a big boost in recent years thanks to modern machine learning techniques, which have made tasks like automatic classification of language, as well as information extraction, part of our everyday lives. The next phase of NLP involves machine understanding and translation, text summarization and generation, and semantic reasoning across texts. These topics are at the forefront of science and should be discussed in a community of brilliant people, which is why we have created this group! From current events to interesting topics to questions and answers, please use this group as a platform to engage with your fellow data scientists on the topic of language processing!
We hope to see you on the group soon!
This post, from DC2 President Harlan Harris, was originally published on his blog. Harlan was on the board of WINFORMS, the local chapter of the Operations Research professional society, from 2012 until this summer. Earlier this year, I attended the INFORMS Conference on Business Analytics & Operations Research, in Boston. I was asked beforehand if I wanted to be a conference blogger, and for some reason I said I would. This meant I was able to publish posts on the conference's WordPress web site, and was also obliged to do so!
Here are the five posts that I wrote, along with an excerpt from each. Please click through to read the full pieces:
- more insight, less action — deliverables tend towards predictions and storytelling, versus formal optimization
- more openness, less big iron — open source software leads to a low-cost, highly flexible approach
- more scruffy, less neat — data science technologies often come from black-box statistical models, vs. domain-based theory
- more velocity, smaller projects — a hundred $10K projects beats one $1M project
- more science, less engineering — both practitioners and methods have different backgrounds
- more hipsters, less suits — stronger connections to the tech industry than to the boardroom
- more rockstars, less teams — one person can now (roughly) do everything, in simple cases, for better or worse
DJ Patil says “a data product is a product that facilitates an end goal through the use of data.” So, it’s not just an analysis, or a recommendation to executives, or an insight that leads to an improvement to a business process. It’s a visible component of a system. LinkedIn’s People You May Know is viewed by many millions of customers, and it’s based on the complex interactions of the customers themselves.
[A]s a DC resident, we often hear of “Healthcare and Education” as a linked pair of industries. Both are systems focused on social good, with intertwined government, nonprofit, and for-profit entities, highly distributed management, and (reportedly) huge opportunities for improvement. Aside from MIT Leaders for Global Operations winning the Smith Prize (and a number of shoutouts to academic partners and mentors), there was not a peep from the education sector at tonight’s awards ceremony. Is education, and particularly K-12 and postsecondary education, not amenable to OR techniques or solutions?
In 2011, almost every talk seemed to me to be from a Fortune 500 company, or a large nonprofit, or a consulting firm advising a Fortune 500 company or a large nonprofit. Entrepreneurship around analytics was barely to be seen. This year, there are at least a few talks about Hadoop and iPhone apps and more. Has the cost of deploying advanced analytics substantially dropped?
It’s worthwhile learning a bit about databases, even if you have no decision-making authority in your organization, and don’t feel like becoming a database administrator (good call). But by getting involved early in the data-collection process, when IT folks are sitting around a table arguing about platform questions, you can get a word in occasionally about the things that matter for analytics — collecting all the data, storing it in a way friendly to later analytics, and so forth.
All in all, I enjoyed blogging the conference, and recommend the practice to others! It's a great way to organize your thoughts and to summarize and synthesize your experiences.
This August, we're joined by Tony Davis, technical manager in the NLP and machine learning group at 3M Health Information Systems and adjunct professor in the Georgetown University Linguistics Department, where he's taught courses including information retrieval and extraction, and lexical semantics.
Tony will be introducing us to automatic segmentation. Automatic segmentation deals with breaking up unstructured documents into units - words, sentences, topics, etc. Search and retrieval, document categorization, and analysis of dialog and discourse all benefit from segmentation.
Tony's talk will cover some of the techniques, linguistic and otherwise, that have been applied to segmentation, with particular reference to two use cases: multimedia information retrieval and medical coding.
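For a sense of why segmentation is harder than it looks, here is a deliberately naive sentence splitter in Python. It splits after sentence-final punctuation followed by whitespace and a capital letter; abbreviations like "Dr." or "U.S." immediately defeat it, which is one reason practical segmenters use the statistical and linguistic techniques Tony will discuss:

```python
import re

# Naive sentence segmentation: split after ., !, or ? when followed by
# whitespace and a capital letter. Works on clean text, fails on
# abbreviations ("Dr. Smith"), quotations, and much else.
def split_sentences(text):
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

print(split_sentences("Segmentation is hard. Is it? Yes!"))
```

Word and topic segmentation pose analogous problems at smaller and larger scales, from unspaced scripts to detecting where one subject of discussion ends and the next begins.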
Following July's joint meetup with Statistical Programming DC and Data Wranglers, where Tommy Jones and Charlie Greenbacker performed a showdown between tools in R and Python, we're back at our usual U-Street location.
Wednesday, August 13, 2014 6:30 PM to 8:30 PM Stetsons Famous Bar & Grill 1610 U Street Northwest, Washington, DC
The DC NLP meetup group is for anyone in the Washington, D.C. area working in (or interested in) Natural Language Processing. Our meetings will be an opportunity for folks to network, give presentations about their work or research projects, learn about the latest advancements in our field, and exchange ideas or brainstorm. Topics may include computational linguistics, machine learning, text analytics, data mining, information extraction, speech processing, sentiment analysis, and much more.
For more information and to RSVP, please visit: http://www.meetup.com/DC-NLP/events/192504832/