Earlier this year, several of us from the DC2 community, Harlan Harris (that's me), Marck Vaisman, and Sean Murphy, conducted a web-based survey of Data Scientists, with the goal of better understanding the varieties of people, skills, and experiences that fall under this rather broad buzzword. We have analyzed the results from over 250 respondents, and are excited to share some initial findings here!
The first task in the survey was to rank a set of 21 skill categories. We used the technique of non-negative matrix factorization (NMF) to find five underlying dimensions of variation among the rankings. We found that Data Scientists' skills tend to cluster together, and that grouping those skills gives people a useful shorthand. Here are the skill groups, with category names that we think clarify what we as Data Scientists bring to the table:
- Programming: Back-end Programming, Front-end Programming, Systems Administration
- Stats: Classical Statistics, Data Manipulation, Science, Spatial Statistics, Surveys and Marketing, Temporal Statistics, Visualization
- Math: Algorithms, Bayesian/Monte Carlo Statistics, Graphical Models, Math, Optimization, Simulation
- Business: Business, Product Development
- Machine Learning/Big Data: Big and Distributed Data, Machine Learning, Structured Data, Unstructured Data
Clearly, not everyone who is strong in some aspects of one of these categories will be expert in every area of it. But, as a general rule, the skills within each group co-occur. Equally important, a Data Scientist who has skills in Machine Learning and Big Data may have little expertise in Surveys or Front-End Programming.
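For readers curious about the mechanics, here is a minimal sketch of how NMF can pull skill groups out of a ranking matrix. It is an illustration under stated assumptions, not the code behind the survey analysis: the six respondents and four skills are invented, the conversion from rank to score (number of skills minus rank) is an assumption, and the helper names `matmul`, `transpose`, `nmf`, and `error` are hypothetical. The factorization uses the well-known multiplicative update rule for squared error, which is one common way to compute NMF; in practice a library implementation such as scikit-learn's `NMF` would be the more usual choice.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def error(v, w, h):
    """Squared reconstruction error between v and the product w * h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))

def nmf(v, k, iters=500, seed=0, eps=1e-9):
    """Factor non-negative v (n x m) into w (n x k) and h (k x m)
    with multiplicative updates that keep every entry non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update h: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update w: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

# Invented example data: each row is one respondent's ranking (1 = strongest).
skills = ["Back-end Programming", "Systems Administration",
          "Classical Statistics", "Visualization"]
rankings = [
    [1, 2, 3, 4],
    [2, 1, 4, 3],
    [1, 2, 4, 3],
    [3, 4, 1, 2],
    [4, 3, 2, 1],
    [4, 3, 1, 2],
]

# Assumption: turn ranks into non-negative scores so that higher means stronger.
scores = [[len(skills) - r for r in row] for row in rankings]

k = 2  # number of underlying dimensions to look for in this toy example
w, h = nmf(scores, k)

# Assign each skill to the factor on which it loads most heavily.
for j, name in enumerate(skills):
    factor = max(range(k), key=lambda i: h[i][j])
    print(f"{name}: factor {factor}")
```

Each row of `h` describes one underlying dimension as a set of weights over skills, and each row of `w` describes how strongly one respondent expresses each dimension. Grouping skills by their heaviest factor is the same idea, at toy scale, as the five groups listed above.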
We performed a similar NMF analysis on a series of self-evaluation questions near the end of the survey. Respondents gave "Completely Agree" to "Completely Disagree" responses to statements that started with "I think of myself as a(n)..." We view the Self-Identification groups that fell out of the NMF analysis as being critical to clarifying the diverse backgrounds and interests of Data Scientists. Here is how the responses to these questions grouped, along with category names that we feel are useful:
- Data Businessperson: Business person, Leader, Entrepreneur
- Data Creative: Artist, Jack-of-All-Trades, Hacker
- Data Researcher: Scientist, Researcher, Statistician
- Data Engineer: Engineer, Developer
Many people responded positively to many of these self-ID questions, but the analysis shows underlying dimensions of variation that can inform people's career paths and interests. Even more fascinating, the two groupings we identified, skills and self-ID, correlate in ways that we think are highly valuable to Data Scientists and to organizations that need our skills. The graph below shows how survey participants, labeled by their primary (by strongest factor loading) skill group and their primary self-ID group, arrange themselves in a cross-tabulation table.
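To make the cross-tabulation concrete, here is a small sketch of the labeling step: each respondent gets the skill group and the self-ID group with the strongest factor loading, and the pairs are then counted. The loadings for the four respondents are invented purely for illustration, and the helper `primary` is a hypothetical name; only the group names come from the lists above.

```python
from collections import Counter

SKILL_GROUPS = ["Programming", "Stats", "Math", "Business",
                "Machine Learning/Big Data"]
SELF_ID_GROUPS = ["Data Businessperson", "Data Creative",
                  "Data Researcher", "Data Engineer"]

def primary(loadings, names):
    """Return the name of the factor with the largest loading."""
    best = max(range(len(loadings)), key=lambda i: loadings[i])
    return names[best]

# Invented loadings: (skill-factor loadings, self-ID-factor loadings) per respondent.
respondents = [
    ([0.9, 0.1, 0.2, 0.0, 0.3], [0.1, 0.2, 0.1, 0.8]),
    ([0.1, 0.8, 0.3, 0.1, 0.2], [0.0, 0.1, 0.9, 0.2]),
    ([0.2, 0.7, 0.4, 0.0, 0.1], [0.1, 0.3, 0.7, 0.1]),
    ([0.0, 0.1, 0.1, 0.9, 0.2], [0.8, 0.2, 0.1, 0.0]),
]

# Count how many respondents fall into each (skill group, self-ID group) cell.
crosstab = Counter(
    (primary(skill, SKILL_GROUPS), primary(self_id, SELF_ID_GROUPS))
    for skill, self_id in respondents
)

# Print one row per self-ID group, one column per skill group.
for self_id in SELF_ID_GROUPS:
    row = [crosstab[(skill, self_id)] for skill in SKILL_GROUPS]
    print(f"{self_id:<20}", row)
```

Taking only the strongest loading discards the fact that many respondents load on several factors at once, so a table like this is a summary of tendencies rather than a strict classification.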
We'd love to share more results with you! If you are in the Washington, DC area on August 27th, please come see us talk about the survey results at the Data Science DC Meetup! And if you'll be attending DataGotham in New York City on September 14th, we'll be presenting highlights there too! Otherwise, stay tuned for future presentations and publications. If you have any specific questions that we might be able to answer as we further explore the data, please email us!
Harlan (harlan at datacommunitydc.org), Sean (seanm at datacommunitydc.org), Marck (marck at datacommunitydc.org)