This weekend, I had the chance to interview Marck Vaisman and talk to him about his background, his preferred data science tools, how he got connected in the data community, the future of data science, contributing to an O'Reilly book, and some of data science's growing pains.
Marck is a lot of things - he is a co-founder of Data Community DC, he runs the R Users DC meetup, he's a Data Scientist at DeepMile, Owner & Principal Data Scientist at DataXtract LLC, and a contributing author to the O'Reilly book The Bad Data Handbook.
Here's what we talked about... Tony: Let's start off by talking a little about Deep Mile. What is Deep Mile and what do you guys do there?
Marck: Deep Mile is a small local consulting firm that historically had been doing a lot of work for government clients - a lot of network analysis with private data. Basically, they took those methodologies that they developed into the commercial space. Now, we're doing work for both government and commercial clients. You can call it a local analytics and data science consultancy.
We're also developing a product that we hope to launch soon, which is basically a new way to measure internet audience and traffic and provide customers with detailed data about different levels of web-related activity.
Tony: Are you working on that project specifically?
Marck: I am, yes. So since we're a small company I get to do all sorts of stuff. A lot of the work I do is data science tasks related to product development or client work. That can be anything from collecting data, mining data, finding insights, building models, etc.
Tony: What tools do you use to do your job on a day to day basis?
Marck: R, Linux shell tools, and Hadoop when I need to process a lot of data. I'm not using Hadoop that much right now, but I do use it when I need to. So basically R, the command line, SQL, Python, Hadoop and its associated projects (mostly Pig; I'm not using Hive very much these days) - all of this either on my local laptop or on Amazon's cloud.
Tony: So I know that you've worked at a bunch of different companies throughout your career - everything from startups to larger companies to doing consulting. Can you tell me a little bit about that - the different challenges and what you liked about each?
Marck: Sure. The things that I liked about working for a small company were the fact that you can wear many hats and you have a lot more freedom. I could use the tools that I thought were necessary, and I didn't really have to ask for approval. Just as long as I got the job done, I could use the tools that I needed to without any roadblocks. If I needed to buy a software license for Matlab and spend $2,000, I'd just do it. It was mostly open source stuff though, so cost wasn't much of an issue. Or if I needed to use the cloud, I would just spin up a hundred node Hadoop cluster, spend whatever I needed to spend on it, and get the job done.
Tony: So no structure in place where you would need to run things by IT like there is in bigger organizations?
Marck: No. When I was doing consulting for clients I would get approval. Especially when I would need to run a large job and spend a couple hundred dollars, but they were usually OK with that. But in that sense it's sort of the decentralized aspect of it - being able to just really focus on my job and not have to deal with internal BS.
At the bigger companies, it's a different story. But it's sort of the nature of the beast, at least from a tools perspective. The tool sets were much more limited. For example, I could have R installed on my computer - that wasn't much of a problem - but getting access to the data, being able to work with the systems, and adding additional packages would have to go through an internal approval cycle, which would take some time. So even though you have all these great tools available, you can't get the most out of them. So I had to get creative a lot of times, do what I could do with the systems, and then pull the data out to a local machine and just do the rest there. It was painful, and it added a lot of time to the process.
Tony: Do you think having processes like that in place hinders larger organizations?
Marck: Yeah, and I talk about this in my chapter of The Bad Data Handbook. One of the commandments has to do with giving your data scientist one tool for all jobs. You can't just depend on the one tool you have access to because that's all they're going to give you access to. I understand security and that there are policies and procedures in place in large organizations. And that's fine. But I advocate that in cases like that you have to have a partner on the IT side - the folks that are managing the tool sets and the infrastructure. Find a way to make it work for the organization and not let it be an impediment.
Tony: Out of all the tools and processes that you work with, which do you enjoy working with the most and why?
Marck: Well, it shouldn't be any secret that R is probably my favorite tool, for many reasons. I started using R about 3 or 4 years ago, coming from an environment where the tools I used most were SQL and Excel, and learning R and working with it just created all these possibilities to do things that I wasn't able to do before.
So in terms of the workflow and the process of going through and working with data, working with R is just very natural because of the data structures it has and the way you work with the language. The fact that you have 4,000+ packages gives you the ability to do anything and everything. Granted I probably use R for some things that it's not the best tool for, but it works.
Tony: How did you get involved in the data scene here (DC)?
Marck: That's a funny story actually. This process has been evolving since the Summer of 2009. One day, I stumbled upon Mike Driscoll's post The Three Sexy Skills of Data Geeks, and it really resonated with me because I was doing a lot of these things and it was great to find out that there were other people out there doing the same type of work. I started following Mike on Twitter and saw him tweet that he was going to be in DC, so I sent him an email saying "Hey, I saw your blog post. It really resonated with me and I'd just love to meet up with you and chat." So we did.
We met, and I was telling him about the work I was doing and how I wanted to get more involved in doing more applied data analysis. I knew about the basic statistical techniques back then, but I wanted to get more into learning about machine learning and all the computational aspects of working with data. And his advice to me was two things: you have to learn R and get more familiar with the statistical elements of machine learning.
Mike was based out of the west coast, but he spent a lot of time here in DC. He was one of the founders of a company called CustomInk, so he has ties to the area. And he introduced me to Peter Skomoroch, who was also doing a lot of work with Hadoop back then and was still in this area before he went to work for LinkedIn.
So I basically started using Twitter to follow all these folks who were doing interesting work, communicating with them, etc. And then that same year, I went to Hadoop World for the first time and started learning about Hadoop and the tool sets and the technologies. Then a couple months after that, probably in early 2010, I volunteered to run the R Users group here in DC. Someone had created the meetup in 2009, but it had sort of died and the Revolution Analytics folks that were driving that initiative put out a call for someone to step up and volunteer. I just raised my hand and said alright I'll do it.
I did all this without knowing much R. But it was great because I really started small, built the community, started getting the word out there, and started using the tool more. I was learning R through the group as I was using it. And then I knew a lot of the tech people here because I had been involved in the tech sector for a while.
So running the meetup gave me a way to meet other people and start building a community. Harlan moved here from New York in 2011. I had been thinking of doing something like Data Science DC for a while and once Harlan moved - I had known him for some time by then - we kicked around the idea, he talked to Matt, and we decided to start up Data Science DC. And that's when it really took off. By then I was running the R group for about a year, and I knew and had communicated with a lot of the most influential people that were working in this space around the country. So I think that's what really got me in the middle of things and to where I am today in terms of the community.
Tony: What do you find most exciting about data science?
Marck: To me, data science is a way of getting to understand how things work. I've always been a very curious person by nature. I've joked about this, but I think there's some truth to the idea that data science is part of a modern approach to the scientific method - using modern tools and techniques to process data and learn about the world. Granted, a lot of the work we're doing is focused on online metrics - the way we live a digital lifestyle and collect data, and using it to build products, to understand how things work. But to me it's more than that, it's just being able to answer questions about different things, finding patterns in the data about whatever you're curious about. You can find a data set online about pretty much anything these days. And that's a great way to learn how things work and understand the relationships between things. So that's what data science is to me. It's using the statistical and mathematical tools and techniques that we have available today to answer the hard questions that couldn't be answered before.
Tony: Where do you see the future of data science going - say 5 or 10 years out?
Marck: Well, first of all I think there's a lot of development happening in the tool sets, especially making very technical tools more accessible to people. On one side there is the business side of data science and on the other side there is the practitioner side. On the business side, there are a lot of companies that are building tools and software that allow you to, for example, run a Hadoop job with the push of a button - abstracting the layer for less technical people.
So I think there's going to be some consolidation of tools - better tools to do this kind of work and the tools are going to be easier to use. But because every data set has its own idiosyncrasies, I don't think there's ever going to be a one-size-fits-all solution.
As a set of techniques, I think it's going to make its way into pretty much everything. There's hype about the whole data science thing right now, but that hype is probably going to die in the next 2 years. I think there's always going to be work to do in this field, at least for the foreseeable future. And if you're good - if you know how to work with data, if you know how to get insights out of the data (not just taking data and building a chart, but being the intermediary between what are the questions you're trying to answer, how are you going to get there, here are the tools and techniques that you're going to use, etc.) I think there's definitely going to be a need for talent.
Tony: So for people that want to expand their skills and learn more about data science, what kind of advice would you give them? What kind of things would they need to learn and what kinds of things would they need to get into?
Marck: If you want to get into this field, you want to start small. Find a data set online, download it, and then start exploring it. Start learning R because R is just a great tool to work with data. Look at the data, plot it, build a model, ask some questions and see if you can find answers to them. It's very iterative, and if you haven't done data analysis before you're not going to know the tricks and the nuances until you actually start doing it. There's no magic formula. There's no checklist. You take a data set - it doesn't have to be big - download it into R, do some summaries, do some plots, look at the relationships between variables, etc. That's mostly where everything starts.
And even when you're working with big data, you're not working with big data all together. You aggregate and then you compare it. Let's say you're looking at transactions over time. You aggregate per time slice. So a lot of times you take very big data, you make it smaller, and then the actual data you're analyzing ends up being a lot smaller than what you started with.
You want to understand the basic statistical techniques. You want to know some computing and programming. Learn R, learn some Python. Learn to use the shell because a lot of times the amount of work required to clean and organize a data set is far greater than for the analysis itself. A lot of it is learning how to munge - be a great munger. I can't stress this enough.
As for the rest? Read publications, follow people on Twitter, read the blogs, there are some great books out there, know your basic stats, know your basic math. It's really not hard to get into this. But if you have a passion for it, you're going to find that you're always wanting to learn new techniques and doing more and more things with the data.
Tony: You mentioned following people. Who are some of the people that you follow and who are some of the people you look up to in the data community?
Marck: Well, I don't spend that much time on Twitter these days because it takes up too much time. But I have a list that is an amalgamation of folks, and these are some of the folks I draw insights from for various different things. There's John Myles White who wrote the book Machine Learning for Hackers - big R guy, really smart guy. Some neuroscientists. Hans Rosling, who is a big name in the TED circles and is always talking about using data to solve the world's problems. Hilary Mason is on the list. JD Long, who is someone I met early in the data circles a long time ago. He was in Chicago - he lives in Bermuda now. He's a really smart guy, really funny. He doesn't always tweet about data but he always has these amazing insights. There's this fellow called @gappy3000 who is very cynical. He's in the New York finance community - really smart guy. He talks about all sorts of things, but he always asks the hard questions that nobody else is asking. Then there's @siah. He was at Berkeley, he comes from the operations research side of things, and he tweets a lot about tools like C and R and Python.
In terms of understanding trends and what's happening in the data science field from within the technology industry I follow what the LinkedIn team is doing. DJ Patil, who's been involved in building the data science teams there. I think DJ has a lot of good insights from a business and organizational standpoint.
There's different people that talk about different levels of the field. Even within the DC community, there's always someone who has something to say that's interesting and you learn something new.
Tony: My final question is about the O'Reilly book The Bad Data Handbook. Can you tell me a little about the book in general and then the chapter you wrote?
Marck: The Bad Data Handbook is a collection of articles or short stories about the different challenges that come up when you're dealing with data. It's a combined work among 17 authors. There are definitely some chapters that are very technical that give you tools and tips on doing specific tasks. It covers a lot of the pain points of doing data work. My chapter got started as a collection of horror stories from the organizational side of the equation. You asked me earlier about the differences between working with big and small organizations. A lot of these stemmed from frustrations about working mostly in large organizations. I had been building up this collection of anecdotes of things that had really frustrated me, mostly in trying to do my work and not being able to.
About a year ago at the Strata conference, I had a conversation with Q. Ethan McCallum, who is the main author of the book. He told me he was putting together this project about data horror stories, and at the time I told him I didn't know if I had enough material but if I did, I would think about it. So this idea was in the back of my mind, and then in the middle of the Summer, once I had amassed enough anecdotes, I emailed him and said if you're still looking for stories, I want to write about the organizational aspects. And when I told him about the idea, he said yes, absolutely.
I started writing the anecdotes and then I realized that there were common threads among them. So I tried to take a tongue-in-cheek, slightly humorous approach to this. The chapter is called The Dark Side of Data Science and it focuses on the mostly organizational and technical pitfalls that you want to avoid if you want to be successful in your data science efforts from an organizational perspective. What I did was turn these anecdotes into 5 commandments.
The first commandment is: Know nothing about thy data. And the idea here is that it's sometimes mind-boggling how you come into an organization, they have all this data, and they really know nothing about it. Maybe you don't know the insights and that's why you're working with a data scientist, but you should understand how the data gets generated, understand where it lives, know its nuances, give things descriptive names so that you know what they mean, etc.
The second commandment is: Thou shalt provide your data scientist with a single tool for all tasks. There is no one-tool-fits-all approach in this field. A lot of times they give you a hammer and they say go fish. It doesn't work.
The third one is: Thou shalt analyze data for analysis' sake only. Analysis for the sake of analysis is going to get you nowhere. You have to have at least a vague idea of where you want to go or what you want to explore. I've heard stories about people being handed data sets and being told find me a golden nugget. That's not how it works.
Number four is: Thou shalt compartmentalize learnings. In large organizations, there are usually silos. This is not new. Folks are doing similar work and they're not sharing. So it's really frustrating when you've done a lot of work and spent a lot of hours and then you find out someone did this like 6 months ago and it's in a file somewhere.
And the last one (one of my favorites): Thou shalt expect omnipotence from data scientists. Sean, Harlan, and I put together a survey about the skills of data scientists. It's really amazing to see how some folks expect one person to do it all and to do it all overnight. That's not going to happen. This is a field that's still being defined. It's a field that's growing so there's a lot of nomenclature and communication problems, but the idea is that a data science initiative is probably going to be a team effort as opposed to an individual effort. One person can do lots of things, don't get me wrong. But bigger picture, it's going to be a team effort where you have a variety of skills that are complementary.
Tony: And that's different from the job ads you currently see out there?
Marck: The job ads, they list everything. We want the rock star programmer, we want the machine learning expert, we want NLP, we want the person who's written all these algorithms, who's created a distributed back end system, who's built a company, etc. Some of these are just really funny requirements where you look at them and say yeah, that's a stretch.
Tony: Have you talked to any employers a couple months down the road to see if they've ever found someone with all the skills they were looking for?
Marck: Yes and no. A lot of times, I just see that there's a job posting that's been circulating for a really really long time. So that tells me that either they haven't filled it, they haven't found the right person, or they're being too picky. I've actually talked to some people that have been looking for folks that have told me a few months down the line that it's been really tough to find someone. And I know that the people I'm talking to aren't taking an unrealistic approach, they're actually looking for talent. They do say that a lot of people are calling themselves data scientists and they really just don't have a clue.
It's hard to label yourself as a data scientist because then people expect you to be able to do all these sorts of things. Even in my case, I do consider myself to be a data scientist and I do have enough knowledge in the field across all the different parts of it. But I'm not a machine learning guy. I know how to do applied machine learning, I know all the basic algorithms and how they work, I understand how to apply and use them. But if someone says I'm looking for a data scientist and I want someone to build me the next best algorithm, that's beyond my area of expertise. On the other hand, they may want you to build and manage a cluster. I can use Hadoop, I know how to run jobs, but I'm not a sys admin. So that's why I'm saying don't expect omnipotence from your data scientist. Data scientists are smart people, they are curious people, they can do all sorts of stuff, but be reasonable in your expectations.