The whirlwind of Strata 2013 wrapped up last Thursday evening with an unofficial happy hour hosted by Facebook. This conference, hosted by O'Reilly, is one of the premier data conferences of the year. If you missed this one, another is coming up in NYC in October.
Below is a chronological list of the presentations that stood out most to me, given my particular set of interests. If this list seems somewhat sparse, it is only because I kept getting tied up in fascinating conversations; truth be told, I probably spent at least half my time in the speakers' lounge. Each title links over to the Strata talk page, where slides should be available for download as PDF.
An Introduction to the Berkeley Data Analytics Stack (BDAS) Featuring Spark, Spark Streaming, and Shark - Part 1 The UC Berkeley team offered an excellent overview of a cutting-edge, open-source data stack that currently includes three components: Shark, Spark, and Mesos. Shark is an Apache Hive-compatible data warehouse system built on top of Spark that can answer HiveQL queries up to 100x faster than Hive. Spark is "a data-parallel execution engine that is fast and fault-tolerant" that offers Scala, Java, and Python APIs and can be used interactively via the Scala or Python shells. Finally, Mesos provides cluster management and can run Hadoop, MPI, Hypertable, and Spark. This is one to watch.
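To give a flavor of Spark's style of chained, data-parallel transformations, here is a conceptual sketch in plain Python; the comments map each step to the RDD API, but this is deliberately not pyspark itself, just the shape of the computation:

```python
from collections import Counter

def word_count(lines):
    """Mimics rdd.flatMap(split) -> map(word, 1) -> reduceByKey(sum)."""
    # flatMap step: explode each line into its words
    words = (w for line in lines for w in line.split())
    # map + reduceByKey steps, collapsed into a single counting pass
    return Counter(words)

counts = word_count(["spark is fast", "shark is spark plus sql"])
```

In real Spark these transformations are lazy and distributed across the cluster; the interactive shell just makes them feel like ordinary local collection operations, which is much of the appeal.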
Yes, I am a bit biased, but I thoroughly enjoyed the "Big Data for Enterprise IT Day" that Marck and I presented in. Some other standout talks included:
The Laws of Data Mining Duncan Ross distilled his experience data mining for numerous companies across industries into general abstractions about the field that certainly ring true with this practitioner. If you are a company thinking about diving into your data, this is a must-download.
Just the basics: you've probably heard about data mining and think you need a PhD to do it. Clever stuff with numbers. Predictions. Clusters. Algorithms. The 9 Laws explain the why behind the basic steps you can take to be successful as a data miner, and show that this is primarily a business discipline, not a branch of computer science.
How to Interview a Data Scientist Daniel Tunkelang provided a refreshing look at all of the things people do wrong when interviewing data scientists, along with a few much better practices. His insights rang true with my own experiences, and no, coding abstract problems or basic algorithms on a whiteboard is not the best way to tease out someone's true potential. If you are trying to hire this rare breed of technical rock star, I highly recommend looking over his slide deck or watching the talk.
It's great being a data scientist -- some are even calling it the sexiest job of the 21st century. What's not so great is trying to hire data scientists when the demand for them far outstrips the supply. Which makes it imperative that you nail the hiring process. In this session, I'll share hiring tips, and specifically what we've learned at LinkedIn about how to interview data scientists.
The Science of Managing Data Scientists Kate Matsudaira's talk was the first time I have ever heard someone speak intelligently and insightfully on the management of data scientists. She simply made too many sharp points to mention them all, but here are a few that I wholeheartedly agree with. Science is different from engineering, and data science is different from software engineering. Science explores the unknown, whereas engineering often repeats that which has been previously accomplished. This has massive ramifications for management best practices. For one, data science efforts should not be shoehorned into traditional software engineering processes just because both involve code. Trying to do so demonstrates a fundamental lack of understanding of data science. Additionally, for all of the data scientists interviewing for those high-paying jobs, run screaming from the companies that don't get this. Simply put, Kate's talk was easily one of the best of Strata.
Data science can power incredible innovation, but the most important insights typically aren't known ahead of time. This makes it challenging to manage schedules, expectations, and goals. At Decide, data science is core to our product. This talk will share lessons learned from both sides, and provide the audience with strategies to improve process and communication in their own teams.
Committing to Recommendation Algorithms Eric Colson of Stitch Fix discussed the next logical step in recommendation engines: engines so good that they select the item and ship it to you automatically, without your involvement.
While this may be a dream for some or a horrifying nightmare for others, this idea seems to be the natural evolution of ideas that have come and gone. A few decades ago there was the book- or CD-of-the-month club that automatically sent you books until you left the program (which could be notoriously hard to do). The "algorithm" determining which books to ship probably depended heavily on the cheapest books available. Now, there are online gift box clubs such as local-to-DC startup UmbaBox, which offers curated surprise boxes of artisan crafts. Here, an expert human curator handles the decision-making.
With Stitch Fix, machine learning is used to make highly personalized recommendations. Per one anecdote, the first shipment is usually quite hit or miss, but the system learns fast and users can quickly fall in love. This business model is definitely one to watch.
Recommendation algorithms have long been a valuable component of ecommerce. They drive incremental revenue by helping customers find what they are looking for. But a new shopping service is taking recommendations to the next level. Stitch Fix, with its disruptive business model, is betting big on algorithms, technology, and domain expertise to transform the way we shop. To Stitch Fix, recommendations represent more than incremental revenue; the company has oriented its entire business model around its ability to get relevant merchandise to its customers. They go beyond helping customers find what they are looking for; Stitch Fix helps customers find what is right for them. And, sometimes what’s right isn’t obvious until the customer has it in her hands.
Big Data on Small Devices: Data Science goes Mobile Yael Garten of LinkedIn is clearly a rising star in the data science world and has a fantastic personality to match. It is great to hear a data scientist share quality data, and some of the stories she told were truly fascinating. Most resonant with me was a plot showing iPad- and browser-based LinkedIn usage peaking at very different times in the US: iPad use is "coffee and couch," both before and after work, while browser access peaks during the workday. And of course, this pattern isn't necessarily seen in other cultures and countries. Check out her talk for many more such insights.
Mobile is already changing everything about the way we access information, and we’ve only begun to experience the impact of the smart devices revolution. Mobile analytics is also very different from web product analytics. With a variety of screen sizes, operating systems, and many other features, the “mobile platform” is in fact a multitude of distinct platforms. How do we recognize and even benefit from this richness and heterogeneity to create amazing products our users will love? Data science for consumer internet products relies on our ability to effectively analyze and understand ubiquitous computing in terms of a holistic product experience, as individuals consume and create data on mobile and desktop devices in their day-to-day lives. I’ll talk about mobile data science challenges — from product development to data-driven decision making.
Next-Gen Data Scientists Rachel Schutt of Johnson Research Labs spoke on a subject near and dear to my heart: what a data scientist is. Check out her talk and see how much she agrees with some of DC2's own research.
Data Science is an emerging field in industry, yet not well-defined as an academic discipline (or even in industry for that matter). I proposed the “Introduction to Data Science” course at Columbia in March, 2012. This was the first course at Columbia that had the term “Data Science” in the title. I had three primary motivations:
1) Bringing industry to students: I wanted to give students an education in what it’s like to be a data scientist in industry and give them some of the skills data scientists have. This is based on my experience as a lead analyst on the Google+ Data Science team. But I didn’t want to limit them to only my way of seeing the world, so each week, guest speakers from the NYC tech community came to teach the class.
2) I wanted to think more deeply about the science of data science: Data Science has the potential to be a deep and profound research discipline impacting all aspects of our lives. Columbia University and Mayor Bloomberg announced the Institute for Data Sciences and Engineering in July, 2012. This course created an opportunity to develop the theory of Data Science and to formalize it as a legitimate science.
3) Personal Challenge: I kept hearing from data scientists in industry that you can’t teach data science in a classroom or university setting and I took that on as a challenge. I wanted to test the hypothesis that it was possible to train awesome data scientists in the classroom.
In February 2013, 2 months will have passed since the class ended. I’ll be able to reflect on how the class went, how I thought about the curriculum, how I engaged the NYC tech community to be involved in the class, who the students were, whether I had impact on them, etc.
Algorithmic Illusions: Hidden Biases of Big Data Kate Crawford of Microsoft Research spoke on the biases hidden in much of big data, and of particular note was the following: any large data set collected from smartphone users is inherently and strongly biased in socio-economic terms and more. Yes, Twitter is not a representative sample of the US population (although many probably forget this). As a result, collecting mobile data to aid policy decisions could be problematic, and awareness is the first step toward solving the problem. While I agree, I would love to see a comparison of the outcomes of traditional ways of making decisions (egos and assumptions) against biased big-data decisions. With great power comes great responsibility; as people apparently attribute "more" truth to data-derived insights, there comes an equally important responsibility to use unbiased data.
Big data gives us a powerful new way to see patterns in information – but what can’t we see? When does big data not tell us the whole story? This talk opens up the question of the biases we bring to big data, and how we might work beyond them.
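Crawford's sampling-bias point can be made concrete with a toy simulation; every number here, and the assumed link between income and smartphone ownership, is invented purely for illustration:

```python
import random

random.seed(0)  # deterministic toy example

# Hypothetical population where higher income makes smartphone ownership
# more likely (a made-up assumption purely for this illustration).
incomes = [random.gauss(50000, 20000) for _ in range(10000)]
owners = [i for i in incomes if i > 40000]  # only smartphone owners get sampled

true_mean = sum(incomes) / len(incomes)
sampled_mean = sum(owners) / len(owners)
# sampled_mean overstates true_mean: the "mobile" sample is not representative
```

Because the mobile-only sample systematically excludes part of the population, its estimate drifts upward, which is exactly the trap when smartphone-derived data is used to inform policy.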
Human Fault-tolerance Nathan Marz's (Twitter) talk was fantastic. Succinctly put: all data stores should permanently record time-indexed data that is read-only and immutable. All applications that sit atop this treasure trove of truth are views into the data and cannot alter the data store. Thus, it becomes much harder for a developer to accidentally delete, destroy, or corrupt data. Does this level of abstraction sound familiar to the web app developers out there?
There’s been a huge amount of progress in recent years in developing distributed systems that are resilient to all sorts of faults. However, there’s one critical category of errors that has largely been ignored: human error. The scope and potential impact of human error is massive: deployed bugs, accidentally deleting data, accidentally DDoS’ing important internal services, and so on. Designing for human fault-tolerance leads to important conclusions on the fundamental ways data systems should be architected.
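The append-only, time-indexed store Marz advocates can be sketched in a few lines; this is a toy illustration of the principle, not his actual architecture:

```python
import time

class ImmutableLog:
    """Append-only, time-indexed fact store: facts are added, never
    mutated or deleted, so derived views cannot corrupt history."""

    def __init__(self):
        self._facts = []

    def append(self, fact):
        self._facts.append((time.time(), fact))

    def view(self, predicate):
        # Views are read-only functions over the facts; they derive
        # answers but have no way to alter the underlying store.
        return [f for _, f in self._facts if predicate(f)]

log = ImmutableLog()
log.append({"user": "alice", "email": "a@x.com"})
log.append({"user": "alice", "email": "a@y.com"})  # a new fact, not an update
latest = log.view(lambda f: f["user"] == "alice")[-1]  # "current" email
```

An "update" is just a newer fact: a buggy view can compute the wrong answer, but it cannot destroy the underlying truth, which is exactly the human fault-tolerance Marz is after.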
Introducing Julia - a New Open Source Mathematical Programming Language Yup, Julia is still awesome and Michael Bean of Forio Simulations did a nice job restating this point. If you aren't playing with it, what are you waiting for? And yes, there is even an IDE, Julia Studio, so you have absolutely no excuses not to jump on the bandwagon.
Data analytics and the mathematical programming languages that support them are changing the world but they are also suffering from growing pains. Technical computing languages that have been around for decades have been slow to adopt new compiler technologies such as JIT, optional type indications, and others.
Julia was introduced to solve these limitations. Julia is built on a solid foundation of JIT compiling, parallelism, and a mathematical syntax that will look familiar to users of other mathematical languages. Julia supports a rich type system. Scalars, vectors, arrays, tuples, composite types and several others can be defined in Julia. Julia is designed for technical computing and supports a fully remote cloud computing mode. Julia is free, open source, and library-friendly. The core Julia language is licensed under the MIT free software license.
LinkedIn Endorsements: Reputation, Virality, and Social Tagging Another great talk from the team at LinkedIn. While I remain skeptical as to the actual validity of endorsements, they have proven to be wildly popular and are the fastest-growing data product LinkedIn has ever created. While we are on the topic, I would trust endorsements more if only individuals who themselves have a skill could endorse others in that skill (or at least if expert endorsements were weighted more heavily).
Beyond the great presentation by Sam Shah and Peter Skomoroch, I loved this talk because it highlighted the sheer level of effort, technology, data, and math required to build what might appear to be a simple concept to those who have never worked with data. Can I get a Hallelujah?
Endorsements are a one-click system to recognize someone for their skills and expertise on LinkedIn, the largest professional online social network. This is one of the latest “data features” in LinkedIn’s portfolio, and the endorsement ecosystem generates a large graph of reputation signals and viral user activity.
Underneath this feature, there are several interesting and difficult data questions:
1. How do you automatically create a taxonomy of skills in the professional context?
2. How do you disambiguate between different contexts of skills? For instance, “search” could mean information retrieval, search & seizure, search & rescue, among others.
3. How can you leverage data to determine someone’s authoritativeness in a skill?
4. How do you use that authoritativeness to recommend people to endorse?
5. How do you optimize a complex large scale machine learning system for viral growth & engagement?
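Question 3 connects to the weighting idea I raised above; a naive sketch of authoritativeness-weighted endorsements (my own toy scoring rule, emphatically not LinkedIn's algorithm) might look like:

```python
def endorsement_score(endorsers, authority):
    """Score one skill by weighting each endorsement by the endorser's
    own authoritativeness in that skill (0..1).

    endorsers: list of endorser names for the skill.
    authority: dict mapping endorser -> expertise in that skill.
    Endorsers of unknown expertise count with a small baseline weight."""
    baseline = 0.1  # arbitrary choice for this sketch
    return sum(authority.get(e, baseline) for e in endorsers)

score = endorsement_score(
    ["expert_anna", "novice_bob"],          # hypothetical endorsers
    {"expert_anna": 0.9},                   # only anna has known authority
)
# the expert's endorsement dominates the score
```

Even this toy version shows why the problem is hard: the authority scores themselves must come from somewhere, which circles right back to questions 1 through 3.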