Management

The Evolution of Big Data Platforms and People

This is a guest post by Paco Nathan. Paco is an O’Reilly author, an Apache Spark open source evangelist with Databricks, and an advisor for Zettacap, Amplify Partners, and The Data Guild. Google lives in his family’s backyard. Paco spoke at Data Science DC in 2012.

A kind of “middleware” for Big Data has been evolving since the mid-2000s. Abstraction layers help make it simpler to write apps in frameworks such as Hadoop. Beyond the relatively simple issue of programming convenience, there are much more complex factors in play. Several open source frameworks have emerged that build on the notion of workflow, exemplifying highly sophisticated features. My recent talk, Data Workflows for Machine Learning, considers several OSS frameworks in that context and develops a kind of “scorecard” to help assess best-of-breed features. Hopefully it can help you decide which frameworks suit your use case.
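To give a concrete sense of what such an abstraction layer buys you, here is a minimal word-count sketch in Python using the mrjob library – one example of a layer over Hadoop, chosen here purely for illustration; it is not one of the frameworks scored in the talk:

```python
# Word count over Hadoop via mrjob, an abstraction layer that hides
# the raw MapReduce plumbing. Assumes mrjob is installed (pip install mrjob).
from mrjob.job import MRJob


class WordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts across all mapper outputs.
        yield word, sum(counts)


if __name__ == "__main__":
    WordCount.run()
```

The same script runs locally (python word_count.py input.txt) or on a cluster (add -r hadoop), which is precisely the convenience these layers provide.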

By definition, a workflow encompasses both the automation we’re leveraging (e.g., machine learning apps running on clusters) and the people and process around it. In terms of automation, some larger players have departed from “conventional wisdom” for their clusters and ML apps. For example, while the rest of the industry embraced virtualization, Google avoided that path by using cgroups for isolation. Twitter sponsored a similar open source approach, Apache Mesos, which was credited with helping resolve their “Fail Whale” issues prior to their IPO. As other large firms adopt this strategy, the implication is that VMs may have run out of steam. Certainly, single-digit utilization rates at data centers (the current industry norm) will not scale to handle IoT data rates: energy companies could not handle that surge, let alone the enormous cap-ex implied. I'll be presenting on Datacenter Computing with Apache Mesos next Tuesday at the Big Data DC Meetup, held at AddThis. We’ll discuss the Mesos approach of mixed workloads for better elasticity, higher utilization rates, and lower latency.
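For readers unfamiliar with cgroups, here is a minimal Python sketch of the isolation primitive itself: capping a process at roughly half of one CPU core via the cgroups v1 cpu controller. It assumes a Linux host with cgroups mounted at /sys/fs/cgroup and root privileges, and it is only an illustration – not how Google or Mesos wires this up in practice:

```python
import os

# Create a cgroup and cap its members at quota/period = 0.5 of a core.
CG = "/sys/fs/cgroup/cpu/demo"
os.makedirs(CG, exist_ok=True)

with open(os.path.join(CG, "cpu.cfs_period_us"), "w") as f:
    f.write("100000")  # scheduling period, in microseconds
with open(os.path.join(CG, "cpu.cfs_quota_us"), "w") as f:
    f.write("50000")   # CPU time allowed per period: 50%

pid = os.fork()
if pid == 0:
    # Child: join the cgroup, then burn CPU under the cap.
    with open(os.path.join(CG, "tasks"), "w") as f:
        f.write(str(os.getpid()))
    for _ in range(10**8):
        pass
    os._exit(0)
else:
    os.waitpid(pid, 0)
```

The appeal over VMs is that the cap is enforced directly by the kernel scheduler, with no hypervisor overhead – which is what makes the mixed-workload, high-utilization model practical.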

On the people side, a very different set of issues looms ahead. Industry is retooling on a massive scale. It’s not about buying a whole new set of expensive tools for Big Data. Rather, it’s about retooling how people in general think about computable problems. One vital missing component may well be advanced math in the hands of business leaders. Seriously, we still frame requirements for college math in Cold War terms: years of calculus were intended to filter out the best Mechanical Engineering candidates, who could then help build the best ICBMs. However, in business today the leadership needs to understand how to contend with enormous data rates while deploying high-ROI apps at scale: how and when to leverage graph queries, sparse matrices, convex optimization, Bayesian statistics – topics that are generally obscured beyond the “killing fields” of calculus.

Just Enough Math, a new book by Allen Day and me now in development at O’Reilly, introduces advanced math for business people, especially so they can learn how to leverage open source frameworks for Big Data – much of which comes from organizations that leverage sophisticated math, e.g., Twitter. Each morsel of math is considered in the context of concrete business use cases, with lots of illustrations and historical background – along with brief code examples in Python that one can cut and paste.
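As a taste of that format, here is the kind of brief, paste-able Python snippet the book pairs with a use case. This particular one – a sparse matrix of user-product ratings built with scipy – is my own illustration here, not an excerpt:

```python
import numpy as np
from scipy.sparse import csr_matrix

# (user, product) -> rating; most cells are empty, so a dense matrix
# would waste memory at scale. Data here is made up for illustration.
rows = np.array([0, 0, 1, 2])   # user ids
cols = np.array([0, 2, 1, 2])   # product ids
vals = np.array([5.0, 3.0, 4.0, 1.0])

ratings = csr_matrix((vals, (rows, cols)), shape=(3, 3))

print(ratings.nnz)           # number of stored (non-zero) ratings
print(ratings.mean(axis=0))  # average rating per product
print(ratings.toarray())     # dense view, for tiny examples only
```

The point of each morsel is the pairing: a recommender-system use case motivates the data structure, and a few lines of code make it tangible.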

This next week in the DC area I will be teaching a full-day workshop that includes material from all of the above:

Machine Learning for Managers
Tue, Apr 15, 8:30am–4:30pm (Eastern)
MicroTek, 1101 Vermont Ave NW #700, Washington, DC 20005

That workshop provides an introduction to ML – something quite different from popular MOOCs or vendor training – with emphasis placed as much on the “soft skills” as on the math and coding. We’ll also have a drinkup in the area, to gather informally and discuss related topics in more detail:

Drinks and Data Science
Wed, Apr 16, 6:30–9:00pm (Eastern)
location TBD

Looking forward to meeting you there!

Selling Data Science: Validation

We are all familiar with the phrase "We cannot see the forest for the trees," and this certainly applies to us as data scientists. We can become so involved with what we're doing, what we're building, and the details of our work that we don't know what our work looks like to other people. Often we want others to understand just how hard it was to do what we've done, just how much work went into it, and sometimes we're vain enough to want people to know just how smart we are.

So what do we do? How do we validate one action over another? Do we build the trees so others can see the forest? Must others know the details to validate what we've built, or is it enough that they can make use of our work?

We are all made equal by our limit of 24 hours in a day, and we must choose what we listen to and what we don't, what we focus on and what we don't. The people who make use of our work must do the same. The old philosophical thought experiment asks, "If a tree falls in the woods and no one is around to hear it, does it make a sound?" Likewise, if we explain all the details of our work and no one gives the time to listen, will anyone understand? To what will people give their time?

Let's suppose that we can successfully communicate all the challenges we faced and overcame in building our magnificent ideas (as if anyone would sit still that long), what then?  Thomas Edison is famous for saying, “I have not failed. I've just found 10,000 ways that won't work.”, but today we buy lightbulbs that work, who remembers all the details about the different ways he failed?  "It may be important for people who are studying the thermodynamic effects of electrical currents through materials." Ok, it's important to that person to know the difference, but for the rest of us it's still not important.  We experiment, we fail, we overcome, thereby validating our work because others don't have to.

Better to teach a man to fish than to provide for him forever, but there are an infinite number of ways to fish successfully. Some approaches may be nuanced in their differences, while others may be so wildly different that they seem unrecognizable, unbelievable, and invite incredulity. The catch (no pun intended) is that methods are valid because they yield measurable results.

It's important to catch fish, but success is neither consistent nor guaranteed, so groups of people may fish together and share their bounty so that everyone is fed. What if someone starts using an unrecognizable and unbelievable method of fishing? Will the others accept this "risk" and share their fish with someone who won't use the "right" fishing technique – their technique? Even if it works the first time, that may simply be a fluke, they say, and we certainly can't waste any more resources "risking" hungry bellies, now can we?

So does validation lie in the method or the results? If you're going hungry, you might try a new technique, or you might keep faith in what's worked until the bitter end. If a few people can catch plenty of fish for the rest, let the others experiment. Maybe you're better at making boats, so both you and the fishermen prosper. Perhaps someone else is willing to share the risk because they see your vision, your combined efforts giving you both a better chance at validation.

If we go along with what others are comfortable with, they'll provide fish. If we have enough fish for a while, we can experiment and potentially catch more fish in the long run. Others may see the value in our experiments and provide us fish until we start catching our own. In the end you need fish, and if others aren't willing to give you any, you have to get your own – whatever method yields results.

Selling Data Science

Data Science is said to include statisticians, mathematicians, machine learning experts, algorithm experts, visualization ninjas, etc., and while these categories may be useful in recognizing necessary skills, selling our ideas is about execution. Ironically, there are plenty of sales theories and guidelines – SPIN selling, the iconic "ABC" scene from Boiler Room, or my personal favorite from Glengarry Glen Ross – that tell us what we should be doing, what questions we should be asking, how a sale should progress, and of course how to close, but none of these address the thoughts we may be wrestling with as we navigate conversations. We don't necessarily mean to complicate things; we just become accustomed to working with other data science types, and we still must reconcile how we communicate with our peers versus people in other walks of life, who are often geniuses in their own right.

We love to "Geek Out", we love exploring the root of ideas and discovering what's possible, but now we want to show people what we've discovered, what we've built, and just how useful it is. Who should we start with? How do we choose our network? What events should we attend? How do we balance business and professional relationships? Should I continue to wear a suit? Are flip-flops too casual? Are startup t-shirts a uniform? When is it appropriate to talk business? How can I summarize my latest project? Is that joke OK in this context? What is "lip service"? What is a "slow no"? Does being "partnered" on a project eventually lead to paying contracts? What should I blog about? How detailed should my proposal be? What can I offer that has meaning to those around me? Can we begin with something simple, or do we have to map out a complete long-term solution? Can I get along professionally with this person/team on a long-term project? Can I do everything that's being asked of me, or should I pull a team together? Do I have the proper legal infrastructure in place to form a team? What is appropriate "in kind" support? Is it clear what I'm offering?

The one consistent element is people: whom we'd like to work with, and how. This post kicks off a new series that explores these issues and helps us find the balance between geeking out and selling the results, between creating and sharing.

Maker's Schedules Versus Manager's Schedules and Why It Matters for Data Scientists

Paul Graham of Y Combinator wrote a fascinating article in July 2009 about the "Maker's Schedule" versus the "Manager's Schedule." In it, he describes how managers and makers (software programmers) define and schedule their time differently, and the implications this has for meetings, productivity, and possible tensions between the two groups.

While I will be the first to point out the large differences between software engineering and data science, the scheduling mentality of the maker is quite similar to that of the data scientist: large blocks of uninterrupted time (think half or full days) are required to do work, and work is defined as the creation of something, be it an analysis, a new methodology, or a visualization. In contrast, managers think in hourly blocks, with the meeting itself being the product created – the unit of work. Thus, for the data scientist, a single meeting at 10am can completely destroy a half-day block of potential productivity.

Paul's insights ring true with my personal experience, and he advocates several strategies to mitigate this potential conflict. Simply put, I highly recommend reading the article, for data makers and data managers alike.