Rant

Where are the Deep Learning Courses?

This is a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup.

Why aren't there more Deep Learning talks, tutorials, or workshops in DC2?

It's been about two months since my Deep Learning talk at Artisphere for DC2. Again, thanks to the organizers (especially Harlan Harris and Sean Gonzalez) and the sponsors (especially Arlington Economic Development). We had a great turnout and a lot of good questions that night. Since the talk, and at other Meetups since then, I've been encouraged by the tidal wave of interest from teaching organizations and prospective students alike.

First some preemptive answers to the “FAQ” downstream of the talk:

  • Mary Galvin wrote a blog review of this event.
  • Yes, the slides are available.
  • Yes, corresponding audio is also available (thanks Geoff Moes).
  • A recently "reconstructed" talk combining the slides and audio is also now available!
  • Where else can I learn more about Deep Learning as a data scientist? (This may be a request to teach, a question about how to do something in Deep Learning, a question about theory, or a request to do an internship. They're all basically the same thing.)

It's this last question that's the focus of this blog post. Lots of people have asked and there are some answers out there already, but if people in the DC MSA are really interested, there could be more. At the end of this post is a survey—if you want more Deep Learning, let DC2 know what you want and together we'll figure out what we can make happen.

There actually was a class...

Aaron Schumacher and Tommy Shen invited me to come talk in April for General Assemb.ly's Data Science course. I did teach one Deep Learning module for them. That module was a slightly longer version of the talk I gave at Artisphere combined with one abbreviated "hands-on" module on unsupervised feature learning based on Stanford's tutorial. It didn't help that the tutorial was written in Octave and the class had mostly been using Python up to that point. Though feedback was generally positive for the Deep Learning module, some students wondered if they could get a little more hands-on and focus on specifics. And I empathize with them. I've spent real money on Deep Learning tutorials that I thought could have been much more useful if they were more hands-on.

Though I've appreciated all the invitations to teach courses, workshops, or lectures, except for the General Assemb.ly course, I've turned down all the invitations to teach something more on Deep Learning. This is not because the data science community here in DC is already expert in Deep Learning or because it's not worth teaching. Quite the opposite. I've not committed to teach more Deep Learning mostly because of these three reasons:

  1. There are already significant Deep Learning Tutorial resources out there,
  2. There are significant front-end investments that neophytes need to make for any workshop or tutorial to be valuable to both the class and the instructor, and
  3. I haven't found a teaching model in the DC MSA that convinces me teaching a “traditional” class in the formal sense is a better investment of time than instruction through project-based learning on research work contracted through my company.

Resources to learn Deep Learning

There are already many freely available resources to learn the theory of Deep Learning, and it's made even more accessible by many of the very lucid authors who participate in this community. My talk was cherry-picked from a number of these materials and news stories. Here are some representative links that can connect you to much of the mainstream literature and discussion in Deep Learning:

  • The tutorials link on the DeepLearning.net page
  • NYU's Deep Learning course material
  • Yann LeCun's overview of Deep Learning with Marc'Aurelio Ranzato
  • Geoff Hinton's Coursera course on Neural Networks
  • A book on Deep Learning from the Microsoft Speech Group
  • A reading list from Carnegie Mellon with student notes on many of the papers
  • A Google+ page on Deep Learning

This is the first reason I don't think it's all that valuable for DC to have more of its own Deep Learning “academic” tutorials. And by “academic” I mean tutorials that don't end with students leaving the class successfully implementing systems that learn representations to do amazing things with those learned features. I'm happy to give tutorials in that “academic” direction or shape them based on my own biases, but I doubt I'd improve on what's already out there. I've been doing machine learning for 15 years, so I have enough background to deeply appreciate Deep Learning, but I've only been doing Deep Learning for two years now. And my expertise is self-taught. And I never did a post-doc with Geoff Hinton, Yann LeCun, or Yoshua Bengio. I'm still learning, myself.

The investments to go from 0 to Deep Learning

It's a joy to teach motivated students who come equipped with all the prerequisites for really mastering a subject. That said, teaching a less equipped, uninvested, and/or unmotivated studentry is often an exercise in suffering for both students and instructor.

I believe the requests to have a Deep Learning course, tutorial, workshop or another talk are all well intentioned... Except for Sean Gonzalez—it creeps me out how much he wants a workshop. But I think most of this eager interest in tutorials overlooks just how much preparation a student needs to get a good return on their time and tuition. And if they're not getting a good return, what's the point? The last thing I want to do is give the DC2 community a tutorial on “the Past” of neural nets. Here are what I consider some practical prerequisites for folks to really get something out of a hands-on tutorial:

  • An understanding of machine learning, including
    • optimization and stochastic gradient descent
    • hyperparameter tuning
    • bagging
    • at least a passing understanding of neural nets
  • A pretty good grasp of Python, including
    • a working knowledge of how to configure different packages
    • some appreciation for Theano (warts and all)
    • a good understanding of data preparation
  • Some recent CUDA-capable NVIDIA GPU hardware* configured for your machine
    • CUDA drivers
    • NVIDIA's CUDA examples compiled

*Hardware isn't strictly a prerequisite, but I don't know how you can get an understanding of anything more than toy problems on a CPU (a minimal GPU sanity check is sketched below).
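
For concreteness, here is a minimal sketch, loosely based on the GPU test in Theano's documentation, of the kind of sanity check you'd want to run once the drivers, CUDA examples, and Theano are all configured. The THEANO_FLAGS invocation and the array size are illustrative assumptions rather than a prescribed setup; the point is simply to confirm Theano actually sees the GPU before you sink hours into a real model.

```python
# Minimal GPU sanity check, loosely based on the test script in Theano's docs.
# Run with something like:
#   THEANO_FLAGS=device=gpu,floatX=float32 python gpu_check.py
import numpy as np
import theano
import theano.tensor as T

x = theano.shared(np.random.rand(1000, 1000).astype(theano.config.floatX))
f = theano.function([], T.exp(x).sum())  # trivial op, just to exercise the device

print("device:", theano.config.device)   # expect 'gpu' if CUDA is wired up correctly
print("result:", f())                    # should run without falling back to the CPU
```

If that prints 'cpu', fix the configuration before moving on; nothing downstream gets easier.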

Resources like the ones above are great for getting a student up to speed on the “academic” issues of understanding deep learning, but that only scratches the surface. Once students know what can be done, if they’re anything like me, they want to be able to do it. And at that point, students need a pretty deep understanding of not just the theory, but of both hardware and software to really make some contributions in Deep Learning. Or even apply it to their problem.

Starting with the hardware, let's say, for sake of argument, that you work for the government or are for some other arbitrary reason forced to buy Dell hardware. You begin your journey justifying the $4000 purchase for a machine that might be semi-functional as a Deep Learning platform when there's a $2500 guideline in your department. Individual Dell workstations are like Deep Learning kryptonite, so even if someone in the n layers of approval bureaucracy somehow approved it, it's still the beginning of a frustrating story with an unhappy ending. Or let's say you build your own machine. Now add “building a machine” for a minimum of about $1500 to the prerequisites. But to really get a return in the sweet spot of those components, you probably want to spend at least $2500. Now the prerequisites include a dollar investment in addition to talent and tuition! Or let’s say you’re just going to build out your three-year-old machine you have for the new capability. Oh, you only have a 500W power supply? Lucky you! You’re going shopping! Oh, your machine has an ATI graphics card. I’m sure it’s just a little bit of glue code to repurpose CUDA calls to OpenCL calls for that hardware. Let's say you actually have an NVIDIA card (at least as recent as a GTX 580) and wanted to develop in virtual machines, so you need PCI pass-through to reach the CUDA cores. Lucky you! You have some more reading to do! Pray DenverCoder9's made a summary post in the past 11 years.

“But I run everything in the cloud on EC2,” you say! It's $0.65/hour for G2 instances. And those are the cheap GPU instances. Back of the envelope, it took a week of churning through 1.2 million training images with CUDA convnets (optimized for speed) to produce a breakthrough result. At $0.65/hour, you get maybe 20 or 30 tries doing that before it would have made more sense to have built your own machine. This isn't a crazy way to learn, but any psychological disincentive to experimentation, even $0.65/hour, seems like an unnecessary distraction. I also can't endorse the idea of “dabbling” in Deep Learning; it seems akin to “dabbling” in having children—you either make the commitment or you don't.
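
To spell out that back-of-the-envelope arithmetic, using only the figures quoted above (a week-long training run at $0.65/hour against a roughly $2,500 self-built machine):

```python
# Back-of-envelope: when does renting a GPU instance overtake building a machine?
rate_per_hour = 0.65              # EC2 G2 on-demand rate quoted above
hours_per_run = 7 * 24            # roughly a week of training per full experiment
cost_per_run = rate_per_hour * hours_per_run
print(cost_per_run)               # ~$109 per training run

build_cost = 2500                 # rough cost of a capable self-built box (see above)
print(build_cost / cost_per_run)  # ~23 runs -- the "20 or 30 tries" figure
```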

At this point, I’m not aware of an “import deeplearning” package in Python that can then fit a nine-layer sparse autoencoder with invisible CUDA calls to your GPU on 10 million images at the IPython command line. Though people are trying. That's an extreme example, but in general, you need a flexible, stable codebase to even experiment at a useful scale—and that's really what we data scientists should be doing. Toys are fine and all, but if scaling up means a qualitatively different solution, why learn the toy? And that means getting acquainted with the pros and cons of various codebases out there. Or writing your own, which... Good luck!

DC Metro-area teaching models

I start from the premise that no good teacher in the history of teaching has ever been rewarded appropriately with pay for their contributions, and that most teaching rewards are personal. I accept that premise, and it's all I really ever expect from teaching. I do, however, believe teaching is becoming even less attractive to good teachers every year at every stage of lifelong learning. Traditional post-secondary instructional models are clearly collapsing. Brick-and-mortar university degrees often trap graduates in debt at the same time the universities have already outsourced their actual teaching mission to low-cost adjunct staff and diverted funds to marketing curricula rather than teaching them. For-profit institutions are even worse. Compensation for a career in public education has never been particularly attractive, but still there have always been teachers who love to teach, are good at it, and do it anyway. However, new narrow, metric-based approaches that hold teachers responsible for the students they're dealt rather than the quality of their teaching can be demoralizing for even the most self-possessed teachers. These developments threaten to reduce that pool of quality teachers to a sparse band of marginalized die-hards. But enough of my view of “teaching” the way most people blindly suggest I do it.

The formal and informal teaching options in the DC MSA mirror these broader developments. I run a company with active contracts, and however much I might love teaching and would like to see a well-trained crop of deep learning experts in the region, the investment doesn't add up. So I continue to mentor colleagues and partners through contracted research projects.

I don't know all the models for teaching and haven't spent a lot of time understanding them, but none seem to make sense to me in terms of time invested to teach students—partly because many of them really can't get at the hardware part of the list of prerequisites above. This is my vague understanding of compensation models generally available in the online space*:

  • Udemy – produce and own a "digital asset" of the course content and sell tuition and advertising as a MOOC. I have no experience with Udemy, but some people seemed happy to have made $20,000 in a month. Thanks to Valerie at Feastie for suggesting this option.
  • Statistics.com – Typically a few thousand for four sessions that Statistics.com then sells; I believe this must be a “work for hire” copyright model for the digital asset that Statistics.com buys from the instructor. I assume it's something akin to commissioned art, that once you create, you no longer own. [Editor’s note: Statistics.com is a sponsor of Data Science DC. The arrangement that John describes is similar to our understanding too.]
  • Myngle – Sell lots of online lessons for typically less than a 30% share.

And this is my understanding of compensation models locally available in the DC MSA*:

  • General Assemb.ly – Between 15% and 20% of tuition (where tuition may be $4000/student for a semester class).
  • District Data Labs Workshop – Splits total workshop tuition or profit 50% with the instructor—which may be the best deal I've heard, but 50% is a lot to pay for advertising and logistics. [Editor's note: These are the workshops that Data Community DC runs with our partner DDL.]
  • Give a lecture – typically a one-time lecture with a modest honorarium ($100s) that may include travel. I've given these kinds of lectures at GMU and Marymount.
  • Adjunct at a local university – This is often a very labor- and commute-intensive investment and pays no better (with no benefits) than a few thousand dollars. Georgetown will pay about $200 per contact hour with students. Assuming there are three hours of out-of-classroom commitment for every hour in class, this probably ends up somewhere in the $50 per hour range. All this said, this was the suggestion of a respected entrepreneur in the DC region.
  • Tenure-track position at a local university – As an Assistant Professor, you will typically have to forego being anything but a glorified post-doc until your tenure review. And good luck convincing this crowd they need you enough to hire you with tenure.

*These are what I understand to be the approximate options and if you got a worse or better deal, please understand I might be wrong about these specific figures. I'm not wrong, though, that none of these are “market rate” for an experienced data scientist in the DC MSA.

Currently, all of my teaching happens through hands-on internships and project-based learning at my company, where I know the students (i.e., my colleagues, coworkers, subcontractors and partners) are motivated and I know they have sufficient resources to succeed (including hardware). When I “teach,” I typically do it for free, and I try hard to avoid organizations that create asymmetrical relationships with their instructors or that sell instructor time as their primary “product” while steeply discounting instructor compensation. Though polemic, Mike Selik summarized the same issue of cut-rate data science in "The End of Kaggle." I'd love to hear of a good model where students could really get the three practical prerequisites for Deep Learning, and how I could help make that happen here in DC2 short of making “teaching” my primary vocation. If there's a viable model for that out there, please let me know. If you still think you'd like to learn more about Deep Learning through DC2, please help us understand what you'd want out of it and whether you'd be able to bring your own hardware.


Moderating The World IA Data Viz Panel

This weekend was my introduction to moderating an expert panel since switching careers and becoming a data science consultant. The panel was organized by Lisa Seaman of Sapient and consisted of Andrew Turner of Esri, Amy Cesal of Sunlight Foundation, Maureen Linke of USA Today, and Brian Price of USA Today. We had roughly an hour to talk, present information, and engage the audience. You can watch the full panel discussion thanks to the excellent work of Lisa Seaman and the World IA Day organizers, but there's a bit of back-story that I think is interesting.

In the spring of 2013 Amy Cesal helped create the DVDC logo (seen on the right), so it was nice to have someone I'd already worked with. Similarly, Lisa had attended a few DVDC events and asked me to moderate because she'd enjoyed them so much. By itself it's not exactly surprising that Lisa attended some DVDC events and went with who she'd met, but common sense isn't always so common. If you Google "Data Viz" or "Data Visualization" and focus on local DC companies, experts, speakers, etc. you'll find some VERY accomplished people, but there's more to why people reach out. You have to know how people work together, and you can only know by meeting them and discussing common interests, which is a tenet of all the DC2 Programs.

Now that the sappy stuff is out of the way, I wanted to share some thoughts on running the panel. I don't know about you, but I fall asleep whenever the moderator simply asks a question and each panelist answers in turn. The first response can be interesting, but each subsequent response builds little on the one before; there's no conversation. This can go on for one, maybe two go-rounds, but any more than that and the moderator is just being lazy, doesn't know the panelists, doesn't know the material, or all of the above. A good conversation builds on each response, and if that drifts away from the original question the moderator can jump in, but resetting too much by effectively re-asking the question is robotic and defeats the purpose of having everyone together in one place.

Heading this potential disaster off at the pass, Lisa scheduled a happy hour, hopefully to give us a little liquid courage and create a natural discourse. I did my homework, read about everyone on the panel, and started imagining how everyone's expertise and experience overlapped: accuracy vs. communicating information; managing investigative teams vs. design iteration; building industry tools vs. focused and elegant interfaces; D3.js vs. Raphael. The result: a conversation, which is what we want from a panel, isn't it?

Validation: The Market vs Peer Review

How do we know our methods are valid? In academia your customers are primarily other academics, and in business they're whoever has money to pay.  However, it's a fallacy to think academics don't have to answer outside of their ivory tower or that businesses need not be concerned with expert opinion.  Academics need to pay their mortgages and buy milk, and businesses will find their products or services aren't selling if they're clearly ignoring well established principles.  Conversely, businesses don't need to push the boundaries of science for a product to be useful and academics don't need to increase market share to push the boundaries of science. So how do we strike a balance?  When do we need to seek these different forms of validation?

Academics Still Need Food

Some of us wish things were so simple, that academics need not worry about money nor businesses about peer review, but everything is not so well delineated.  Eventually the most eccentric scientist has to buy food, and every business eventually faces questions about its products/services.  Someone has to buy the result of your efforts; the only question is how many are buying and at what price.  In academia, without a grant you may just be an adjunct professor.  Professors are effectively running mission-driven organizations that are non-profit in nature, and their investors are the grant review panels, who weigh the peer review process heavily when awarding funds.

Bridge the Consumer Gap

Nothing goes exactly as planned.  Consumers may buy initially, but there will inevitably be questions, and business representatives cannot possibly address all those questions.  A small "army" may be necessary to handle the questions, and armies need clear direction, so businesses are inevitably reviewed and accepted by their peers.  The peer review helps bridge the gap between business and consumer.

Unlike academic peer review, businesses often have privacy requirements for competitive advantage that preclude the open exposure of a particular solution. In these cases, credibility is demonstrated when your solution can provide answers to clients' particular use cases. This is a practical peer review in the jungle of business.

Incestual Peer Review

You can get lost in this peer review process: each person has their thoughts which they write about, affecting others' thoughts which they write about, and so on and so forth.  A small group of people can produce mountains of "peer reviewed" papers and convince themselves of whatever they like, much like any crazy person can shut out the outside world and become lost in their own thoughts.  Gödel's incompleteness theorem can be loosely interpreted as, "For any system there will always be statements that are true, but that are unprovable within the system."  Gödel was dealing specifically with natural numbers, but we inherently understand that you cannot always look inward for the answer; you have to engage the outside world.

Snake Oil Salesman

Conversely, without peer review or accountability, cursory acceptance (i.e. consumer purchases) can give a false sense of legitimacy.  Some people will give anything a try at least once, and the snake oil salesman is the perfect example.  Traveling from town to town, the salesman brings an oil people have never seen before and claims it can remedy all of their worst ailments; however, once people use the snake oil and realize it is no more effective than a sugar pill, the salesman has already moved on to another town. Experience with a business goes a long way in legitimizing the business.

Avoid Mob Rule

These two forms of legitimacy, looking internally versus externally, peer review versus purchase, can be extremely powerful, rewarding, and a positive force in society.  Have you ever had an idea people didn't immediately accept?  Did you look to your friends and colleagues for support before sharing your idea more widely?  This is a type of peer review (though not a formal one), and something we use to develop and introduce new ideas.

The Matrix

Conversely, have you ever known something to be true but can't find the words to convince others it is true?  In The Matrix, Morpheus tells Neo, "Unfortunately, no one can be told what the Matrix is. You have to see it for yourself."  If people can be made to see what you know to be true, to experience the matrix rather than be told about it, they have more grounds to believe and accept your claims.  Sometimes in business you have to ignore the nay-sayers, build your vision, and let its adoption speak for itself.  Ironically there are those who would presume to teach birds to fly, and businesses may watch the peer review process explain how their vision works only to then be lectured on why they were successful.

Conclusion

In legitimizing our work, business or academic, when do we look to peer review and when do we look to engaging the world?  This is a self-similar process, where we may gather our own thoughts before speaking, or we may consult our friends and colleagues before publishing, but above all we must be aware of who is consuming our product and review our product accordingly before sharing it.

Selling Data Science: Validation

We are all familiar with the phrase "We cannot see the forest for the trees," and this certainly applies to us as data scientists.  We can become so involved with what we're doing, what we're building, the details of our work, that we don't know what our work looks like to other people.  Often we want others to understand just how hard it was to do what we've done, just how much work went into it, and sometimes we're vain enough to want people to know just how smart we are.

So what do we do?  How do we validate one action over another?  Do we build the trees so others can see the forest?  Must others know the details to validate what we've built, or is it enough that they can make use of our work?

We are all made equal by our limitation to 24 hours in a day, and we must choose what we listen to and what we don't, what we focus on and what we don't.  The people who make use of our work must do the same.  The old philosophical thought experiment asks, "If a tree falls in the woods and no one is around to hear it, does it make a sound?"  If we explain all the details of our work, and no one gives the time to listen, will anyone understand?  To what will people give their time?

Let's suppose that we can successfully communicate all the challenges we faced and overcame in building our magnificent ideas (as if anyone would sit still that long), what then?  Thomas Edison is famous for saying, "I have not failed. I've just found 10,000 ways that won't work."  But today we buy lightbulbs that work; who remembers all the details of the ways he failed?  "It may be important for people who are studying the thermodynamic effects of electrical currents through materials."  Ok, it's important to that person to know the difference, but for the rest of us it's still not important.  We experiment, we fail, we overcome, thereby validating our work because others don't have to.

Better to teach a man to fish than to provide for him forever, but there are an infinite number of ways to successfully fish.  Some approaches may be nuanced in their differences, but others may be so wildly different they're unrecognizable, unbelievable, and invite incredulity.  The catch is (no pun intended) that methods are valid because they yield measurable results.

It's important to catch fish, but success is neither consistent nor guaranteed, and groups of people may fish together so that after sharing their bounty everyone is fed.  What if someone starts using this unrecognizable and unbelievable method of fishing?  Will the others accept this "risk" and share their fish with those who won't use the "right" fishing technique, their technique?  Even if it works the first time, that may simply be a fluke, they say, and we certainly can't waste any more resources "risking" hungry bellies, now can we?

So does validation lie in the method or the results?  If you're going hungry you might try a new technique, or you might have faith in what's worked until the bitter end.  If a few people can catch plenty of fish for the rest, let the others experiment.  Maybe you're better at making boats, so both you and the fishermen prosper.  Perhaps there's someone else willing to share the risk because they see your vision, your combined efforts giving you both a better chance at validation.

If we go along with what others are comfortable with, they'll provide fish.  If we have enough fish for a while, we can experiment and potentially catch more fish in the long run.  Others may see the value in our experiments and provide us fish for a while until we start catching fish.  In the end you need fish, and if others aren't willing to give you fish you have to get your own fish, whatever method yields results.

Selling Data Science

Data Science is said to include statisticians, mathematicians, machine learning experts, algorithm experts, visualization ninjas, etc., and while these objective theories may be useful in recognizing necessary skills, selling our ideas is about execution.  Ironically there are plenty of sales theories and guidelines, such as SPIN selling, the iconic ABC scene from Boiler Room, or my personal favorite from Glengarry Glen Ross, that tell us what we should be doing, what questions we should be asking, how a sale should progress, and of course how to close, but none of these address the thoughts we may be wrestling with as we navigate conversations.  We don't necessarily mean to complicate things, we just become accustomed to working with other data science types, but we still must reconcile how we communicate with our peers versus people in other walks of life who are often geniuses in their own right.

We love to "Geek Out", we love exploring the root of ideas and discovering what's possible, but now we want to show people what we've discovered, what we've built, and just how useful it is.  Who should we start with?  How do we choose our network?  What events should we attend?  How do we balance business and professional relationships?  Should I continue to wear a suit?  Are flip-flops too casual?  Are startup t-shirts a uniform?  When is it appropriate to talk business?  How can I summarize my latest project?  Is that joke Ok in this context?  What is "lip service"?  What is a "slow no"?  Does being "partnered" on a project eventually lead to paying contracts?  What should I blog about?  How detailed should my proposal be?  What can I offer that has meaning to those around me?  Can we begin with something simple, or do we have to map out a complete long term solution?  Can I get along professionally with this person/team on a long term project?  Can I do everything that's being asked of me or should I pull a team together?  Do I have the proper legal infrastructure in place to form a team?  What is appropriate "in kind" support?  Is it clear what I'm offering?

The one consistent element is people: who would we like to work with and how.  This post kicks off a new series that explores these issues and helps us balance between geeking out and selling the results, between creating and sharing.

The Top 5 Questions A Data Scientist Should Ask During a Job Interview

The data science job market is hot and an incredible number of companies, large and small, are advertising a desperate need for talent.
Before jumping on the first 6-figure offer you get, it would be wise to ask the penetrating questions below to make sure that the seemingly golden opportunity in front of you isn't actually pyrite.

1) Do they have data?

You might get a good laugh at this one and probably assume that this company interviewing you must have data as they are interviewing you (a data scientist). However, you know what they say about ass-u-ming, right?

If the company tells you that the data is coming (similar to "the check is in the mail"), start asking a lot more questions. Ask if the needed data sharing agreements have been signed and even ask to see them. If not, ask what the backup plan is if (or when) the data does not arrive. Trust me, it always takes longer than everyone thinks.

To be an entrepreneur means to be an optimist at some level because otherwise no one would do something with such a low probability of success. Thus, it is pretty easy for an entrepreneur to assume that getting data will not be that hard. It will only be after months of stalled negotiations and several failures that they will give up on getting the data or, in startup parlance, pivot. In the meantime, you best figure out some other ways of being useful and creating value for your new organization.

2) Who will you report to and what is her or his background?

So, really what you are asking is: does the person who will claim me as a minion actually have experience with data and do they understand the amount of time that wrangling data can take?

If you are reporting to a Management/Executive type, this question is all-important and your very survival likely depends on the answer.

First, go read the Gervais Principle at ribbonfarm. From my experience, the ideas aren't too far off of the mark.

Second, many data-related tasks are conceptually trivial. However, these tasks can take an amount of time seemingly inversely proportional to their simplicity. Or, even worse, something that is conceptually very simple may be mathematically or statistically very challenging or require many difficult and time-consuming steps. Something like counting the number of tweets for or against a particular topic is trivial for people but less so for algorithms.

Further, as everyone knows, data wrangling on any project can consume 80% or more of the total project time and, unless that manager has worked with data, she or he may not understand this reality. The rule of thumb to never forget is that if someone does not understand something, that person will almost always underappreciate it. I swear there must be a class in American MBA programs that teaches that if you don't understand something, it must be simple and only take five minutes.

If you are reporting to a CTO-type, the situation may seem better but it actually might be worse. Software engineering and development do not equal data science. Technical experience, most of the time, does not equal data experience. Having gone through a few semesters of calculus does not a statistics background make. Hopefully, I have made my point. There is a reason we call the fields software engineering (nice and predictable) and data science (conducting experiments to test hypotheses). However, many technically-oriented people may believe they know more than they actually do.

Short version for #2 is that time expectations are important to flesh out up front and are highly dependent on your boss' background.

Third, your communications strategy will change radically depending on your boss' background. Do they want the sordid details of how you worked through the data or do they just want the bottom line impact?

3) How will my progress and/or performance be measured?

Knowing how to succeed in your new workplace is pretty important, and the expectations surrounding data science are stratospheric at the moment. Keep your eyes peeled for a good quick win that lets you demonstrate your value (and this is a question that I would directly ask).

The giant red flag here is if you will be included in an "agile" software process with data work shoehorned into short-term sprints along with the engineering or development team. Data Science is science, and many tasks will have you dealing with the dreaded unknown unknowns. In other words, you are exploring terra incognita, a process that is unpredictable at best. Managing data scientists is very different from managing software engineers.

4) How many other data scientists/practitioners will you be working with, and how many are in the company overall?

What you are trying to understand here is how data-driven (versus ego-driven) the company that you are thinking of joining is.

If the company has existed for more than a few years and has few data science or analyst types, it is probably ego driven. Put another way, decisions are made by the HiPPOs (the HIghest Paid Person's Opinions). If your data analyses are going to be used for internal decision making, this possibly puts you, the new hire, directly against the HiPPOs. Guess who will win that fight?  If you are going into this position, make sure you will be arming the HiPPO with knowledge as opposed to fighting directly against other HiPPOs.

5) Has anyone ever run analyses on the company's data?

This one is critical if you will be doing any type of retrospective analyses based on previously collected data. If you simply ask the company if they have ever looked at their data, the answer is often yes regardless of whether or not they have, as most companies don't want to admit that they haven't. Instead, ask what types of analyses the company has done on its data, whether the examination covered all of the company's data, and who (being careful to inquire about this person's background and credentials) did the work.

The reason this line of questioning is so important is that the first time you plumb the depths of a company's database, you are likely to dig up some skeletons. And by likely I really mean certainly. In fact, going through historically collected data is much like an archeological excavation. As you go further back into the database, you go through deeper layers of the history of the organization and will learn much. You might find out when they changed contractors or when they decided to stop collecting a particular field that you just happen to need. You might see when the servers went down for a day or when a particularly well hidden bug prevented database writes for a few weeks.  The important point here is that you might uncover issues that some people still present in the company would prefer not to see unearthed. My simple advice: tread lightly.

Data Visualization: The Double Edged Data Sword

Can we use data visualization, and perhaps data avatars, to build a better community?  If you've ever been part of making the rules for an organization, you may be familiar with the desire to write a rule for every scenario that may arise, to codify how the organization expects things to happen given a specific context.  For a small group this may work as you can resolve most issues with a simple conversation and only broad rules need be codified (roles, responsibilities, etc.).  However, we're also familiar with the draconian rules that arise as a result of some crazy thing that one person did.  One could argue, "What choice do we have?", because once the size of a group grows and communication becomes a combinatorial challenge (you can't talk to everyone about everything all the time) we need the rule of law.  Laws provide everyone a common reference that people can relate to their individual context and use to govern their daily conduct, but so can data.  The difference our modern world, our sensor-laden, interconnected world, has compared to the opaque world of previous generations is data and information.  We have a much greater potential to be aware of the context beyond our immediate senses, and thereby better understand the consequences of our actions, but we can't reach that potential unless we can visualize that data.

There will always be human issues, people interpret data and information differently, which is why we must "trust but verify" and when necessary use data to revisit peoples' reasoning.  Data visualization is what allows us to be aware of "the context beyond our immediate senses", and this premise holds whether you're in the moment or you're looking for a deeper understanding.

When we're in the moment we, presumably, want to make better decisions and need "decision support".  To make decisions in the moment, the information must be "readily available" or we're forced to make decisions without it.  Consider how new data might change basic decisions throughout your day; in fact, businesses are taking advantage of this, and coffee shops provide public transit information on tablets behind the counter so people know they have time for that extra latte.

Conversely, if we want to understand how events unfolded we rely on our observations, possibly the observations of those we trust, and fill the space between with our experience and assumptions about the world.  Different people make different observations, and sometimes we can piece together a more precise picture of our shared experience, but the more precise the observations the more unique the situation, and we need laws that provide "a common reference," so we write laws for the most general cases.  The consequence: an officer could write us a ticket for jaywalking even though there are no cars for miles, or we are afraid to help those around us because it may implicate us.  Bottom line: the more observations we have, the less we are forced to assume.

"Decision Support" and "Trust but Verify" are the two sides of the double edged data sword, and it's this give and take that forces gradual adoption by people, organizations, governments, etc.  Almost universally people want transparency into why events unfolded, but do not necessarily want information about them made widely available.  The most notorious of these examples involves Julian Assange of WikiLeaks and Edward Snowden of the recent NSA leaks, in these cases the US Government wants information without having their information disclosed.  Conversely many of us believe governments would run better with more transparency, but there is a proper balance.

On a less controversial and more personal level, I use financial tracking software to help me plan budgets and generally live within my means, I use GPS to help me understand my workout routines, I use event tools to plan my Data Visualization DC events, and I generally allow third party applications access to my data in exchange for new and better services.  Each of these services has some sort of data visualization and analytics that come with it, and these visualizations and analytics are essential to my personal decision support.  I enjoy the services, but it is interesting to also suddenly see advertisements about the thing I shared, tweeted, emailed, etc., earlier that day or week.  On the one hand I'm glad advertisements have become more meaningful, but what are the consequences of the double-edged data sword?

I would like to be able to revisit my personal information for innocuous reasons, to remember where I've been, what actions I took, who I talked to, who I shared what with, etc.  There are more compelling reasons too, trust but verify reasons, as the data could prove I was at work, was conducting work, met with that client, didn't waste the money, couldn't be liable for an accident, etc.; I'd like to generally have the power to confirm statements and claims I make using my personal data.

Unfortunately I don't own what's recorded about me, my personal data; typically the third party owns that data.  Theoretically I can get that information piecemeal, I can go to the service provider and manually record it, and a former justice department lawyer even suggested the wide use of FOIA, but we data scientists and visualizers know that if you can't automate the data collection and visualization then it's really not practical.  In other words, without data visualization we can't hear the proverbial tree in the woods.  So until there is a FOIA app on my smartphone, basically an API for my personal data guaranteed by the government for the people, we can't visualize "the context beyond our immediate senses" for ourselves and others, and the other edge of the data sword will always be difficult to defend against.

Data Visualization: Data Viz vs Data Avatar

What does it mean to be a Data Visualizer?  Is it mutually exclusive with being a graphics/UX designer, data scientist, or coder?  This last week I attended Action Design DC, which focused on motivating people to take action by presenting information as something familiar we could feel empathy for.  In that something, an avatar/figurine/robot, a fish tank, a smiley or frowny face, etc., we couldn't help but recognize a reflection of ourselves, because the state of that something was determined by data gathered from ourselves.  In other words, anything from our pulse to exercise time to body temperature to the last time we got up from our desk is used to determine the state of, say, a wooden figurine, where little activity may result in a slouching figure, while reaching a goal activity results in an 'active' figurine.  I co-organize Data Visualization DC, and so for me and the people around me this presentation raised the question, "Is this data visualization?"

The term "Data Visualizer" recognizes someone who creates data visualizations, so we are really exploring what is a data visualization versus graphics design, classical graphs, or in this case shall we say "Data Avatar"?  If the previous posts can be used as evidence, to create a data visualization requires an understanding of the science and the programming language of choice, along with a certain artistic creativity.  The science is necessary to understand the data and discover the insight, a toolbox of visualization techniques helps when there is overwhelming data, and the story may have many nuances requiring sophisticated interactive capability for the user/reader to fully explore.  For example, the recent news of the NSA PRISM program's existence has created interest in the sheer number of government data requests to Google, or a few months ago the gun debate following the tragedy in Newtown Connecticut resulted in some very sophisticated interactive data visualizations to help us understand the cause and effect relationships of states' relationships with the law and guns.

A UX designer may have knowledge of the origin of the underlying data but doesn't necessarily have to; they can take what they have and then focus on guiding the user/reader through the data, and they may only architect the solution.  A pure coder cares only for the elegance of the code to manipulate data and information with optimal efficiency.

Some may argue that a data scientist focuses solely on the data analytics, understanding the source of the underlying information and bringing it together to find new insights, but without good communication a good insight is like a tree falling in the woods with no one around.  Data scientists primarily communicate with data visualizations, and so you could argue that all data scientists are also data visualizers, but vice versa?  I argue that there is a significant overlap, but they are not necessarily the same; you do not need to know how to create a spark plug in order to use it inside an engine.

The line in the sand between a data avatar and a data visualization is in compelling action versus understanding, respectively.  A data visualization is designed to communicate insights through our visual acuity, whereas a data avatar is designed to compel action by invoking an emotional connection.  In other words, from the data's point of view, one is introspective and the other extrospective.  This presumes that the data itself is its own object to understand and to interact with.

Evidence from Google IO: Recommendation Engines are not MVPs

My co-editor's earlier post today about recommendation engines is simply spot on, and I wanted to add not only my strong agreement but also some more anecdotal support for her conclusions. Google IO 2013, which concluded last week, was filled with developer-oriented announcements. As a result, some of the more consumer-focused announcements and their ramifications were glossed over. Google just announced that they are now recommending books and apps and music via Google Play.  In other words, Google just launched their own recommendation engine. Bringing this point up to a few Googlers, I was told that there have been quite a few teams at Google that have attempted to build such a recommendation engine before and met with less than stellar success.  And, remember, this is Google. There have been 48 billion app installs. They are indexing the world's data. They have the Knowledge Graph. They probably have your email. Yet, with more data than anyone else on the planet, ridiculous computing super-infrastructure, and immense pools of elite talent, Google Play is just now getting its own recommendation engine in 2013. Recommendation engines are NOT minimum viable products, full stop.

Why You Should Not Build a Recommendation Engine

One does not simply build an MVP with a recommendation engine. Recommendation engines are arguably one of the trendiest uses of data science in startups today. How many new apps have you heard of that claim to "learn your tastes"? However, recommendation engines are widely misunderstood, both in terms of what is involved in building one and in terms of what problems they actually solve. A true recommender system involves some fairly hefty data science -- it's not something you can build by simply installing a plugin without writing code. With the exception of very rare cases, it is not the killer feature of your minimum viable product (MVP) that will make users flock to you -- especially since there are so many fake and poorly performing recommender systems out there.

A recommendation engine is a feature (not a product) that filters items by predicting how a user might rate them. It solves the problem of connecting your existing users with the right items in your massive inventory (i.e. tens of thousands to millions) of products or content. That means if you don't have existing users and a massive inventory, a recommendation engine does not truly solve a problem for you. If I can view the entire inventory of your e-commerce store in just a few pages, I really don't need a recommendation system to help me discover products! And if your e-commerce store has no customers, who are you building a recommendation system for? It works for Netflix and Amazon because they have untold millions of titles and products and a large existing user base who are already there to stream movies or buy products. Presenting users with recommended movies and products increases usage and sales, but doesn't create either to begin with.

There are two basic approaches to building a recommendation system: the collaborative filtering method and the content-based approach. Collaborative filtering algorithms take user ratings or other user behavior and make recommendations based on what users with similar behavior liked or purchased. For example, a widely used technique in the Netflix Prize was to use machine learning to build a model that predicts how a user would rate a film based solely on the giant sparse matrix of how 480,000 users rated 18,000 films (100 million data points in all). This approach has the advantage of not requiring an understanding of the content itself, but does require a significant amount of data, ideally millions of data points or more, on user behavior. The more data the better. With little or no data, you won't be able to make recommendations at all -- a pitfall of this approach known as the cold-start problem. This is why you cannot use this approach in a brand new MVP.
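
To make that concrete, here is a minimal sketch of one common collaborative filtering flavor, matrix factorization trained with stochastic gradient descent, on a tiny made-up rating matrix. The data, factor count, and learning rate are all invented for illustration; the Netflix-scale version of this idea does the same thing over a vastly larger sparse matrix.

```python
# Toy collaborative filtering via matrix factorization: learn a low-rank
# approximation of a user-by-item rating matrix, then read predictions for
# the unrated cells off the reconstructed matrix.
import numpy as np

ratings = np.array([        # rows = users, columns = items, 0 = "not rated"
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

n_users, n_items = ratings.shape
k = 2                                            # number of latent factors
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))     # user factor vectors
V = rng.normal(scale=0.1, size=(n_items, k))     # item factor vectors

lr, reg = 0.01, 0.05
for _ in range(5000):                            # plain SGD over observed ratings only
    for u, i in zip(*ratings.nonzero()):
        err = ratings[u, i] - U[u] @ V[i]
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_old - reg * V[i])

print(np.round(U @ V.T, 1))                      # includes estimates for the 0 cells
```

Note what even this toy needs in order to do anything: observed ratings. With an empty matrix there is nothing to factor, which is the cold-start problem in miniature.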

The content-based approach requires deep knowledge of your massive inventory of products. Each item must be profiled based on its characteristics. For a very large inventory (the only type of inventory you need a recommender system for), this process must be automatic, which can prove difficult depending on the nature of the items. A user's tastes are then deduced based on their ratings, their behavior, or directly entered information about their preferences. One pitfall of this approach is that an automated classification system could require significant algorithmic development and is likely not available as a commodity technical solution. Second, as with the collaborative filtering approach, the user needs to input information on their personal tastes, though not on the same scale. One advantage of the content-based approach is that it doesn't suffer from the cold-start problem -- even the first user can gain useful recommendations if the content is classified well. But the benefit that recommendations offer to the user must justify the effort required to offer input on personal tastes. That is, the recommendations must be excellent and the effort required to enter personal preferences must be minimal and ideally baked into the general usage. (Note that if your offering is an e-commerce store, this data entry amounts to adding a step to your funnel and could hurt sales more than it helps.) One product that has been successful with this approach is Pandora. Based on naming a single song or artist, Pandora can recommend songs that you will likely enjoy. This is because a single song title offers hundreds of points of data via the Music Genome Project. The effort required to classify every song in the Music Genome Project cannot be overstated -- it took 5 years to develop the algorithm and classify the inventory of music offered in the first launch of Pandora. Once again, this is not something you can do with a brand new MVP.
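
For contrast, here is an equally minimal sketch of the content-based idea: profile each item by hand-picked features, build a user profile from what they've liked, and rank the rest by similarity. The items, features, and numbers are invented purely for illustration; a real catalog like the Music Genome Project encodes hundreds of attributes per song, which is exactly why one named song is enough to seed recommendations.

```python
# Toy content-based recommender: rank items by cosine similarity between each
# item's feature profile and a user profile averaged from liked items.
import numpy as np

# item -> feature vector: [acoustic, electronic, vocals, tempo]
items = {
    "song_a": np.array([0.9, 0.1, 0.8, 0.3]),
    "song_b": np.array([0.2, 0.9, 0.1, 0.8]),
    "song_c": np.array([0.8, 0.2, 0.9, 0.4]),
    "song_d": np.array([0.1, 0.8, 0.3, 0.9]),
}

liked = ["song_a"]                       # even one liked item gives a usable profile
profile = np.mean([items[name] for name in liked], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = {name: cosine(profile, vec)
          for name, vec in items.items() if name not in liked}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 2))         # song_c (similar features) ranks first
```

The point of the contrast: recommendations here come from item attributes rather than other users' behavior, so there is no cold-start problem, but someone (or something) has to have profiled every item first.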

Pandora may be the only example of a successful business where the recommendation engine itself is the core product, not a feature layered onto a different core product. Unless you have the domain expertise, algorithm development skill, massive inventory, and frictionless user data entry design to build your vision of the Pandora for socks / cat toys / nail polish / etc., your recommendation system will not be the milkshake that brings all the boys to the yard. Instead, you should focus on building your core product, optimizing your e-commerce funnel, growing your user base, developing user loyalty, and growing your inventory. Then, maybe one day, when you are the next Netflix or Amazon, it will be worth it to add on a recommendation system to increase your existing usage and sales. In the meantime, you can drive serendipitous discovery simply by offering users a selection of most popular content or editor's picks.