Selling Data Science: Common Language

What do you think of when you hear the word "data"?  For data scientists it means SO MANY different things, from unstructured data like natural language and web crawls to perfectly square Excel spreadsheets.  What do non-data scientists think of?  Many times we come up with a slick line for describing what we do with data, such as "I help find meaning in data," but that doesn't help sell data science.  Language is everything, and if people don't use a word on a regular basis it will not have any meaning for them.  Many people aren't sure whether they even have data, let alone whether there's some deeper meaning, some insight, they would like to find.  As with any language barrier, the goal is to find common ground and build from there.

You can't blame people: the word "data" is about as abstract as you can get, perhaps because it can refer to so many different things.  When discussing data casually, rather than mansplain what you believe data is or what it could be, it's much easier to find examples of data that they are familiar with, preferably ones that are integral to their work.

The most common data that everyone runs into is natural language.  Unfortunately, this unstructured data is also some of the most difficult to work with; in other words, people may know what it is, but showing how it's data may still be difficult.  One solution: discuss a metric with a qualitative name, such as "similarity", "diversity", or "uniqueness".  We may use the Jaro algorithm to measure similarity, which counts the characters two strings have in common and how many of those appear in a different order (transpositions); there are other algorithms as well.  When we discuss "similarity" with someone new, or any other word that measures relationships in natural language, we are exploring something we both accept and we are building common ground.
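
For the curious, here is a rough sketch of how a Jaro similarity score could be computed in Python.  This is my own minimal version of the textbook definition, not a library's implementation; the function name `jaro_similarity` and the example strings are just illustrations.  Two characters count as "common" only if they sit within a small window of each other, and half of the out-of-order common characters are counted as transpositions.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Return a Jaro similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only "match" if they are no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Compare the matched characters in order; each out-of-place pair
    # counts as half a transposition.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


# "MARTHA" and "MARHTA" share all six letters, with one swapped pair.
print(round(jaro_similarity("MARTHA", "MARHTA"), 3))  # 0.944
```

A single number between 0 and 1 is easy to talk about with someone who has never thought of their text as data: closer to 1 means "more similar", and the conversation can start from there.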


Some data is obvious, like this neatly curated spreadsheet from the Committee to Protect Journalists.  The visualization shown on the right was part of my larger presentation at Freedom Hack (thus the lack of labels), and it was only possible to build in short order because the data was already well organized.  If we're lucky enough to have such an easy start to a conversation, we get to bring the conversation to the next level and maybe build something interesting that all parties can appreciate; in other words, we get to "geek out" professionally.