DataBlog

Is Statistics the Least Important Part of Data Science?

There is a fascinating discussion occurring on Andrew Gelman's blog that some of our Data Community DC members might want to chime in on ... or discuss right here on our blog.

There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. . . .

The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option.

To put it another way: you can do tech without statistics but you can’t do it without coding and databases.

So, what do you think?

Some of the Data Community DC board members weigh in below:

From Sean Gonzalez

We are often quoted as saying something akin to "80% of data science is data munging," and Dr. Gelman is following the same basic idea.  Part of me wishes I could be given a little sandbox where I get to play with algorithms all day; the amount of work it takes to set up an environment where you actually get to play with math that has any bearing on your colleagues or business can be staggering.  On the other hand, this is exactly what I ran away from in 2013; I do not like having the algorithms' applications essentially decided for me ahead of time by the business plan.  There are enough tools, languages, packages, code bases, and enthusiasm that data science can become the centerpiece of the product, and some of the best examples are in marketing and product pricing.  If you have the vision, an affinity for multiple programming languages, algorithmic insight, and the pulse of your customers, you can build applications that shock people with their capabilities.  In other words, applications built around algorithms yield far more interesting and useful results.  In short, without statistics, algorithms, mathematics, etc., you're a developer, not a data scientist.

From Tony Ojeda

In my opinion, data science without math is not really data science. Data science is not just moving data around, transforming it, and storing it in different places. It's not just creating an application that uses data in some way. Data science involves the entire toolbox of skills needed to get through the process of acquiring data, cleaning it, storing it, exploring it, analyzing it, modeling it, and then taking the results of all that and creating something valuable.

This means that DBAs and software developers need to improve their math skills if they want to refer to themselves as data scientists, and statisticians and mathematicians need to improve their data architecture and coding skills if they want to refer to themselves as data scientists.

From Abhijit Dasgupta

Hilary Mason says that data science is about telling stories with data. Just data manipulation and organization can't tell stories, though it can help make the story easier to write. The story can't just come from abstract models and algorithms without setting up the data, and understanding its context. The story can't be written by just knowing the context, without being able to get your hands dirty and understand what the data is saying. It takes a combination of data handling, manipulation, and organization, algorithms and models, visualization and context to tell the story. In other words, data science is a partnership between IT and database developers (data handling, manipulation, organization), statisticians/mathematicians/computer scientists (algorithms and modeling, visualization), and the owners/caretakers/collectors of the data (context). Each part is necessary to tell the story.

A data scientist (more precisely, a "good" data scientist) must have a combination of data architecture, manipulation and coding skills, knowledge of algorithms and models (including an understanding of what algorithm is good for what types of data and contexts, i.e., the mathematics of the algorithm), and the ability to contextualize and translate the results of the coding and analyses to create the story; that story can then be further developed into something more useful and valuable.

So the math/stat types have to learn to get their hands dirty using data infrastructure and programming tools, the IT types must learn something about models, visualization and the appropriate contexts in which they have a chance at being successful, and both have to learn about the context and background of the data. After all, data and all the analysis in the world without an end in mind is just data, not a story.

From Ben Bengfort

Data science as a product does focus primarily on computation, true. Perhaps because of this, most data science products are constructed in an Agile fashion, building through iteration. Agile data science requires fast hypothesis testing across large data sets, failing early and often, and sprints toward finite goals. Although the tools are computational (particularly if you consider machine learning computation rather than statistics), hypothesis validation and analysis are essential to this methodology. Tech without statistics gives you no sense of where you are or where you're going; you're simply pounding away at a domain-specific dataset without innovating. Although the statistics gets minimized (possibly because statisticians don't necessarily have an application-oriented focus), it plays a major role in the agile data science life cycle and should be treated as every bit as essential as databases and coding.


If you would like to read more of the discussion that has already occurred, please go over to Dr. Gelman's blog.

Selling Data Science: Common Language

What do you think of when you say the word "data"?  For data scientists this means SO MANY different things, from unstructured data like natural language and web crawls to perfectly square Excel spreadsheets.  What do non-data scientists think of?  Many times we might come up with a slick line for describing what we do with data, such as "I help find meaning in data," but that doesn't help sell data science.  Language is everything, and if people don't use a word on a regular basis it will not have any meaning for them.  Many people aren't sure whether they even have data, let alone whether there's some deeper meaning, some insight, they would like to find.  As with any language barrier, the goal is to find common ground and build from there.

You can't blame people; the word "data" is about as abstract as a word can get, perhaps because it can refer to so many different things.  When discussing data casually, rather than mansplain what you believe data is or what it could be, it's much easier to find examples of data that people are familiar with, preferably data that is integral to their work.

The most common data that everyone runs into is natural language; unfortunately, this unstructured data is also some of the most difficult to work with.  In other words, people may know what it is, but showing how it is data may still be difficult.  One solution is to discuss a metric with a qualitative name; such metrics include "similarity", "diversity", or "uniqueness".  We might use the Jaro algorithm to measure similarity, counting the characters two strings have in common and the transpositions between them, and there are plenty of other algorithms.  When we discuss 'similarity' with someone new, or any other word that measures relationships in natural language, we are exploring something we both accept, and we are building common ground.
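To make 'similarity' concrete, here is a minimal sketch in R; it assumes the stringdist package (not mentioned in the original post) is installed, and the two strings are made up for illustration.

    # Jaro similarity between two strings: 1 means identical, 0 means
    # nothing in common. Setting p = 0 turns off the Winkler prefix bonus,
    # leaving the plain Jaro score based on common characters and
    # transpositions.
    library(stringdist)

    a <- "Jonathan Smith"
    b <- "Jonathon Smyth"

    stringsim(a, b, method = "jw", p = 0)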

Some data is obvious, like this neatly curated spreadsheet from the Committee to Protect Journalists.  The visualization shown on the right, part of a larger presentation I gave at Freedom Hack (hence the lack of labels), was only possible to build in short order because the data was already well organized.  If we're lucky enough to have such an easy start to a conversation, we get to bring the conversation to the next level and maybe build something interesting that all parties can appreciate; in other words, we get to "geek out" professionally.

Data Visualization: Sweave

So you're a data scientist (statistician, physicist, data miner, machine learning expert, AI guy, etc.) and you have the unenviable challenge of communicating your ideas and your work to people who have not followed you down your rabbit hole.  Typically this involves first getting the data, writing your code, honing the analysis, distilling the pertinent information and graphs/charts, then organizing it all into a presentable format (document, presentation, etc.).  Interactive visualizations are really cool, and if done right they allow the user to explore the data and the implications of your analysis on their own time.  Unfortunately, interactive visualizations require extra effort: once you're done with your analysis, you have to repurpose your functions to work within a framework such as Shiny.  For those of us who simply want a nice presentable document to compile once we've finished our work, I introduce you to Sweave.

Sweave is not built specifically for RStudio; it is built for R itself, to create LaTeX documents, but naturally RStudio has built it right in and created a great interface.  This is both a positive and a negative: it's so easy that you don't need to know precisely how the whole mechanism produces a PDF file, which becomes an issue when your document doesn't compile and you need to debug it.  Sweave is its own language of sorts, with blocks that are evaluated in an R session, blocks of plain English, and LaTeX-style markup that gives the document its formatting instructions (title, body, size, figure dimensions, etc.).  In principle it's easy to understand, but as with any new language it has its own syntax, its own unwritten rules, plenty of Google searches despite well-written tutorials, videos, and books, compatibility issues with different versions of R, and of course the occasional throwing of your hands in the air in confusion.
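For orientation, here is a minimal sketch of what a Sweave (.Rnw) file looks like, with placeholder content: R code lives between <<...>>= and @, plain LaTeX surrounds it, and \Sexpr{} drops single R values into the prose.

    \documentclass{article}
    \begin{document}

    \section{A tiny Sweave example}

    The chunk below is evaluated in R at compile time; only its output
    appears in the finished PDF.

    <<toysummary, echo=FALSE>>=
    x <- rnorm(100)   # toy data, purely for illustration
    summary(x)
    @

    The mean of \texttt{x} is \Sexpr{round(mean(x), 2)}.

    <<fig=TRUE, echo=FALSE, height=4>>=
    hist(x, main = "A figure produced at compile time")
    @

    \end{document}

In RStudio the Compile PDF button handles the rest; from a plain R session, Sweave('file.Rnw') followed by a LaTeX run does the same job.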

Once you get the hang of it you can fit your normal data analysis into the framework Sweave provides, and you end up telling a story with your work as you work.  Good comments have always been a staple of writing code, whatever the language, but there is always pushback because the comments are mixed in with the code, requiring the reader to understand the flow of the code and how the functions and scripts work together.  Mathematica is a great example of code and presentation working together, but unfortunately it is not free.

Sweave will, however, change your style: it will make you break up your analysis into digestible chunks for the target reader.  For example, when I am analyzing the details of some dataset and/or debugging my functions, I produce many more graphs than are necessary for the end reader.  Perhaps the makers of Sweave worked similarly and purposefully arranged for an R code block to include only the first plot from that block, forcing you to choose your plots carefully.  You can get around this by collecting ggplot2 figure objects from your code in a list and plotting them with the "grid.arrange()" function from the "gridExtra" package, but this is not something you might normally do.  This is how Sweave draws you into its style (don't forget to resize your figure: "<<fig=TRUE, echo=FALSE, height=10>>=" and "\setkeys{Gin}{width=0.9\textwidth}", the kittens will be fine).  The bottom line is that if you can make Sweave part of your routine, you can produce beautiful reports from your R comments and code; maybe it will even help me remember, years from now, what that set of functions buried on my computer actually does, but I can only speculate.
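Here is a rough sketch of that work-around, with made-up plots: build the ggplot2 objects in a list during the analysis, then draw several of them from one chunk with gridExtra.  It assumes ggplot2 and gridExtra are installed.

    library(ggplot2)
    library(gridExtra)

    # Collect figures in a list instead of printing them one by one
    make_plots <- function(df) {
      list(
        ggplot(df, aes(x = wt, y = mpg)) + geom_point(),
        ggplot(df, aes(x = mpg)) + geom_histogram(binwidth = 2)
      )
    }

    plots <- make_plots(mtcars)

    # Inside a single <<fig=TRUE, echo=FALSE>>= chunk, arrange them on one figure
    do.call(grid.arrange, c(plots, ncol = 2))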

Data Visualization: Teaching Data Viz

In the past few months Data Community DC (DC2) has brought together a series of great speakers for its visualization program, Data Visualization DC (DVDC), and now people are asking for more depth.  We have begun including interactive elements in our traditional lecture-style events, breaking up the format to allow people to freely ask questions of the organizers, speakers, and enthusiasts.  We have received positive feedback, but there has also been a steady request for more depth and detail.  As a result, DC2 has recently begun organizing workshops around our personal expertise, our event speakers, and data practitioners in our network.  This naturally raises the question, "How do you teach data visualization to newcomers with little to no coding experience?"

The DVDC events have focused on "data psychology", "visualization languages", and "visualization techniques".  Data psychology is about understanding people and how you can use visualizations, in the right context, in the right sequence, starting with the right data, etc., to best communicate your data insights.  Visualization languages is somewhat self-explanatory: it focuses on the increasingly efficient and easy-to-use programming languages.  Visualization techniques is all about how you can visually represent the data so that it is pleasing to the eye, suppresses irrelevant nuances, and highlights the key features of the data.

This approach has worked well and we've received very positive feedback, but a workshop is not passive learning: you have 3-4 hours to introduce a topic, explore it, be creative, regroup, discuss, and review lessons learned, and, as I was emphatically told, "We don't want you lecturing for 3-4 hours!"  Data psychology is nice, but without something outside the metaphysical world there isn't much to physically play with, and people are pretty creative to begin with.  Focusing on the visualizations by themselves becomes more of an art class as people go wild with their imaginations.  The code behind data visualization is the only way to focus the discussion, the only way to create a "sandbox" in which people can explore the rules, learn what's possible, and find something that brings their imagination to life.

This cannot be an introduction-to-R-or-Python class, as that would take up the entire session; if there is little coding experience, we need to make the code obvious and thereby secondary to creating the visualization each person is interested in.  This is easily done by curating a workspace containing an interesting dataset, which everyone downloads at the beginning of class.  From there we introduce visualization "widgets" the class can easily call from the command line and can easily mix and match with each other, or vary the input data, in order to create something unique for each person.  We can mix and match classic charts, maps, stacked charts, proportional symbol graphs, heat maps, Gantt charts, stream charts, arc charts, polar charts, etc., enough to show that there are more ways to combine data than you can shake a stick at, so the choice depends more on what you want to say than on what's possible.
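As a sketch of what such a "widget" might look like (the function names and data here are hypothetical, not the actual workshop material), a thin wrapper around ggplot2 lets students pick columns without touching the plotting code:

    library(ggplot2)

    # Hypothetical widgets: students supply a data frame and column names;
    # the plotting details stay hidden.
    bar_widget <- function(df, x, y) {
      ggplot(df, aes_string(x = x, y = y)) +
        geom_bar(stat = "identity") +
        theme_minimal()
    }

    heat_widget <- function(df, x, y, fill) {
      ggplot(df, aes_string(x = x, y = y, fill = fill)) +
        geom_tile() +
        theme_minimal()
    }

    # In class a student would only type something like:
    # bar_widget(class_data, "neighborhood", "ride_count")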

The holy grail, of course, is interactive visualizations, but those require new languages and more sophisticated passing of variables between front- and back-end controllers.  A more advanced class is the natural next step.

Data Visualization: Data Viz vs Data Avatar

What does it mean to be a Data Visualizer?  Is it mutually exclusive with being a graphics/UX designer, data scientist, or coder?  This last week I attended Action Design DC, which focused on motivating people to take action by presenting information as something familiar we could feel empathy for.  In that something, an avatar/figurine/robot, a fish tank, a smiley or frowny face, etc., we couldn't help but recognize a reflection of ourselves, because the state of that something was determined by data gathered from ourselves.  In other words, anything from our pulse to our exercise time to our body temperature to the last time we got up from our desk is used to determine the state of, say, a wooden figurine, where little activity may result in a slouching figure while reaching an activity goal results in an 'active' figurine.  I co-organize Data Visualization DC, and so for me and the people around me this presentation raised the question, "Is this data visualization?"

The term "Data Visualizer" recognizes someone who creates data visualizations, so we are really exploring what is a data visualization versus graphics design, classical graphs, or in this case shall we say "Data Avatar"?  If the previous posts can be used as evidence, to create a data visualization requires an understanding of the science and the programming language of choice, along with a certain artistic creativity.  The science is necessary to understand the data and discover the insight, a toolbox of visualization techniques helps when there is overwhelming data, and the story may have many nuances requiring sophisticated interactive capability for the user/reader to fully explore.  For example, the recent news of the NSA PRISM program's existence has created interest in the sheer number of government data requests to Google, or a few months ago the gun debate following the tragedy in Newtown Connecticut resulted in some very sophisticated interactive data visualizations to help us understand the cause and effect relationships of states' relationships with the law and guns.

A UX designer may have knowledge of the origin of the underlying data but doesn't necessarily have to; they can take what they have and focus on guiding the user/reader through the data, and they may only architect the solution.  A pure coder cares only for the elegance of the code that manipulates data and information with optimal efficiency.

Some may argue that a data scientist focuses solely on the data analytics, understanding the source of the underlying information and bringing it together to find new insights, but without good communication a good insight is like a tree falling in the woods with no one around.  Data scientists primarily communicate with data visualizations, so you could argue that all data scientists are also data visualizers, but does it work vice versa?  I argue that there is significant overlap, but they are not necessarily the same; you do not need to know how to create a spark plug in order to use it inside an engine.

The line in the sand between a data avatar and a data visualization is in compelling action versus understanding, respectively.  A data visualization is designed to communicate insights through our visual acuity, whereas a data avatar is designed to compel action by invoking an emotional connection.  In other words, from the data's point of view, one is introspective and the other extrospective.  This presumes that the data itself is its own object to understand and to interact with.

Data Visualization: New Shiny Packages & Products

Over the past few weeks and months we've been exploring the new R web application framework Shiny: how we can develop in it, what its potential is, and what's new.  As expected, web apps built with Shiny are getting very sophisticated, and thankfully they are making the development of professional products by data scientists a reality.  Where insights gained from analysis once required long and intense meetings, we now have beautiful interactive Shiny web apps, none of which would be possible without the developers who are building R interfaces for their favorite JavaScript packages.  In many ways, working through R lets us do very unique things because we're in an environment where we can easily manipulate, massage, and restructure our data.  @Winston_Cheng, one of the elite developers building these R-to-JavaScript packages, has put together a demo app that brings together Shiny with Gridster, JustGage, and Highcharts.  Shiny comes with basic functions to create side panels and main panels, but those functions are not very configurable and can look amateurish, which can have some collateral issues.  Shiny is good, however, at modularizing its input and output displays, and wouldn't it be nice if we could rearrange them as we see fit, or let anyone we build our apps for do the same?  Thank you, Winston.
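For readers who haven't seen those built-in layout functions, here is a minimal, generic sketch (not Winston's demo; the slider and plot are placeholders) of the side-panel / main-panel arrangement Shiny provides out of the box:

    library(shiny)

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(
          sliderInput("n", "Number of points:", min = 10, max = 500, value = 100)
        ),
        mainPanel(
          plotOutput("scatter")
        )
      )
    )

    server <- function(input, output) {
      output$scatter <- renderPlot({
        plot(rnorm(input$n), rnorm(input$n), pch = 19)
      })
    }

    shinyApp(ui, server)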

Gridster - JustGage - Highcharts

Let's start with the short, short version: Gridster provides the boxes that JustGage, Highcharts, or your R charts live within.  In a little more detail, Gridster is "the mythical drag-and-drop multi-column jQuery grid plugin that allows building intuitive draggable layouts from elements spanning multiple columns... made by Ducksboard", JustGage generates and animates nice, clean gauges, and Highcharts is an interactive HTML5/JavaScript charting library.  What's nice is that these are all independent, which always makes coding easier; the example provided just happens to be what Winston felt like putting together.  One could just as easily use another chart type, such as ggplot2, igraph, or geoPlot, or your own JavaScript package.

New Potential

The real value here is that representations of your data can be manipulated on the fly by your users.  We do our best to get into our users' heads and present the data in a clean, simple way, but there can be too much to simplify and still have meaning; there is a constant balance between depth and breadth.  If users can rearrange the position of different charts, you address people's ability to absorb the information quickly, and if they can replace plots with new ones, based on your estimation of what's potentially interesting, they can customize their view.  Now your Shiny app is a plug-and-play system designed to give users flexibility based on their interests.

Shiny Products

My question recently has been what kind of products are ultimately possible with this new Shiny framework.  Initially the question for me was scalability, but I quickly found it's easy to create a Shiny server on Amazon EC2 that can be scaled as usage grows.  The next question was integration, gathering external data and publishing to external databases, which has been addressed by Shiny's reactive functions, although it can still be a little convoluted.  Interactivity was very quickly addressed through the built-in use of JavaScript and the development of Shiny D3 packages.  The question really comes down to: who uses R and would want to develop products?  We are an insight apart, in that we are constantly looking for a better way to bridge the gap between what we see and what is useful, shareable, consumable, etc.  For instance, now we can show the downstream impacts of a component's reliability in manufacturing, correlations between consumer purchasing behaviors, or real-time resource needs for a dispatch center.  These Shiny apps are like the copper wire across a voltage potential: they let the power of insight flow.
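As a rough illustration of the reactive approach to pulling in external data, here is a sketch; the URL and input names are placeholders, not a real integration.

    library(shiny)

    ui <- fluidPage(
      actionButton("refresh", "Refresh data"),
      tableOutput("preview")
    )

    server <- function(input, output) {
      # Re-read the external source whenever the refresh button is pressed;
      # downstream outputs update automatically.
      external_data <- reactive({
        input$refresh
        read.csv("https://example.com/latest-data.csv")   # placeholder URL
      })

      output$preview <- renderTable({
        head(external_data())
      })
    }

    shinyApp(ui, server)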

Data Visualization: Exploring Biodiversity

When you have a few hundred years' worth of data on biological records, as the Smithsonian does, from journals to preserved specimens to field notes to sensor data, even the most diligently kept records don't perfectly align over the years, and in some cases there is outright conflicting information.  This data is important; it is our civilization's best minds giving their all to capture and record the biological diversity of our planet.  Unfortunately, as it stands today, if you or I decided we wanted to learn more, or wanted to research a specific species or subject, accessing and making sense of that data effectively becomes a career.  Earlier this year an executive order was issued which, in general terms, stated that federally funded research had to comply with certain data management rules, and the Smithsonian took that order to heart, even though it didn't necessarily apply to them directly, and has embarked on making their treasure of information more easily accessible.  This is a laudable goal, but how do we actually go about accomplishing it?  Starting with digitized information, which is a challenge in and of itself, we have a real Big Data challenge, setting the stage for data visualization.

The Smithsonian has already gone a long way in curating their biodiversity data on the Biodiversity Heritage Library (BHL) website, where you can find ever-increasing sources.  However, we know this curation challenge cannot be met by simply wrapping the data in a single structure or taxonomy.  When we search and explore the BHL data we may not know precisely what we're looking for, and we don't want a scavenger hunt to ensue where we're forced to find clues and hidden secrets in hopes of reaching our national treasure; maybe the Gates family can help us out...

People see relationships in the data differently, so when we go exploring, one person may do better with a tree structure, others prefer a classic title/subject-style search, and some of us may be interested in reference types and frequencies.  Debating which single monolithic system should serve everyone is akin to discussing how many angels fit on the head of a pin: we'll never be able to test our theories.  Our best course is to accept that we all dive into data from different perspectives, and we must therefore make available different methods of exploration.

Data Visualization DC (DVDC) is partnering with the Smithsonian's Biodiversity Heritage Library to introduce new methods of exploring their vast national data treasure.  Working with cutting-edge visualizers such as Andy Trice of Adobe, DVDC is pairing new tools with the Smithsonian's, and our public's, biodiversity data.  Andy's work on web standards for visualizing data with HTML5 is a key step toward making the BHL data more easily accessible, not only by creating rich immersive experiences but also by providing the means through which we can all take a bite out of this huge challenge.  We have begun with simple data sets, such as these sets organized by title and subject, but there are many fronts to tackle.  Thankfully the Smithsonian is providing as much of their BHL data as possible through their API, but for all that is available there is more that is yet to even be digitized.  Hopefully over time we can use data visualization to unlock this great public treasure, and hopefully knowledge of biodiversity will help us imagine greater things.


Data Visualization: rCharts

We've discussed a few times the advantages of presenting your work in R as an interactive visualization using Shiny, and the next obvious step is interactive charts.  Let me introduce rCharts and Slidify, created by Ramnath Vaidyanathan (ramnathv on GitHub).  As is increasingly the case, these tools are all about how quickly you can begin creating your own work and ideas; it's about putting the power in your hands. First things first, you'll want to install Slidify (demo) from Ramnath's GitHub account using install_github('slidify','ramnathv').  There are a number of other packages you need to install, including rChartsNYT, some of which are on his GitHub account only, and you'll need R 3.0.0.  I generally don't like going through these details, as a few Google searches will get you what you need, and it only took me about 30 minutes to update and install everything after running into a few errors and making some coffee.
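For reference, the install steps boil down to something like the sketch below, using the older two-argument install_github(repo, username) signature the post refers to; package locations may have moved since this was written.

    # Assumes a recent R (3.0.0 at the time) and an internet connection
    install.packages("devtools")
    library(devtools)

    install_github("slidify", "ramnathv")
    install_github("rCharts", "ramnathv")
    # The demo packages mentioned above (e.g. rChartsNYT) install the same
    # way once you locate the right GitHub account.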

That being said, as always the goal here is to Democratize Data, and what better way than to begin with America's pastime; although I must object, because his demo begins with the Boston Red Sox chosen in the side panel.  Fixing this grievous error is simple: go to the 'ui.R' file and change the 'selectInput' function's 'selected' argument from 'Boston Red Sox' to 'New York Yankees' ... Aaaahh, much better!
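The edit in question looks roughly like this inside ui.R; the input id, label, and choices variable are placeholders standing in for whatever the demo actually uses.

    selectInput("team",
                "Team:",
                choices  = team_names,
                selected = "New York Yankees")   # was "Boston Red Sox"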

Some details are provided in this great tutorial, whose HTML5 output you can recreate and augment with Slidify from the R Markdown file "index.Rmd" using these commands: slidify('index.Rmd'); system('open index.html').  You know everything is working when you can recreate this New York Times app using the command "runApp('app')" and both the tutorial and the interactive chart show up in your browser.

The Shiny code is very simple, with 17 and 19 SLOC for ui.R and server.R respectively, but this is primarily due to the new rCharts functions 'showOutput' and 'renderChart' written for Shiny, and the 'rPlot' function, which uses the PolyChartsJS library to create interactive visualizations.  From here you need to know how to use the tooltip arguments in JavaScript.  This small amount of code is possible because the input 'team_data', defined in global.R and pulled from the Lahman baseball database, is a data frame, and the rCharts functions let the tooltip arguments operate on the data frame's variables.  In other words, you can create a lot of work for yourself if you don't set your data up right in the first place.
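Stripped of the demo's specifics, the rCharts-with-Shiny pattern looks roughly like this; here mtcars stands in for the Lahman data, and the tooltip customization is omitted.

    library(shiny)
    library(rCharts)

    ui <- fluidPage(
      showOutput("myChart", "polycharts")
    )

    server <- function(input, output) {
      output$myChart <- renderChart({
        p <- rPlot(mpg ~ wt, data = mtcars, type = "point")
        p$set(dom = "myChart")   # tie the chart to the output element
        p
      })
    }

    shinyApp(ui, server)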

Again, the goal here is to easily create interactive presentations of your data, and rCharts with Shiny provides that, provided you can begin with organized data.  This seems completely reasonable to me; I myself have trouble speaking intelligently on a subject if I don't have the information organized in my own mind, so why would I expect R to do better?

When is data "Big"?

This article from the Harvard Business Review gives this basic prescription for telling a story with Big Data:

  1. Find the compelling narrative
  2. Think about your audience
  3. Be objective and offer balance
  4. Don't Censor
  5. Finally, Edit, Edit, Edit.

This is good advice for the telling of any story, whatever the source, big data or otherwise.  I would like to offer a refinement that may help you in moving forward and making progress, because while it's nice to have an ideal in mind, Plato's pen if you will, getting there is not an abstract exercise and our final destination is often the result of how we characterize the challenge.

"Big Data" is a nice buzz word sufficiently specific as to give structure to a conversation over cocktails yet sufficiently abstract as to allow for anyone's interpretation.  Sean Murphy, of Data Community DC, sees the 'Big' in Big Data as meaning the data has real undeniable impact on people and organization, and that impact is accelerating.  When brought up in conversation 'Big' is often taken literally, as in 'a lot' or 'large amount' of data.  I can't help but think of types of skiing and snowboarding when defining Big Data; If you're a good alpine skiier a harder slope might involve steep, bumpy, cliffs, trees, etc., if you're a snow boarder hard might involve a half-pipe, rails, and big jumps, and if you're a nordic skiier hard might involve tight spaces and quick turns.  In any case the slopes chosen are 'hard' from a certain perspective, even though each person might be an expert in their respective fields.  Big Data can be exactly that, 'Big' from your perspective and what you are trying to accomplish.  Someone who knows how to deal with Petabytes of information might not actually be that good at dealing with Gigabytes of information, the whole Confucius fly with a cannon theme, and if they can't help your business internal operations or external products, then how are they helping you?

Your data is Big because it has real impact, or you know it has the potential to do so.

Data Visualization: From Excel to ???

So you're an Excel wizard: you make the best graphs and charts Microsoft's classic product has to offer, and you expertly integrate them into your business operations.  Lately you've studied up on all the latest uses of data visualization and dashboards for taking your business to the next level, which you've tried to emulate with Excel and maybe some help from the Microsoft cloud, but it just doesn't work the way you'd like it to.  How do you transition your business from the stalwart of the late 20th century?

If you believe you can transition your business operations to incorporate data visualization, you're likely gathering raw data, maintaining basic information, and making projections, all of which eventually feed an analysis of alternatives and a final decision for internal and external clients.  In addition, it's not just about using the latest tools and techniques; your operational upgrades must actually make it easier for you and your colleagues to execute daily, otherwise it's just an academic exercise.

Google Docs

There are some advantages to using Google Docs over desktop Excel: it lives in the cloud, it has built-in sharing capabilities, and it offers a wider selection of visualization options, but my favorite is that you can reference and integrate multiple sheets from multiple users to create a multi-user network of spreadsheets.  If you have a good JavaScript programmer on hand you can even define custom functions, which can be nice when you have particularly lengthy calculations, as spreadsheet formulas tend to be cumbersome.  A step further, you could use Google Docs as a database for input to R, which can then be used to set up dashboards for the team using a Shiny Server.  Bottom line, Google makes it flexible, allowing you to pivot when necessary, but it can take time to master.
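One rough way to make that Google-Docs-as-database step concrete, assuming the sheet has been published to the web as CSV (the URL below is a placeholder):

    # Read a published Google spreadsheet straight into R
    sheet_url <- "https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv"
    operations <- read.csv(sheet_url, stringsAsFactors = FALSE)

    # From here the data frame can feed summaries, plots, or a Shiny dashboard
    summary(operations)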

Tableau Server

Tableau Server is a great option if you want to share information across all users in your organization, have access to a plethora of visualization tools, use your mobile device, set up dashboards, and keep your information secure.  The question is, how big is your organization?  Tableau Server will cost you $1000/user, with a minimum of 10 users, plus 20% yearly maintenance.  If you're a small shop, it's likely that your internal operations are straightforward and can be outlined to someone new in a good presentation, meaning that Tableau is like grabbing the whole toolbox to hang a picture; it may be more than necessary.  If you're a larger organization, Tableau may accelerate your business in ways you never thought of before.

Central Database

There are a number of database options, including Amazon Relational Database Service and Google App Engine.  There are a lot of open-source solutions using either, and it will take more time to set up, but with these approaches you're committing to a future.  As you gain more clients and gather more data, you may want to access it to discover insights you know are there from your experience gathering that data.  That access is a simple function call from R, and results you like can be set up as a dashboard using a number of different languages.  You may expand your services and hire new employees, but you will want to easily access your historical data to set up new dashboards for daily operations.  Even old dashboards may need an overhaul, and being able to access the data from a standard system, as opposed to coordinating a myriad of spreadsheets, makes pivoting much easier.
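That "simple function call from R" might look something like the sketch below, using the DBI interface with a MySQL driver; the host, credentials, and table are placeholders.

    library(DBI)
    library(RMySQL)

    con <- dbConnect(RMySQL::MySQL(),
                     host     = "your-db-host.example.com",   # placeholder
                     user     = "analyst",
                     password = Sys.getenv("DB_PASSWORD"),
                     dbname   = "operations")

    daily <- dbGetQuery(con, "SELECT client, metric, value
                              FROM daily_metrics
                              WHERE date >= CURDATE() - INTERVAL 30 DAY")

    dbDisconnect(con)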

Centralized vs Distributed

Google Docs is very much a distributed system where different users have different permissions, whereas setting up a centralized database restricts most people to using your operational system according to your prescription.  So when do you consolidate into a single system, and when do you give people the flexibility to use their data as they see fit?  It depends, of course.  It depends on the time history of that data: if the data is no good next week, then be flexible; if this is your company's gold, then make sure the data is in a safe, organized, centralized place.  You may want to allow employees to access your company's gold for their daily purposes, and classic spreadsheets may be all they need for that, but when you've made considerable effort to get the unique data you have, make sure it's in a safe place and use a database system you know you can easily come back to when necessary.