We've mostly focused on creating visualizations assuming the data is readily available, but it's often said that ~90% of data science is reshaping the data so you can make use of it. Many companies begin with spreadsheets, and understandably so: people are well trained with spreadsheets, spreadsheets have nice formulas, you can link equations and files, you can use macros to create client dashboards and interfaces, and they have decent charting and reporting capabilities. The issue with spreadsheets is that they don't scale well with your operations. Once your files and data sheets are all linked with formulas and references, the number of interdependencies grows until changes have a "butterfly effect" throughout your calculations, requiring ever more time and money to verify and validate. If your data and company operations are spread across a series of spreadsheets, how can you create a new foundation on which to build a scalable solution?
Host the Raw Data
At the recent Data Science DC meetup we were introduced to the Open Knowledge Foundation and CKAN, an open source data management system that makes data accessible by providing tools to streamline publishing, sharing, finding, and using data. These tools include metadata on your datasets (description, revision history, tags, groups, API keys, etc.), revision control, keyword and faceted search, and tagging. You can host your data with CKAN's service for a monthly fee, or you can download the code from GitHub and host it yourself, perhaps on Amazon EC2.
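To make the "accessible from any computer" point concrete, here is a minimal sketch of querying a CKAN portal through its Action API. The host name is a hypothetical stand-in for your own instance, and the sample response is a hand-made illustration of the JSON shape CKAN returns; no network call is made.

```python
# Sketch: searching a CKAN instance's Action API for datasets.
# CKAN_HOST is hypothetical -- substitute your own portal's URL.
import json
from urllib.parse import urlencode

CKAN_HOST = "https://data.example.com"  # assumed portal address

def package_search_url(query, rows=10):
    """Build a CKAN Action API search URL (GET /api/3/action/package_search)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{CKAN_HOST}/api/3/action/package_search?{params}"

# A CKAN response is JSON with "success" and "result" keys; "result"
# holds the match count and a list of dataset ("package") dicts.
sample_response = json.loads(
    '{"success": true, "result": {"count": 1,'
    ' "results": [{"name": "quarterly-sales", "tags": [{"name": "finance"}]}]}}'
)

url = package_search_url("sales")
datasets = [pkg["name"] for pkg in sample_response["result"]["results"]]
```

The same endpoint pattern covers the rest of the catalog (listing packages, reading metadata, filtering by tag), which is what makes a single searchable portal possible.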
The idea here is to maintain the privacy of your data while making it accessible from any computer, making all your data navigable and searchable from a single portal, and keeping track of your changes over time.
If you have a few simple spreadsheets, you're in luck, but in many cases the spreadsheets are interdependent, so while the ability to preserve your data is certainly there, you will have to manually reconnect your documents. The document and formula references are, in a way, code, and code always needs a human hand when transitioning between languages, so moving from spreadsheet operations to hosting your data with an API will always be a hurdle.
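What "manually reconnecting" looks like in practice: a cross-sheet lookup formula becomes an explicit join in code. The sketch below uses two hypothetical CSV exports standing in for linked sheets, and re-creates a VLOOKUP-style reference with nothing but the standard library.

```python
# Sketch: re-creating a cross-sheet VLOOKUP after the sheets are exported
# as CSV. The file contents are hypothetical stand-ins for two linked sheets.
import csv
import io

prices_csv = "sku,unit_price\nA100,2.50\nB200,4.00\n"   # "Prices" sheet
orders_csv = "order_id,sku,qty\n1,A100,3\n2,B200,2\n"   # "Orders" sheet

# Index the lookup sheet by key, as VLOOKUP would scan its first column.
prices = {row["sku"]: float(row["unit_price"])
          for row in csv.DictReader(io.StringIO(prices_csv))}

# Recompute the dependent column that used to be a cross-file formula.
totals = {row["order_id"]: prices[row["sku"]] * int(row["qty"])
          for row in csv.DictReader(io.StringIO(orders_csv))}
```

The join key (`sku` here) has to be identified by a person reading the old formulas, which is exactly the human hand the transition always requires.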
When you're ready to reconnect all your documents' data references and equations there are plenty of options; the issue is enabling your company to grow efficiently. Some formulas never change, like unit conversions or calendar operations, and it may be a good idea to preserve those spreadsheet references as database calculations. But when calculations may change over time, when you need the system to identify patterns and surface insights, or when visualizations and external reports are needed for clients or internal operations, consider a more flexible language. If the calculations are mathematically sophisticated, use R, Python, or Julia; if they're short and simple, all you may need is Tableau; and if they need to be part of an external client portal, consider PHP or Ruby. You can always combine these languages into a more sophisticated system, but that's an architecture for another discussion.
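The split between fixed formulas and evolving ones can be sketched in a few lines. The conversion factor below is a physical constant, while the shipping rate and function names are illustrative assumptions, the kind of business rule you'd expect to revise.

```python
# Sketch: separating formulas that never change from ones that will.
# Names and the default rate are illustrative, not from any real system.

def lbs_to_kg(lbs):
    """Pounds to kilograms; a fixed conversion safe to hard-code."""
    return lbs * 0.45359237

def shipping_cost(weight_kg, rate_per_kg=1.75):
    """A business rule likely to change; the rate is an assumed default."""
    return round(weight_kg * rate_per_kg, 2)

cost = shipping_cost(lbs_to_kg(10))
```

Keeping the stable conversion and the volatile rate in separate functions means a rate change touches one definition instead of rippling through linked cells.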
Enabling a Team
A question came up the other day as I was exploring this topic: "If all you're doing is translating the code from one language to another, what's the advantage of transitioning your data to a hosted API?" If you're just one person it might not make sense, but more likely you have a team relying on valid data, reliable calculations, and ready reports. As people go about their work, each person may need to update the database or equations for everyone's benefit, and visualizations, dashboards, or reports need to be updated with lessons learned. In short, hosting your data with an API can ultimately give your team confidence in the infrastructure, allowing them to focus on their work, and it creates a platform on which systems can be automated, including data visualizations.