Projects

District Data Labs Project Incubator Program

A crucial part of learning data science is applying the skills you learn to real world projects. Working on interesting projects keeps you motivated to continue learning and helps you sharpen your skills. Working in teams helps you learn from the different experiences of others and generate new ideas about learning avenues you can pursue in the future. That's why District Data Labs is starting a virtual incubator program for data science projects!

Data Science Project Incubator

 

The incubator program is:

  • Free (no cost to you)
  • Part-time (you can work on projects in your spare time)
  • Virtual (you don't need to be located in the DC area)

 

The first class of the incubator is scheduled to run from May through October 2014.  This is a great way to learn by working on a project with other people and even potentially sharing in the rewards of a project that ends up being commercially viable.

For more info and to apply, you can go to http://bit.ly/1dqp11k.

Applications end soon, so get yours in today!

Smithsonian American Art Museum Hackathon Day 2!

Hackathon Day 2 - The sequel!!

 

Welcome back!

Let me just start off by saying thank you SMITHSONIAN AMERICAN ART MUSEUM for putting on the best hackathon ever! Granted, it was my first hackathon, but they have set the bar very high. In case you missed it, check out part 1 of this two part post.

Day 2 commenced with breakfast at 10 am and a continuation of hacking and developing our 2-minute videos and 5-minute presentations to meet the 4:00 pm deadline. You could clearly see many of us had gotten little sleep despite the fact that we got to go home. We all basically went home and continued to ruminate over our beautifully designed ideas.

The buzz in the room began to grow as the deadline neared and everyone started escaping into little nooks and crannies to record their 2-minute video pitches.

At 4:30 pm the presentations began, and they were incredible. The link to all the videos will be posted shortly. Here is a quick summary of what the hackers accomplished in less than 48 hours:

1. Team Once Upon a Time: A dad and his kids got together and created an incredible video of an interactive museum experience where you would basically tap on items in the museum and get information on them via online sources.

2. Team Muneeb and Sohaib: Two brothers with a mission! Sohaib developed a game simulation of the museum where you could pick a character and immerse yourself in a fantastical experience at the Luce Center. Muneeb developed a phone application where you could call a number and get information on that art piece. It was like 311 for art!

3. Team Kiosk to the Future: A beautifully designed mobile application where you could take a picture of the art, get information about it, share with your friends, and all other sorts of neat tricks. It was fully functional on their phone, by the way.

4. Team Geosafe: A mobile application where you could learn more about the artists behind the art pieces, like where they lived, and even see those locations on a map in real time. It had a really nice “touch-screen” feel and view. This was a one-man team!

5. Team Patrick: A great prototype of having a number of very interactive kiosks in the art museum so you could get information on the art you were looking at in that moment and find related artworks nearby. You could walk around and feel welcomed by the space and always re-orient yourself at the nearest kiosk. Another one-man team.

6. Team Megatherium: A great interactive game idea, where you would select tiles on a screen about art, play a game with questions about those art pieces, compete with friends, record your score, etc. The best part was that the game didn’t end at the museum: you could go online at home, continue to play, and earn badges. It kept you connected beyond your visit. They also planned a Phase 2 to expand to other museums.

7. Team Diego: Another great game where you would learn about art and also tag it with the term you thought best described it. These tags would be fed back into an algorithm to help better label the art pieces in the database as well. *Diego was not part of the teams to be judged since he is a fellow and helped develop the API. He was also one of the judges.

8. Team Art Pass: They developed a card with a QR code that you would walk around with to interact with the space. You would scan it with a mobile device, log in, and also scan it at art pieces. It would record all your activity and save it to your profile. They also mentioned expanding it to other museums, so you could have all your experiences recorded in one place. They also printed a mock-up of the card with the QR code!

9. Team Back Left: Our team developed a website where you could learn about art, play games, see exactly where an art piece was located via Google Maps (regardless of where you were), create your own art with a word tag cloud that you could print at home or at the museum, and more! We wanted to make sure you always felt you were in the museum regardless of where you physically were.

The judges deliberated for half an hour and each came back with reasons why all of the ideas were great. Prizes were awarded in the following categories:

Best use of the API: Team Geosafe

Most Franchise-able: Team Art Pass

Most unexpected: Team Muneeb, the visual arts museum game

People's Choice: Team Once Upon a Time

Runner Up: Team Kiosk to the Future

Absolute Favorite: Team Back Left!

At the end we all took a great communal photo, and yes, we managed to crowdsource a solution for fitting 30 people into one photo ;).

We then had the opportunity (which we all took) to videotape short 2-minute pitches on why we think this API should stay open and available for developers to use. I think all of our faces said WHY NOT!?

There was some interacting, mingling and more ideation until the museum started pushing us out. The day ended at around 7pm.

So there you have it, a magnificent time to make friends, help one of our Smithsonian institutions and refine our hacking abilities.

PLEASE COMMENT HERE IF YOU THINK THE API SHOULD BE OPEN AND WHAT IDEAS YOU MAY HAVE IF YOU COULD USE IT OR OTHER SMITHSONIAN INSTITUTIONS WHERE YOU WOULD LIKE TO SEE THIS HAPPEN.

Until another data adventure!

Signing off,

‘Drea

 

 

 

Selling Data Science: Validation

We are all familiar with the phrase "We cannot see the forest for the trees," and this certainly applies to us as data scientists.  We can become so involved with what we're doing, what we're building, and the details of our work that we don't know what our work looks like to other people.  Often we want others to understand just how hard it was to do what we've done, just how much work went into it, and sometimes we're vain enough to want people to know just how smart we are.

So what do we do?  How do we validate one action over another?  Do we build the trees so others can see the forest?  Must others know the details to validate what we've built, or is it enough that they can make use of our work?

We are all made equal by our limitation to 24 hours in a day, and we must choose what we listen to and what we don't, what we focus on and what we don't.  The people who make use of our work must do the same.  There's an old philosophical thought experiment: "If a tree falls in the woods and no one is around to hear it, does it make a sound?"  If we explain all the details of our work, and no one gives the time to listen, will anyone understand?  To what will people give their time?

Suppose we can successfully communicate all the challenges we faced and overcame in building our magnificent ideas (as if anyone would sit still that long); what then?  Thomas Edison is famous for saying, “I have not failed. I've just found 10,000 ways that won't work.”  But today we buy lightbulbs that work; who remembers all the details of the different ways he failed?  "It may be important for people who are studying the thermodynamic effects of electrical currents through materials."  Okay, it's important for that person to know the difference, but for the rest of us it's still not important.  We experiment, we fail, we overcome, thereby validating our work so that others don't have to.

Better to teach a man to fish than to provide for him forever, but there are an infinite number of ways to successfully fish.  Some approaches may be nuanced in their differences, but others may be so wildly different that they're unrecognizable, unbelievable, and invite incredulity.  The catch is (no pun intended) that methods are valid because they yield measurable results.

It's important to catch fish, but success is neither consistent nor guaranteed, so groups of people may fish together and share their bounty so that everyone is fed.  What if someone starts using an unrecognizable and unbelievable method of fishing?  Will the others accept this "risk" and share their fish with someone who won't use the "right" fishing technique, their technique?  Even if it works the first time, that may simply be a fluke, they say, and we certainly can't waste any more resources "risking" hungry bellies, now can we?

So does validation lie in the method or the results?  If you're going hungry you might try a new technique, or you might have faith in what's worked until the bitter end.  If a few people can catch plenty of fish for the rest, let the others experiment.  Maybe you're better at making boats, so both you and the fishermen prosper.  Perhaps there's someone else willing to share the risk because they see your vision, your combined efforts giving you both a better chance at validation.

If we go along with what others are comfortable with, they'll provide fish.  If we have enough fish for a while, we can experiment and potentially catch more fish in the long run.  Others may see the value in our experiments and provide us fish for a while until we start catching fish.  In the end you need fish, and if others aren't willing to give you fish you have to get your own fish, whatever method yields results.

Data Visualization: Hosting Your Data

We've mostly focused on creating visualizations assuming the data is readily available, but it's often said that roughly 90% of data science is reshaping the data so you can make use of it.  Many companies begin with spreadsheets, and it's understandable: people are well trained with spreadsheets, spreadsheets have nice formulas, you can link equations and files, you can use macros to create client dashboards and interfaces, and they have decent charts and reporting capabilities.  The issue with spreadsheets is that they don't scale well with your operations.  Once your files and data sheets are all linked with formulas and references, the number of interdependencies becomes such that changes have a "butterfly effect" throughout your calculations, requiring an exponentially increasing amount of time and money to verify and validate. If your data and company operations are spread out across a series of spreadsheets, how can you create a new foundation on which to build a scalable solution?

Host the Raw Data

At the recent Data Science DC meetup we were introduced to the Open Knowledge Foundation and CKAN, an open source data management system that makes data accessible by providing tools to streamline publishing, sharing, finding and using data.  Some of these tools include providing metadata on your datasets (description, revision, tags, groups, API key, etc), revision control, keyword and faceted search, and tagging.  You can host your data with CKAN's service for a monthly fee, or you can download the code from GitHub and host it on your own or perhaps with Amazon EC2.
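To give a feel for what "accessible from any computer" looks like in practice, here is a minimal sketch of querying a CKAN instance's action API with Python. It assumes the requests library and a CKAN instance of your own; the URL below is only a placeholder, and package_list and package_show are the standard CKAN action endpoints for listing datasets and fetching their metadata.

```python
# A minimal sketch of querying a CKAN instance's action API.
# CKAN_URL is an assumption -- point it at your own (or a public) CKAN instance.
import requests

CKAN_URL = "https://demo.ckan.org"  # placeholder instance

def list_datasets():
    """Return the names of all public datasets on the CKAN instance."""
    resp = requests.get(f"{CKAN_URL}/api/3/action/package_list")
    resp.raise_for_status()
    return resp.json()["result"]

def dataset_metadata(name):
    """Return the metadata (description, tags, resources, etc.) for one dataset."""
    resp = requests.get(f"{CKAN_URL}/api/3/action/package_show", params={"id": name})
    resp.raise_for_status()
    return resp.json()["result"]

if __name__ == "__main__":
    names = list_datasets()
    print(f"{len(names)} datasets found")
    if names:
        meta = dataset_metadata(names[0])
        print(meta["title"], "-", len(meta.get("resources", [])), "resources")
```

The same calls work whether CKAN is hosted for you or running on your own server, which is part of the appeal: your team's tools talk to one portal instead of a pile of shared files.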

The idea here is to maintain the privacy of your data while making it accessible from any computer, making all your data navigable and searchable from a single portal, and keeping track of your changes over time.

Transitioning

If you have a few simple spreadsheets, then you're in luck, but in many cases the spreadsheets are interdependent, and so while the ability to preserve them is certainly there, you will have to manually reconnect your documents.  The document and formula references are, in a way, code, and code always needs a human hand when transitioning between languages, so moving from spreadsheet operations to hosting your data behind an API will always be a hurdle.

When you're ready to reconnect all your documents' data references and equations, there are a lot of options; the issue is enabling your company to grow efficiently.  There are some formulas that never change, like unit conversions or calendar operations, and it may be a good idea to preserve the spreadsheets' references as database calculations.  But when calculations may change over time, when you need the system to identify patterns and help recognize insights, or when visualizations and external reports are needed for clients or internal operations, consider using a more flexible language.  If the calculations are mathematically sophisticated, use R, Python, or Julia; if they're short and simple, all you may need is Tableau; but if they need to be part of an external client portal, consider PHP or Ruby.  You can always combine these languages into a more sophisticated system, but that's an architecture for another discussion.
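As a deliberately tiny illustration of that transition, here is a sketch of what a "formula that never changes" looks like once it leaves a spreadsheet cell and becomes code. The file name and column names are assumptions for the example; the point is that the logic now lives in one tested place instead of being copied across linked workbooks.

```python
# A tiny sketch: a spreadsheet formula ("=C2*2.20462") rewritten as a reusable
# function, then applied to a table loaded from the old workbook.
# The file name and column names ("mass_kg") are assumptions for illustration.
import pandas as pd

KG_TO_LB = 2.20462  # a conversion factor that never changes

def kg_to_lb(kg: float) -> float:
    """Convert kilograms to pounds -- the kind of formula worth centralizing."""
    return kg * KG_TO_LB

def add_pounds_column(df: pd.DataFrame) -> pd.DataFrame:
    """Reproduce the old spreadsheet column in one auditable step."""
    out = df.copy()
    out["mass_lb"] = out["mass_kg"].apply(kg_to_lb)
    return out

if __name__ == "__main__":
    # Hypothetical legacy workbook being migrated off of spreadsheet formulas.
    shipments = pd.read_excel("legacy_workbook.xlsx", sheet_name="shipments")
    print(add_pounds_column(shipments).head())
```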

Enabling a Team

I had a question come up the other day as I was exploring this topic: "If all you're doing is translating the code from one language to another, what's the advantage of transitioning your data to a hosted API?"  If you're just one person it might not make sense, but it's more likely that you have a team that is relying on valid data, reliable calculations, and ready reports.  As people go about their work, each person may need to update the database or equations for everyone's benefit, and visualizations, dashboards, or reports need to be updated with lessons learned.  In short, hosting your data with an API can ultimately give your team confidence in the infrastructure, allowing them to focus on their work and creating a platform on which systems can be automated, including data visualizations.

Data Visualization: Exploring Biodiversity

When you have a few hundred years' worth of data on biological records, as the Smithsonian does, from journals to preserved specimens to field notes to sensor data, even the most diligently kept records don't perfectly align over the years, and in some cases there is outright conflicting information.  This data is important; it is our civilization's best minds giving their all to capture and record the biological diversity of our planet.  Unfortunately, as it stands today, if you or I decided we wanted to learn more, or wanted to research a specific species or subject, accessing and making sense of that data effectively becomes a career.  Earlier this year an executive order was issued which generally stated that federally funded research had to comply with certain data management rules, and the Smithsonian took that order to heart, even though it didn't necessarily apply to them directly, and has embarked on making their treasure of information more easily accessible.  This is a laudable goal, but how do we actually go about accomplishing it?  Starting with digitized information, which is a challenge in and of itself, we have a real Big Data challenge, setting the stage for data visualization.

The Smithsonian has already gone a long way in curating their biodiversity data on the Biodiversity Heritage Library (BHL) website, where you can find ever increasing sources.  However, we know this curation challenge cannot be met by simply wrapping the data in a single structure or taxonomy.  When we search and explore the BHL data we may not know precisely what we're looking for, and we don't want a scavenger hunt to ensue where we're forced to find clues and hidden secrets in hopes of reaching our national treasure; maybe the Gates family can help us out...

People see relationships in the data differently, so when we go exploring, one person may do better with a tree structure, another may prefer a classic title/subject style search, and someone else may be interested in reference types and frequencies.  Arguing over the one monolithic system that would serve everyone is akin to discussing the number of angels that fit on the head of a pin: we'll never be able to test our theories.  Our best course is to accept that we all dive into data from different perspectives, and we must therefore make available different methods of exploration.

Data Visualization DC (DVDC) is partnering with the Smithsonian's Biodiversity Heritage Library to introduce new methods of exploring this vast national data treasure.  Working with cutting-edge visualizers such as Andy Trice of Adobe, DVDC is pairing new tools with the Smithsonian's (our public) biodiversity data.  Andy's work on web standards for visualizing data with HTML5 is a key step toward making the BHL data more easily accessible, not only by creating rich immersive experiences but also by providing the means through which we all can take a bite out of this huge challenge.  We have begun with simple data sets, such as those organized by title and subject, but there are many fronts to tackle.  Thankfully the Smithsonian is providing as much of their BHL data as possible through their API, but for all that is available there is more that is yet to even be digitized.  Hopefully over time we can use data visualization to unlock this great public treasure, and hopefully knowledge of biodiversity will help us imagine greater things.
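To make "different methods of exploration" a bit more concrete, here is a minimal sketch of one preparation step behind such a visualization: taking a flat export of BHL titles and subjects and turning it into subject frequencies that a browser-based front end could consume. The file and column names are assumptions for illustration, not the actual BHL export format.

```python
# A minimal sketch: turn a flat title/subject export into subject frequencies
# that a browser-based (HTML5/D3-style) visualization could load as JSON.
# The file name and column names are assumptions for illustration.
import json
from collections import Counter

import pandas as pd

def subject_frequencies(csv_path: str) -> Counter:
    """Count how many titles are tagged with each subject."""
    df = pd.read_csv(csv_path)               # expects columns: title, subject
    df["subject"] = df["subject"].str.strip().str.lower()
    return Counter(df["subject"].dropna())

if __name__ == "__main__":
    counts = subject_frequencies("bhl_title_subjects.csv")  # hypothetical export
    top = counts.most_common(25)
    # Write a small JSON file that a front-end treemap or bar chart can read.
    with open("subject_counts.json", "w") as f:
        json.dump([{"subject": s, "count": c} for s, c in top], f, indent=2)
    print(f"Wrote {len(top)} subjects; most common: {top[0] if top else None}")
```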

 

 

"Designing for Behavior Change" - New O'Reilly Title Coming from Local Data Community Member

This guest post is by Steve Wendel, "a behavioral social scientist and simulation modeling expert" at HelloWallet who has been a strong supporter of Data Community DC since the beginning.

I'm happy to announce that O'Reilly Media will be publishing a new book I'm writing called Designing for Behavior Change. The book gives step-by-step guidance on how to design, build, and test products that help people change their daily behavior and routines. The goal is to help people take actions that they want to take, but have struggled with in the past: from exercising more (FitBit, Fuelband), to spending less on utilities (Opower, Nest), to taking control of their finances (HelloWallet).

It's a practical how-to book, aimed at designers, product folks, data scientists, entrepreneurs and others who are thinking about and building these products. It includes:

  1. Insight into how the mind makes decisions, and what that means for changing behavior.
  2. Step-by-step instructions on how to select a behavioral change strategy, and convert that strategy into a real product.
  3. Techniques and software tools for evaluating the concrete impact of a product on user behavior, and for discovering the factors that block users from changing their behavior.
  4. Lots of practical, concrete examples.

For those of you who have been reading my blog for a while, it covers some of the same topics (how to sustain engagement with a product, how to overcome existing habits, etc), but in much greater depth, and with an overall framework to make sense of the product development process.

The book draws from our experiences here at HelloWallet over the last four years, as well as from the dozens of companies I've had the joy of talking with through the Action Design DC Meetup and of mentoring at The Fort and 1776 DC.

If you'd like a free copy of the draft eBook, just sign up for my email newsletter here (http://eepurl.com/tJfgz). I'll send out draft chapters as they become ready, and blog posts on related topics.

I look forward to your feedback! I'd also love to hear about interesting case studies and examples of companies doing this work.

-Steve

Amazon EC2 versus Google Compute Engine, Part 4

It has been a while since we have talked about cloud computing benchmarks, and we wanted to bring a recent and relevant post to your attention. But, before we do, let's summarize the story of EC2 versus Google Compute Engine. Our first article compared and contrasted GCE and EC2 instance positioning. The second article benchmarked various instance types across the two providers using a series of synthetic benchmarks, including the Phoronix Test Suite and the SciMark.


And, less than a week after Data Community DC reported a 20% performance disparity, Amazon dropped the price of only the instances that compete directly with GCE by, you guessed it, 20%. Coincidence?

Our third article continued the performance evaluation and used the Java version of the NAS Performance Benchmarks to explore single and multi-threaded computational performance. Since then, GCE prices have dropped 4% across the board to which Amazon quickly responded with their own price drop.

Whereas we looked at number-crunching capability between the two services, this post and the reported set of benchmarks from Sebastian Stadil of Scalr compare and contrast many other aspects of the two services, including:

  • instance boot times
  • ephemeral storage read and write performance
  • inter-region network bandwidth and latency

My anecdotal experience with GCE confirms that GCE instance boot times are radically faster than EC2's and that GCE's API was far easier to use (very similar in spirit to MIT's StarCluster cluster management package for EC2).

In general, this article complements our findings nicely, making a strong case to at least test out Google Compute Engine if you are considering EC2 or are a long-time user of Amazon Web Services.

You can find the article in question here: By the numbers: How Google Compute Engine stacks up to Amazon EC2 — Tech News and Analysis

Applying Aggregated Government Data - Insights into a Data-focused Company

The following post is a review of Enigma Technology, Inc., their VC-funded open source government data aggregation platform, and how that platform may be utilized in different business applications.  Enjoy!

Approach

Aggregation of previously unconnected data sources is a new industry in our hyper-connected world.  Enigma has focused on aggregating open source US Government data, and the question is, “What is possible with this new technology?”  Given only the information from the website, this paper explores Enigma’s decision support approach in their three ‘Feature’ sections: Data Sources, Discover, and Analyze.

Data Sources

The technology to gather data from the open web, either directly or by scraping websites, has reached maturity, and as a result it is simply a bureaucratic process to focus aggregation on the specific government industries they highlight (aircraft, lobbying, financial, spending, and patent).  Enigma focuses on data augmentation, API access, and custom data, which is another way of saying, “We can provide standard insights or give you easy access, and we can apply these principles to whatever data sets you have in mind.”  This business model is another 21st-century standard: a standard charge for self-service data applications, and consulting on private data applications.

Discover (Data)

A primary feature of Government is its siloed approach to data, a classic example being the sharing of intelligence information following 9/11.  Juxtaposed data always produces new correlations between the data sets and thereby new insights, allowing for exploration previously impossible or impractical.  Combined with a powerful UI and self-service tools, Enigma is seeking to empower its users to recognize what’s most valuable to them, as opposed to providing any one ‘Right’ answer, an approach broadly adopted in the software as a service (SaaS) industry and widely applied to Government data.

Analyze (Data)

Enigma’s goal is to immerse its users in the data in a meaningful way, allowing them to drill down to any detail or rise above the fray with metadata created by its standard functions and operators, a la classic statistics, classification, and data ontology.  Again utilizing a powerful UI and self-service tools, Enigma plans to empower its users to focus on their data of interest (filtering) and to choose which mathematical operations to perform on the data under comparison, all with the goal of integrating previously independent Government data sets.

Business Applications

If the aggregated open source data is directly applicable to your existing business, by all means weigh the ROI of an Enigma subscription.  In most cases, however, applying this data will require more significant discussion and negotiation with potential clients, or Enigma’s self-service model will only serve as a demonstration for private or more restricted-access data.  Government organizations are being charged with integrating data across ‘silos’, and services like Enigma’s will provide comprehensive tools for the first step in this process, allowing for consultation on its application and for services specializing in the chartered goals of that Government data integration.

Using Data to Create Viral Content. [INFOGRAPHIC]

Netflix recently used their own data to drive the creation of the hit series 'House of Cards'. A similar approach can be applied to other forms of media to create content that is highly likely to become popular or even go viral through social media channels.

I examined the data set collected by Feastie Analytics to determine the features of recipes that make them the most likely to go viral on Pinterest. Some of the results are in the infographic below (originally published here). The data set includes 109,000 recipes published after Jan 1, 2011 on over 1200 different food blogs. Each recipe is tagged by its ingredients, meal course, dish title, and publication date. For each recipe, I have a recent total pin count. I also have the dimensions of a representative photo from the original blog post.

The first thing that I examined is the distribution of pins by recipe. What I found is that the distribution of pins by recipe is much like the distribution of wealth in the United States -- the top 1% have orders of magnitude more than the bottom 90%. The top 0.1% has another order of magnitude more than the top 1%! Many of the most pinned recipes are from popular blogs that regularly have highly pinned recipes, but a surprising number are from smaller or newer blogs. A single viral photo can drive hundreds of thousands of new visitors to a site that has never seen that level of traffic before.
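If you want to check this kind of concentration on your own pin counts, here is a minimal sketch of the calculation: what share of all pins goes to the top 0.1%, the top 1%, and the bottom 90% of recipes. The file and column names are assumptions; the Feastie data itself is not public.

```python
# A minimal sketch of measuring how concentrated pins are across recipes:
# the share of all pins held by the top 0.1%, top 1%, and bottom 90%.
# The file name and "pins" column are assumptions for illustration.
import pandas as pd

def pin_concentration(pins: pd.Series) -> dict:
    pins = pins.sort_values(ascending=False).reset_index(drop=True)
    total = pins.sum()
    n = len(pins)
    return {
        "top 0.1% share": pins.iloc[: max(1, n // 1000)].sum() / total,
        "top 1% share": pins.iloc[: max(1, n // 100)].sum() / total,
        "bottom 90% share": pins.iloc[int(n * 0.10):].sum() / total,
    }

if __name__ == "__main__":
    recipes = pd.read_csv("recipes.csv")  # hypothetical export with a "pins" column
    for name, share in pin_concentration(recipes["pins"]).items():
        print(f"{name}: {share:.1%}")
```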

For the purposes of this analysis, I defined "going viral" as reaching the top 5% of recipes -- having a pin count over 2964 pins. Then, I calculated how much more (or less) likely a recipe is to go viral depending on its meal course, keywords in the dish title, ingredients, day of the week, and the aspect ratio of the photo.
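For readers who want to reproduce this kind of analysis on their own data, here is a minimal sketch of the two steps just described: computing a 95th-percentile "viral" threshold and then a relative likelihood (lift) for each value of a feature. The column names are assumptions for illustration.

```python
# A minimal sketch of the analysis described above: define "viral" as the top 5%
# of recipes by pin count, then compute how much more (or less) likely each
# category is to go viral relative to the overall base rate.
# Column names ("pins", "course") are assumptions for illustration.
import pandas as pd

def viral_lift(df: pd.DataFrame, feature: str, quantile: float = 0.95) -> pd.Series:
    threshold = df["pins"].quantile(quantile)        # e.g., ~2964 pins in this data set
    df = df.assign(viral=df["pins"] > threshold)
    base_rate = df["viral"].mean()                   # ~5% by construction
    per_group = df.groupby(feature)["viral"].mean()  # viral rate per category
    return (per_group / base_rate).sort_values(ascending=False)

if __name__ == "__main__":
    recipes = pd.read_csv("recipes.csv")             # hypothetical export: pins, course, ...
    print(viral_lift(recipes, "course"))             # e.g., appetizers vs. desserts
```

A lift above 1.0 means that category goes viral more often than average; below 1.0, less often.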

Some of the results are surprising and some are expected. Many people would expect that desserts are most likely to go viral on Pinterest. But in reality, desserts are published the most but not most likely to go viral. Appetizers have the best probability of going viral, perhaps because they are published less frequently, yet are in relatively high demand. The popularity of cheese, chocolate, and other sweets in the dishes and ingredients is not surprising. What is somewhat surprising are some of the healthier ingredients such as quinoa, spinach, and black beans. The fact that Sunday is the second best day to publish is surprising, as most publishers avoid weekends. However traffic to recipe sites spikes on Sundays, so it makes sense that recipes published then have an advantage. Finally, it's no surprise that images with tall orientations are more likely to go viral on Pinterest considering how they are given more space by the Pinterest design. But now, we can put a number on just how much of an advantage portrait oriented photos have -- they are approximately twice as likely to go viral as the average photo.

Hungry yet? What other forms of content would you like to see this approach applied to?

Check back tomorrow for a tutorial on how to create an infographic with Keynote.

How to Make Your Recipe Go Viral on Pinterest

What does $100K Buy in Terms of Compute Time? GCE and EC2 square off against big iron.

Introduction

Let’s say you have $100,000 to spend this year to crunch numbers, a lot of numbers. How much compute time does that buy?

In this article, I am going to try to answer that question, comparing the outright purchase of big hardware with cloud-based alternatives including Amazon’s EC2 and Google Compute Engine. Note, for my particular computational task, each core or compute node requires 3.5GB of RAM, helping to constrain the options for this problem.

Buying Hardware

Everyone knows that buying computational muscle much larger than your typical laptop or desktop gets expensive fast. Let’s ballpark that this $100,000 will buy you a 512-core machine with 4GB of RAM per core and some storage (say 10 TB). These cores are from the AMD Opteron 6200 Series, the “Interlagos” family, and these processors claim up to 16 cores per chip (there is some dispute about this number, as each pair of cores shares a floating point unit).

I am intentionally ignoring the siting, maintenance, setup, and operational costs, plus the staff time spent getting such a system ordered and installed. For the sake of argument, let’s say we can run 512 cores, 24 hours a day, every day of the year, for $100K. Put another way, this hardware offers 4,485,120 core-hours of compute time over the year.

Google Compute Engine (GCE)

GCE is the new cloud computing kid on the block and Google has come out swinging. The company’s n1-standard-8 offers 8 virtualized cores with 30GB of RAM with or without ephemeral storage ($0.96/hour or $1.104/hour, respectively).

Assuming the n1-standard-8, each of these instances costs $23.04 per day or $8,409.60 per year. Bulk pricing may be available, but no information is available on the current website. Thus, that $100,000 offers up 11.89 n1-standard-8 instances running 24 hours a day, 365 days a year, or just over 95 cores running continuously. Put another way, $100K buys 833,333 core-hours of compute time.

Amazon Web Services

Amazon EC2 is the default cloud services provider. As this service has been around for some time, Amazon offers the most options when it comes to buying CPU time. In this case, we will look at two such pricing options: On Demand and Reserved Instances. For an almost apples to apples comparison with GCE, we will use the second-generation m3.2xlarge instance that is roughly comparable to the n1-standard-8 (although my personal benchmarks have shown that the n1-standard-8 offers a 10-20% performance advantage).

On Demand

EC2 allows users to rent virtual instances by the hour and this is the most directly comparable pricing option to GCE. As the m3.2xlarge is $1.00 per hour or $24 per day, $100,000 offers the user 11.41 m3.2xlarge instances running all year or just over 91 cores. Or, $100K buys 800,000 core-hours.

Reserved Instances

EC2 also allows consumers to buy down the hourly instance charge with a Reserved Instance or, in Amazon’s own words:

Reserved Instances give you the option to make a low, one-time payment for each instance you want to reserve and in turn receive a significant discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization Reserved Instances) that enable you to balance the amount you pay upfront with your effective hourly price.

As we would be running instances night and day, we will look at the “Heavy Utilization” Reserved Instance pricing. Each m3.2xlarge instance reserved requires an initial payment of $2,978 (down from $3,432) for a 1-year term, but then the instance is available at a much lower rate: $0.246 per hour (down from $0.282).

Thus, running one such instance all year costs:

$2,978 + ($0.246/hour x 24 hours/day x 365 days/year) = $5,132.96 (previously $3,432 + $0.282 x 24 x 365 = $5,902)

Our $100,000 thus buys 19.48 m3.2xlarge instances (up from 16.94 at the old pricing) running 24 x 7 for the year. In terms of cores, this option offers 155.9 cores running continuously for a year, a considerable jump from On Demand pricing. Put another way, $100K buys 1,365,294 core-hours (versus 1,186,980 at the old pricing).

Please note that the original and updated figures above reflect the Amazon Reserved Instance price drop that occurred on 3/5/2013.
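Since the same core-hours-per-$100K arithmetic is repeated for each option above, here is a small sketch that collects it in one place. The prices mirror the 2013 figures quoted in this post and will be stale; adjust them as the providers change their pricing.

```python
# A small sketch of the core-hours-per-$100K arithmetic used throughout this post.
# Prices are the 2013 figures quoted above and will be stale; adjust as needed.
BUDGET = 100_000
HOURS_PER_YEAR = 24 * 365  # 8,760

def core_hours_hourly(price_per_hour: float, cores_per_instance: int) -> float:
    """Core-hours a budget buys when paying a flat hourly rate."""
    return BUDGET / price_per_hour * cores_per_instance

def core_hours_reserved(upfront: float, price_per_hour: float, cores_per_instance: int) -> float:
    """Core-hours for a 1-year reserved instance: upfront fee plus discounted hourly rate."""
    cost_per_instance_year = upfront + price_per_hour * HOURS_PER_YEAR
    instances = BUDGET / cost_per_instance_year
    return instances * HOURS_PER_YEAR * cores_per_instance

options = {
    "Purchased hardware (512 cores)": 512 * HOURS_PER_YEAR,            # ~4,485,120
    "GCE n1-standard-8 ($0.96/hr)": core_hours_hourly(0.96, 8),        # ~833,333
    "EC2 m3.2xlarge On Demand ($1.00/hr)": core_hours_hourly(1.00, 8), # 800,000
    "EC2 m3.2xlarge Reserved ($2,978 + $0.246/hr)": core_hours_reserved(2978, 0.246, 8),
}

for name, hours in options.items():
    print(f"{name}: {hours:,.0f} core-hours per ${BUDGET:,}")
```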

Conclusion

In short, for continuous, consistent use over a year, purchasing hardware offers almost 4x the raw processing power (512 cores vs 155.9 cores) of the nearest cloud option. Even if we assume that our hardware pricing is wildly optimistic and cut the machine in half to 256 cores, there is still a 2x advantage. Again, I realize that I am asking for some broad approximations here, such as the equivalence between a virtualized Intel Sandy Bridge core and an actual hardware processor such as the Opteron 6200 series.


However, what the cloud offers that the hardware (and to some extent, the Amazon Reserved Heavy Utilization instances) cannot is radical flexibility. For example, the cloud could accelerate tasks by briefly scaling the number of cores. If we assume an embarrassingly parallel task that scales perfectly and takes our 512 core machine 4 weeks, there is no reason we couldn’t task 2048 cores to finish the work in a week. Of course we would spend down our budget much faster but having those results 3 weeks earlier could create massive value for the project in other ways.  Or, put another way, GCE, for $100,000, offers the best flexible deal: a bucket of 833,333 core-hours that can be poured out at whatever rate the user wants. If a steady flow of computation is desired, either the hardware purchase or the Amazon Reserved Instances offer the better deal.

(Note, DataCommunityDC is an Amazon Affiliate. Thus, if you click the image in the post and buy the book, we will make approximately $0.43 and retire to a small island).