
A Rush of Ideas: Kalev Leetaru at Data Science DC

This review of the April Data Science DC Meetup was written by Ross Mohan. Ross is a solutions architect for Five 9 Group.

Perhaps you’ve heard the phrase lately that “software is eating the world”. Well, to be successful at that, it’s going to have to do at least as good a job of eating the world’s data as do the systems of Kalev Leetaru, Georgetown/Yahoo! fellow.

Kalev Leetaru, lead investigator on GDELT and other tools, defines “world class” work, certainly in the sense of the size and scope of his data. The goal of GDELT and related systems is to stream global news and social media through multiple processing steps in as near real time as possible. The overall goal is to arrive at reliable tone (sentiment) mining and differential conflict detection, and to do so globally. It is a grand goal.

Kalev Leetaru’s talk covered several broad areas: the history of data and communication; data quality and “gotcha” issues in data sourcing and curation; the geography of Twitter; processing architecture, toolkits, and considerations; and data formatting observations. In each he had a fresh perspective or a novel idea, born of the requirement to handle enormous quantities of ‘noisy’ or ‘dirty’ data.


Leetaru observed that “the map is not the territory” in the sense that actual voting, resource, or policy boundaries as measured by various data sources may not match assigned national boundaries. He flagged this as a question of “spatial error bars” for maps.

Distinguishing global data science from established HPC-like pursuits (such as computational chemistry), Kalev Leetaru observed that we make our own bespoke toolkits: there is no single “magic toolkit” for Big Data, so we should be prepared and willing to spend time putting our toolchain together.

After talking a bit about the historical evolution and growth of data, Kalev Leetaru asked a few perspective-changing questions (some clearly relevant to intelligence agency needs): How would you find all protests? How would you locate all law books? Some of the more interesting data curation tools and resources he mentioned — and many more — can be found by the interested reader in The Oxford Guide to Library Research by Thomas Mann.

GDELT (covered further below) labels parse trees with error rates, and reaches beyond the “WHAT” of simple news media to tell us the WHY, and ‘how reliable’. One GDELT output product among many is the Daily Global Conflict Report, which covers world-leader emotional state and differential change in conflict, not absolute markers.

One recurring theme was finding ways to define and support “truth.” Kalev Leetaru decried one current trend in Big Data, the so-called “Apple Effect”: making luscious pictures from data, with more focus on appearance than on actual ground truth. One example he cited was a conclusion from a recent report on Syria which, blithely based on geotagged English-language tweets and Facebook postings, cast a skewed light on Syria’s rebels (Bzzzzzt!).


Leetaru provided one answer on how to “ground truth” data by asking “how accurate are geotagged tweets?” Such tweets are, after all, only 3% of the total. Yet he used those tweets reliably. How? By correlating their locations with electric power availability (r = .89). He also talked about how to handle emoticons, irony, sarcasm, and other affective language, cautioning analysts to think beyond blindly plugging data into pictures.
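The ground-truthing idea above — validating a noisy signal by correlating it against an independent physical measure — can be sketched in a few lines. The data below are invented for illustration; only the method (a plain Pearson correlation) reflects what the talk described.

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-region figures: geotagged-tweet density vs.
# electric power availability. These numbers are made up.
tweets_per_capita = [0.1, 0.4, 0.9, 1.3, 2.1, 2.4]
power_availability = [0.2, 0.5, 0.8, 0.85, 0.95, 0.99]
r = pearson_r(tweets_per_capita, power_availability)
```

A high r on real data would support treating the tweet signal as geographically trustworthy, which is the shape of the argument Leetaru made.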

Kalev Leetaru talked engagingly about the geography of Twitter, encouraging us to do more RTFD (D = data) than RTFM. Cut your own way through the forest: the valid maps have not been made yet, so be prepared to make your own. Among the challenges he cited: how to break up typical #hashtagswithnowhitespace and put them back into sentences, and how to build — and maintain — sentiment/tone dictionaries. Expect, therefore, to spend the vast majority of time on innovative projects hand-tuning the algorithms, understanding the data, and then iterating the machine. Refreshingly “hands on.”
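The hashtag problem mentioned above is a classic word-segmentation task. A minimal sketch, using standard dictionary-based dynamic programming with a tiny illustrative word list (a real system would use a large lexicon with word frequencies):

```python
# Toy lexicon; a production system would load a full dictionary.
WORDS = {"hash", "hashtags", "tags", "with", "no", "white", "whitespace", "space"}

def segment(tag):
    """Split a #hashtagwithnowhitespace into dictionary words.

    best[i] holds a word list covering tag[:i], preferring fewer words.
    Returns None if no full segmentation exists."""
    tag = tag.lstrip("#").lower()
    best = {0: []}
    for i in range(1, len(tag) + 1):
        for j in range(i):
            if j in best and tag[j:i] in WORDS:
                cand = best[j] + [tag[j:i]]
                if i not in best or len(cand) < len(best[i]):
                    best[i] = cand
    return best.get(len(tag))
```

Run on the example from the talk, `segment("#hashtagswithnowhitespace")` recovers the four underlying words.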

Scale and Tech Architecture

Kalev Leetaru turned to the scale of the data, which is now easily in the petabytes-per-day range. There is no longer any question that automation must be used and that serious machinery will be involved. Our job is to get that automation machinery doing the right thing, and if we do so, we can measure the ‘heartbeat of society.’

For a book images project (60 million images across hundreds of years) he mentioned a number of tools and file systems (but neither Gluster nor Ceph, disappointingly to this reviewer!) and delved deeply and masterfully into the question of how to clean and manage the very dirty data of “closed captioning” found in news reports. To full-text geocode and analyze half a million hours of news (from the Internet Archive), we need fast language detection and captioning-error assessment. What makes this task horrifically difficult is that POS tagging “fails catastrophically on closed captioning” and that CC quality is far worse than that of Optical Character Recognition. The standard Stanford NL Understanding toolkit is very “fragile” in this domain, one reason being that news media has an extremely high density of location references, forcing the analyst to use context to disambiguate.
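For a sense of what “fast language detection” means at this stage of the pipeline: real systems use trained character-n-gram classifiers, but a toy stopword-frequency sketch shows the shape of the approach (the word lists here are deliberately abridged and illustrative):

```python
# Abridged stopword lists; a real detector would use trained models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that"},
    "es": {"el", "la", "de", "que", "y", "en", "los"},
    "fr": {"le", "la", "de", "et", "les", "des", "un"},
}

def guess_language(text):
    """Score each language by how many of its stopwords appear."""
    tokens = text.lower().split()
    scores = {lang: sum(t in sw for t in tokens) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)
```

Even this crude scheme is fast enough to route millions of captions per hour; the hard part, as the talk stressed, is that closed-caption text is noisy enough to defeat tools tuned on clean prose.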

He covered GDELT (the Global Database of Events, Language, and Tone), which tracks human/societal behavior and beliefs at scale around the world. A system of over half a billion georeferenced rows, 58 columns wide, comprising 100,000 sources (broadcast, print, and online media) back to 1979, it relies on both human translation and Google Translate, and will soon be extended across more languages and back to the 1800s. Further, he is incorporating 21 billion words of academic literature into this model (a first!) and expects availability in Summer 2014. (Sources include JSTOR, DTIC, CIA, CVORE CiteSeerX, IA.)

GDELT’s architecture, which relies heavily on the Google Cloud and BigQuery, can stream 100,000 input observations per second. This reviewer wanted to ask him about update and delete needs and speeds, but the stream is designed to optimize ingest and query. The GDELT tools are myriad, but Perl was frequently mentioned (for text processing).

Kalev Leetaru shared some post-GDELT-construction takeaways: “it’s not all English,” and “watch out for full Unicode compliance” in your toolset, lest your lovely data processing stack SEGFAULT halfway through a load. Store data in whatever is easy to maintain and fast. Modularity is good, but performance can be an issue; watch out for XML, which bogs down processing on highly nested data, and use it for interchange more than anything else. Sharding seems “nice,” but “you can’t shard a graph.” And “RAM disk is your friend,” more so even than SSD, FusionIO, or fast SANs.
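The Unicode warning is concrete: a byte stream that is not valid UTF-8 will raise an exception mid-load if decoded strictly. A minimal defensive sketch of the idea (the choice of replacement-character handling is one option among several, such as logging and quarantining the record):

```python
def decode_robust(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for invalid bytes instead of raising."""
    return raw.decode("utf-8", errors="replace")

good = "café".encode("utf-8")
bad = good + b"\xff"   # a stray byte that is not valid UTF-8

# raw.decode("utf-8") on `bad` would raise UnicodeDecodeError partway
# through a load; errors="replace" keeps the pipeline alive and leaves
# a visible U+FFFD marker where the damage occurred.
```

Whether to replace, skip, or quarantine bad bytes is a policy decision; the point of the takeaway is to make that decision before the terabyte-scale load, not during it.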

The talk, like this blog post, ran over allotted space and time, but the talk was well worth the effort spent understanding it.

General Assembly & DC2 Scholarship

The DC2 mission statement emphasises that "Data Community DC is an organization committed to connecting and promoting the work of data professionals..." Ultimately, we see DC2 becoming a hub for data scientists interested in exploring new material, advancing their skills, collaborating, starting a business with data, mentoring others, teaching classes, changing careers, and more. Education is clearly a large part of any of these interests, and while DC2 has held a few workshops and enjoys organizational sponsorships, we knew we could do more, so we partnered with General Assembly and created a GA & DC2 scholarship specifically for members of Data Community DC.

For our first scholarship we landed on Front End Web Development and User Experience, which we naturally announced first at Data Viz DC.  How does this relate to data science?  As I was happy to argue in our DC2 blog post reply to Mr. Gelman, sometimes I would love to have a little sandbox where I get to play with algorithms all day. Then again, that is exactly what I ran away from in 2013 when I became an independent data science consultant; I don't want a business plan I'm not a part of dictating what I can play with.  Enter Web Dev and UX.  As Harlan Harris, organizer of DSDC, shows in his Venn diagram of what makes a data scientist, and as Tony Ojeda later emphasized, programming is a natural and necessary part of being a data scientist.  In other words, there's this thing called the interwebs that has more data than you can shake a stick at, and if you can't operate in that environment, then as a data scientist you're asking someone else to do that heavy lifting for you.

Over the next month we'll be choosing the winners of the GA DC2 Scholarship, and if you'd like to see any other scholarships in the future please leave your thoughts in the comments below or tweet us.

Happy Thanksgiving!

Cloud SOA Semantics and Data Science Conference

This is a guest post prepared by the SOA, Semantics, & Data Science conference organizers for the Data Community DC blog, providing introduction and context for the types of technology the conference focuses on and their applications.

Featuring the 15th SOA for E-Gov Conference:

Conference Title: Cloud: SOA, Semantics, & Data Science
Theme: The Changing Landscape of Federal Information Technology
Dates: 9/10/2013 to 9/11/2013
Location: The Waterford, Springfield, VA
Contact: Tammy Kicker
Web Site:

Federal organizations are racing to capitalize on social, mobile and cloud computing trends to provide solutions for their agency mission needs.  At the same time, there is great pressure to spend less while improving capability, service, cost and flexibility.

This event is an open knowledge exchange forum for communities of practice in Cloud, SOA, Semantics, and Data Science.  It brings together thought leaders and experts from the federal and business communities to continue the conversation around best practices in advancing SOA, semantic technology and data science within the Cloud construct.

The event builds on the successes of two previous events: the Service-Oriented Architecture (SOA) e-Government Conferences and the Department of Defense SOA and Semantic Technology Conferences.

This event is focused on SOA, Cloud Computing, Semantics Technology and Data Analytics.

SOA uses data as a service, which in turn requires dealing effectively with semantics.  Data science is used to process and analyze the data for those semantics to extract information.  Given the recent pronouncement by Dominic Sale, OMB (invited Keynote) that "all content is data", this conference is especially timely and focused.

Presenters and panelists will examine the benefits of governance frameworks and approaches Federal agencies are pursuing to increase the maturity and efficiency of their SOA, Cloud, Semantic Technology and Data Science.

The types of technology focused on, and their applications, are summarized in the table below.


| Speaker | Technology | Applications | Comments |
| --- | --- | --- | --- |
| Brand Niemann | Data Science | Data Visualization Tools (12 Leaders and Challengers); OMB Analytic Data Sets and Public Data Sets for the DC Data Science Community | Director and Senior Data Scientist, Semantic Community, and Founder of the Federal SOA Community of Practice |
| Dominic Sale (invited) | TBD | TBD | OMB Chief of Data Analytics & Reporting |
| Steve Woodward | Cloud Computing in Canada | New Agency | CEO, Cloud Perspectives |
| David S. Linthicum | Cloud and SOA Convergence | Your Enterprise | Cloud Computing Thought Leader, Executive, Consultant, Author, and Speaker |
| Denzil Wasson | Semantics | Cloud and SOA for Government | Chief Technology Officer, Everware-CBDI |
| Vendor Showcase | Multiple | Multiple | Always a Favorite at These Conferences |
| Use Cases and Pilots | Cray Graph Computer | Semantic Medline – National Library of Medicine & White House OSTP’s NITRD Federal Big Data Senior Steering WG | Discovery of Disease Cause and Effect |
| Geoffrey Charles Fox | Cyberinfrastructure | Enabling e-Government, e-Business and e-More Or Less Anything | Associate Dean for Graduate Studies & Research, Distinguished Professor of Computer Science and Informatics, Indiana University |
| Dennis Wisnosky | Semantics | Mainstream | Former DoD Business Mission Area CTO, member of the Enterprise Data Management Council |
| Michaela Iorga | Cloud Computing and Security | Government and Industry | Senior Security Technical Lead for Cloud Computing and Chair, NIST Cloud Computing Security WG |
| Use Cases and Pilots | Semantics & Other | Mission/Business Transformation Needs | Four Applications for the Department of Veterans Affairs and Other Organizations |

Your participation and suggested contributions are welcomed to continue to build and sustain this unique community of communities of practice to improve the delivery of government services in support of the US Federal Digital Government Strategy and Open Government Data Initiatives.


Brand Niemann, former Senior Enterprise Architect and Data Scientist with the US EPA, completed 30 years of federal service in 2010. Since then he has worked as a data scientist for a number of organizations, produced data science products for a large number of data sets, and published data stories for Federal Computer Week, Semantic Community and AOL/Breaking Government.

Amazon EC2 versus Google Compute Engine, Part 4

It has been a while since we talked about cloud computing benchmarks, and we wanted to bring a recent and relevant post to your attention. But, before we do, let's summarize the story of EC2 versus Google Compute Engine so far. Our first article compared and contrasted GCE and EC2 instance positioning. The second article benchmarked various instance types across the two providers using a series of synthetic benchmarks, including the Phoronix Test Suite and the SciMark.


And, less than a week after Data Community DC reported a 20% performance disparity, Amazon dropped the price of only those instances that compete directly with GCE by, you guessed it, 20%. Coincidence?

Our third article continued the performance evaluation and used the Java version of the NAS Performance Benchmarks to explore single and multi-threaded computational performance. Since then, GCE prices have dropped 4% across the board to which Amazon quickly responded with their own price drop.

Whereas we looked at number crunching capability between the two services, this post and the reported set of benchmarks from Sebastian Stadil of Scalr compares and contrasts many other aspects of the two services including:

  • instance boot times
  • ephemeral storage read and write performance
  • inter-region network bandwidth and latency

My anecdotal experience with GCE confirms that GCE instance boot times are radically faster than EC2's and that GCE's API was far easier to use (very similar to MIT's StarCluster cluster-management package for EC2).

In general, this article complements our findings nicely, making a strong case to at least test out Google Compute Engine if you are considering EC2 or are a long-time user of Amazon Web Services.

You can find the article in question here: By the numbers: How Google Compute Engine stacks up to Amazon EC2 — Tech News and Analysis

What does $100K Buy in Terms of Compute Time? GCE and EC2 square off against big iron.


Let’s say you have $100,000 to spend this year to crunch numbers, a lot of numbers. How much compute time does that buy?

In this article, I am going to try to answer that question, comparing the outright purchase of big hardware with cloud-based alternatives including Amazon’s EC2 and Google Compute Engine. Note, for my particular computational task, each core or compute node requires 3.5GB of RAM, helping to constrain the options for this problem.

Buying Hardware

Everyone knows that buying computational muscle much larger than your typical laptop or desktop gets expensive fast. Let’s ballpark that this $100,000 buys a 512-core machine with 4GB of RAM per core and some storage (say 10 TB). These cores are from the AMD Opteron 6200 Series, the “Interlagos” family, whose processors claim up to 16 cores per chip (there is some dispute over this number, as each pair of cores shares a floating-point unit).

I am intentionally ignoring the siting, maintenance, setup, and operational costs, plus the staff time spent getting such a system ordered and installed. For the sake of argument, let’s say we can run all 512 cores 24 hours a day, every day of the year, for $100K. Put another way, this hardware offers 4,485,120 core-hours of compute time over the year.
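The core-hours figure above is straightforward to check:

```python
def annual_core_hours(cores, hours_per_day=24, days=365):
    """Core-hours delivered by a machine running around the clock."""
    return cores * hours_per_day * days

hardware = annual_core_hours(512)   # 512 cores, every hour of the year
```

This yields the 4,485,120 core-hours quoted in the text.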

Google Compute Engine (GCE)

GCE is the new cloud computing kid on the block and Google has come out swinging. The company’s n1-standard-8 offers 8 virtualized cores with 30GB of RAM, without or with ephemeral storage ($0.96/hour or $1.104/hour, respectively).

Assuming the n1-standard-8, each of these instances costs $23.04 per day or $8,409.60 per year. Bulk pricing may be available, but no information appears on the current website. Thus, that $100,000 buys 11.89 n1-standard-8 instances running 24 hours a day, 365 days a year, or just over 95 cores running continuously. Put another way, $100K buys 833,333 core-hours of compute time.
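The GCE figures above, recomputed. Prices are the ones quoted in this post (no-ephemeral-storage rate):

```python
budget = 100_000
price_per_hour = 0.96        # n1-standard-8 without ephemeral storage
cores_per_instance = 8

instance_hours = budget / price_per_hour
core_hours = instance_hours * cores_per_instance          # ~833,333
instances_all_year = budget / (price_per_hour * 24 * 365) # ~11.89
```

The same two-line pattern reproduces the On Demand EC2 numbers later in the post by swapping in the $1.00/hour m3.2xlarge rate.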

Amazon Web Services

Amazon EC2 is the default cloud services provider. As this service has been around for some time, Amazon offers the most options when it comes to buying CPU time. In this case, we will look at two such pricing options: On Demand and Reserved Instances. For an almost apples to apples comparison with GCE, we will use the second-generation m3.2xlarge instance that is roughly comparable to the n1-standard-8 (although my personal benchmarks have shown that the n1-standard-8 offers a 10-20% performance advantage).

On Demand

EC2 allows users to rent virtual instances by the hour and this is the most directly comparable pricing option to GCE. As the m3.2xlarge is $1.00 per hour or $24 per day, $100,000 offers the user 11.41 m3.2xlarge instances running all year or just over 91 cores. Or, $100K buys 800,000 core-hours.

Reserved Instances

EC2 also allows consumers to buy down the hourly instance charge with a Reserved Instance or, in Amazon’s own words:

Reserved Instances give you the option to make a low, one-time payment for each instance you want to reserve and in turn receive a significant discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization Reserved Instances) that enable you to balance the amount you pay upfront with your effective hourly price.

As we would be running instances night and day, we will look at the “Heavy Utilization” Reserved Instance pricing. Each m3.2xlarge instance reserved requires an initial payment of ~~$3432~~ $2978 for a 1-year term, but then the instance is available at a much lower rate: ~~$0.282~~ $0.246 per hour.

Thus, running one such instance all year costs:

~~$3432 + $0.282/hour x 24 hours/day x 365 days/year = $5902.32~~

$2978 + $0.246/hour x 24 hours/day x 365 days/year = $5132.96

Our $100,000 thus buys ~~16.94~~ 19.48 m3.2xlarge instances running 24 x 7 for the year. In terms of cores, this option offers 155.9 cores running continuously for a year, a considerable jump from On Demand pricing. Put another way, $100K buys ~~1,186,980~~ 1,365,294 core-hours.
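The Reserved Instance math above, recomputed at the post-3/5/2013 prices:

```python
upfront = 2978.0     # 1-year Heavy Utilization term, m3.2xlarge
hourly = 0.246
cores_per_instance = 8
budget = 100_000

annual_cost = upfront + hourly * 24 * 365        # ~$5,132.96 per instance-year
instances = budget / annual_cost                 # ~19.48 instances
core_hours = instances * cores_per_instance * 24 * 365   # ~1,365,294
```

Note how the large upfront payment is what buys down the hourly rate: the break-even against On Demand comes only with heavy, sustained utilization, which is exactly the scenario assumed here.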

Please note that the struck-out figures reflect the Amazon Reserved Instance price drop that occurred 3/5/2013.


In short, for continuous, consistent use over a year, purchasing hardware offers more than 3x the raw processing power (512 cores vs. 155.9 cores) of the nearest cloud option. Even if we assume that our hardware pricing is wildly optimistic and cut the machine in half to 256 cores, there is still a healthy advantage. Again, I realize that I am asking for some broad approximations here, such as the equivalence between a virtualized Intel Sandy Bridge core and an actual hardware processor such as the Opteron 6200 series.


However, what the cloud offers that the hardware (and to some extent, the Amazon Reserved Heavy Utilization instances) cannot is radical flexibility. For example, the cloud could accelerate tasks by briefly scaling the number of cores. If we assume an embarrassingly parallel task that scales perfectly and takes our 512 core machine 4 weeks, there is no reason we couldn’t task 2048 cores to finish the work in a week. Of course we would spend down our budget much faster but having those results 3 weeks earlier could create massive value for the project in other ways.  Or, put another way, GCE, for $100,000, offers the best flexible deal: a bucket of 833,333 core-hours that can be poured out at whatever rate the user wants. If a steady flow of computation is desired, either the hardware purchase or the Amazon Reserved Instances offer the better deal.

(Note, DataCommunityDC is an Amazon Affiliate. Thus, if you click the image in the post and buy the book, we will make approximately $0.43 and retire to a small island).

GCE vs EC2 Part 3: Benchmarks from Serial and Multithreaded Java Applications

Welcome to part 3 of an examination of Google Compute Engine and Amazon Elastic Compute Cloud for cluster computing.


In part 1, I looked at how similarly Google and Amazon position their instance types and the characteristics that distinguish each including cost.

In part 2, I looked at the first set of benchmarks testing the compute and memory capabilities of individual instances, learning that Amazon and Google compute units are not the same.

Since the second post, Amazon ratcheted up the level of competition by offering a 20% price drop on some instances that have exact equivalents within Google Compute Engine.

The Benchmark

In this entry, I offer up some additional individual-instance benchmarks using one of, if not the, de facto benchmarks for examining the performance of MPI clusters: the NAS Parallel Benchmarks (NAS = NASA Advanced Supercomputing). In NASA's words:

The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications in the original "pencil-and-paper" specification (NPB 1). The benchmark suite has been extended to include new benchmarks for unstructured adaptive mesh, parallel I/O, multi-zone applications, and computational grids.  Problem sizes in NPB are predefined and indicated as different classes. Reference implementations of NPB are available in commonly-used programming models like MPI and OpenMP (NPB 2 and NPB 3).

Note that you do have to jump through some hoops to sign up and download the benchmark source code but the process only takes a few minutes.

The NAS benchmark is up to version 3.3.1. However, a slew of problems compiling this latest version prompted me to fall back to version 3.0, which contains the most recent Java port of the benchmark. The Java port compiled easily, and since I am interested in Java application performance for a current research project, this was fine by me; I still hope to go back and run the full NAS benchmark suite on a cluster.

Note that the Java version is sufficient only to run serial and multithreaded benchmarks, exercising the capabilities of single instances rather than MPI clusters.  If you would like more information about the Java port of the NAS Benchmarks, a detailed paper is available here.

The eight benchmarks used in this test, all originally specified in NPB 1, mimic the computation and data movement of computational fluid dynamics applications.

Five Computational Kernels:

  • IS - Integer Sort, random memory access
  • EP - Embarrassingly Parallel
  • CG - Conjugate Gradient, irregular memory access and communication
  • MG - Multi-Grid on a sequence of meshes, long- and short-distance communication, memory intensive
  • FT - discrete 3D fast Fourier Transform, all-to-all communication

Three Pseudo Applications

  • BT - Block Tri-diagonal solver
  • SP - Scalar Penta-diagonal solver
  • LU - Lower-Upper Gauss-Seidel solver

These individual benchmarks can be run for different size data sets including:

  • Class S: small, for quick test purposes
  • Class W: workstation size (a 90's workstation; now likely too small)
  • Classes A, B, C: standard test problems; ~4X size increase going from one class to the next
  • Classes D, E, F: large test problems; ~16X size increase from each of the previous classes

For these benchmarks, all tests were run on both the new Amazon m3.xlarge 2nd generation instance ($0.50 per hour as of 2/14/2013) and the Google n1-standard-4 ($0.48 per hour as of 2/14/2013).


In the results below, data classes S, W, A, and B were used.

The first plot, courtesy of ggplot2, shows serial performance across all tests and data classes S, W, and A.  Serial results were not computed for class B due to time and cost considerations.

Serial Performance Tests


Across the tested data sizes and all serial tests, the GCE n1-standard-4 took the performance crown. Taking the Amazon instance as our baseline, the GCE instance bests it by an average of:

  • 9.0% for IS;
  • 10.6% for CG;
  • 18.5% for MG;
  • 26.6% for FT;
  • 18.1% for BT;
  • 19.3% for SP;
  • and 19.1% for LU.


Multithreaded Performance Tests 


Results are somewhat more muddled for the multithreaded tests where the Amazon m3.xlarge pulls out wins, notably in the Integer Sort (IS) and the memory intensive Multi-Grid (MG) tests. Quantifying this with the Amazon instance as a baseline, we see average performance percent differences of:

  • 2.0% IS (AWS)
  • 3.5% CG (GCE)
  • 7.0% MG (AWS)
  • 18.4% FT (GCE)
  • 11.9% BT (GCE)
  • 11.1% SP (GCE)
  • 12.6% LU (GCE)

Stay tuned as more benchmarks are on their way ...

GCE vs EC2 Part 2b: The Price War Heats Up ...

It turns out that one of our earlier posts was quite prophetic. From Amazon this morning, 2/1/2013:

Price reduction for Amazon EC2

We are reducing Linux On Demand prices for First Generation Standard (M1) instances, Second Generation Standard (M3) instances, High Memory (M2) instances and High CPU (C1) instances in all regions. All prices are effective from February 1, 2013. These reductions vary by instance type and region, but typically average 10-20% price drops. For complete pricing details, please visit the Amazon EC2 pricing page.

Notice that this price drop applies only to the instance types that compete directly with GCE instances, right down to the fact that it covers only Linux instances and not Windows instances, which Google does not offer (yet).

It looks like we have a war brewing and the customer is going to win.



Google Compute Engine vs Amazon EC2 Part 2: Synthetic CPU and Memory Benchmarks


Testing Assumptions

In the last article, I examined pricing and feature differentiation between Google Compute Engine and Amazon EC2 instance types. Now, it is time to see if the last article's key assumption, that Google Compute Engine Units are equivalent to Amazon EC2 Compute Units, is correct; the results may surprise you.

The Competitors

In the Google Compute Engine corner is the n1-standard-4, both with and without ephemeral storage. In the other, relatively crowded corner are three contenders from Amazon: the second-generation m3.xlarge, the classic m1.xlarge, and the hi1.4xlarge. Per the benchmarking software:

[table id=2 /]

Note that GCE instances use a Google-compiled and modified Linux kernel, but otherwise the distribution looks like Ubuntu 12.04.  Also, all instances used identical Java versions:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (6b24-1.11.5-0ubuntu1~12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Benchmark Software

Two benchmark suites were used, focusing on CPU and memory performance: components of the Phoronix Test Suite, along with SciMark v2.1.1.1 and Java SciMark v2.0, which consist of five computational kernels: FFT, Gauss-Seidel relaxation, sparse matrix-multiply, Monte Carlo integration, and dense LU factorization.

CPU Benchmark Results

Both the Java and non-Java SciMark benchmarks tell a similar story (higher scores indicate better performance).  In all tests, the n1-standard-4 and the m3.xlarge top the charts, trading the performance crown back and forth by small margins.  The m1.xlarge trails the pack by a very significant margin.



Curiously, the GCE instance wins all Java SciMark 2.0 regular tests but the m3.xlarge wins the same suite of tests when larger data sizes, designed to exceed the CPU cache size, were used.

From the Phoronix Suite, three separate computationally intensive tests were chosen: the LAMMPS molecular dynamics simulation v1.0, the parallel BZIP2 compression 1.1.6, and a ray tracer (POV-Ray 3.6.1).  Each test measured performance in seconds. To simplify comparison and visualization, all values were normalized by the longest run time for that test (universally the m1.xlarge). Thus, values are shown not as seconds but as percentages; the lowest value is best.
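The normalization described above can be sketched as: divide each run time by the slowest run for that test, so the worst performer maps to 1.0 (100%) and lower is better. The timings below are invented for illustration.

```python
def normalize(times):
    """Normalize run times (seconds) by the slowest run; lower is better."""
    worst = max(times.values())
    return {name: t / worst for name, t in times.items()}

# Hypothetical POV-Ray run times in seconds, not the actual measurements.
povray = {"n1-standard-4": 210.0, "m3.xlarge": 225.0, "m1.xlarge": 400.0}
norm = normalize(povray)   # m1.xlarge maps to 1.0; others to fractions of it
```

This is why the m1.xlarge always sits at 100% in the plots: it is the reference denominator for every test.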

The n1-standard-4 edged out the m3.xlarge in both the POV-Ray and LAMMPS tests but was defeated by the m3.xlarge in the BZIP2 compression test. Not surprisingly, the hi1.4xlarge with 16 cores destroyed all comers in the parallel BZIP test.


Memory Benchmark Results

Memory benchmark results show that more robust metrics will be necessary to truly compare cloud computing capabilities. The line plot below shows memory speed benchmark results for the n1-standard-4 GCE instance and the m3.xlarge, m1.xlarge, and hi1.4xlarge Amazon instances.


Each benchmark was run 4 times each for both the n1-standard-4 and the m3.xlarge, and this is where the real story lies. Notice that the GCE instance shows little performance variability across tests. In contrast, the m3.xlarge comes close to competing evenly with the GCE instance in most (but not all) tests but demonstrates performance drops of up to 40%. It would seem that there is some validity to Google's claims that GCE offers more consistent performance than competitors. Interestingly, this benchmark took the longest wall clock time to run of the synthetic tests.
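One simple way to put a number on the run-to-run consistency discussed above is the coefficient of variation (standard deviation divided by mean) across the repeated runs. The four scores per instance below are invented for illustration; only the metric is the point.

```python
from statistics import pstdev, mean

def cv(scores):
    """Coefficient of variation: relative spread of repeated benchmark runs."""
    return pstdev(scores) / mean(scores)

# Hypothetical memory-benchmark scores over 4 runs (higher = faster).
gce_runs = [1000, 1005, 998, 1002]    # tight spread
aws_runs = [1020, 980, 640, 1010]     # one run drops roughly 40%
```

A low CV for the GCE-style series and a much higher one for the AWS-style series is the quantitative version of the "consistency" claim in the paragraph above.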


In terms of short-term number crunching, the m3.xlarge and the n1-standard-4 seem similarly capable, trading small wins across the numerous benchmarks. In terms of memory speed, a very different story emerges; the GCE instance holds a small but consistent lead in memory speed but a large margin of victory in consistency of performance.  For lengthy processor-intensive tasks, this differential could be significant.

As neither of these services is free, let's return to pricing as it would seem that not all compute units are the same. The GCE n1-standard-4 is either $0.48 per hour without ephemeral storage or $0.552 per hour with storage. In comparison, the m1.xlarge costs $0.520 per hour while the m3.xlarge costs $0.58 per hour and is only available without storage. Note that all prices were current as of 1/20/2013.

At these price points, the original m1.xlarge looks significantly overpriced. One must wonder when Amazon will either phase this option out or drastically alter its pricing. Even though the m3 second-generation Amazon instances were just launched 11/1/2012, the story is similar. The comparable GCE instance offers approximately the same number-crunching performance and better — more importantly, more consistent — memory performance, at a nearly 20% discount.

The question that needs to be asked is what happens if computational performance is measured not just for a few seconds or minutes, but for hours or days at a time, a common situation in high performance computing and big data. Here is where I believe that Google may have a very significant advantage and I look forward to investigating this in my next article.


Anecdotally speaking, GCE instances are ready for use **much** faster than EC2 instances in my humble experience. The time difference was quite noticeable, but I did not bother to quantify this characteristic.

GCE versus EC2: A Comparison of Instance Pricing and Characteristics, Part 1

I started using EC2 back in 2008 and just recently joined the Private Preview of Google Compute Engine (GCE). Currently, I use the cloud to build ad-hoc MPI clusters for simulating protein-protein interactions using a software package called OSPREY (Open Source Protein REdesign for You). I have been using EC2 and MIT's brilliant StarCluster to spin up clusters with up to 800 virtual cores and am now looking at GCE as a possible alternative. This series of blog posts will detail my experiences, observations, and possibly useful insights.

GCE Round 1

Before I begin, a few things. First, I am highly biased; I love both Amazon (long time Prime member) and Google (just look at my gmail address). Second, a number of blogs have offered detailed looks at GCE (here, here, and here, among others) and I would highly recommend reading the overview here. This post looks at price (as of 1/1/2013) and instance positioning.

Pricing and Instance Characteristics

Instance pricing drops not infrequently and current pricing across all instance types can be found here for EC2 and here for GCE. Please note that this comparison is for what Amazon calls On-Demand Instances. Amazon offers many ways to access instances including Reserved Instances, where you can prepay for light, medium and heavy utilization of instances over longer periods of time, and Spot Instances, where you can bid lower prices for underutilized infrastructure.

Instance types compete on numerous characteristics including:

  1. processing capability
  2. memory
  3. ephemeral storage (size and speed)
  4. IO performance

and a number of others (boot speed, networking configuration, performance consistency, etc) that I won't get into here.

Apples to Apples or Apples to Oranges?

Let's start with processing capability. Unfortunately, neither Amazon nor Google makes it web-simple to compare the processing power of their respective virtual instances. So let's dig deeper. EC2 measures processing power with "EC2 Compute Units" which are, per Amazon:

"One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 1.7 GHz Xeon processor referenced in our original documentation."

which is really less than helpful. However, Amazon released the "Cluster Compute Eight Extra Large Instance" on 11/14/2011. This monster offers 88 EC2 Compute Units and notes that this is really equivalent to "2 x Intel Xeon E5-2670, eight-core "Sandy Bridge" architecture." Intel says that its E5-2670 is a 2.6 GHz, 32 nm processor with 8 physical cores capable of running 16 hyper-threads (2 threads per core). Since the CCEELI has 2 x 8 cores = 16 cores = 32 hyper-threads, we have the following relationship:

88 EC2 Compute Units = 32 hyper-threads (@2.6 GHz, Sandy Bridge), or 2.75 EC2 Compute Units = 1 "virtual" core (@2.6 GHz, Sandy Bridge)

Now, let's look at Google's "cuter" measurement for processing power, the "GQ."

"GCEU (Google Compute Engine Unit), or GQ for short, is a unit of CPU capacity that we use to describe the compute power of our instance types. We chose 2.75 GQ’s to represent the minimum power of one logical core (a hardware hyper-thread) on our Sandy Bridge platform."

Reading further, for the current generation of 'n1' machine types, "a virtual CPU is implemented as a single hyperthread on a 2.6GHz Intel Sandy Bridge Xeon processor." Does this sound familiar? Thus,

2.75 GCEU = 1 Hyperthread (@2.6GHz, Sandy Bridge)

Piecing it all together we get:

1 GCEU = 1 EC2 Compute Unit

which, assuming the two platforms are running identical processors, is quite the coincidence. Let's run with this for the moment.
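
The arithmetic above can be sanity-checked in a few lines, using only the instance ratings quoted in this article:

```python
# Amazon: the Cluster Compute Eight Extra Large is rated at 88 ECU for
# 2 x 8-core E5-2670, i.e. 16 physical cores = 32 hyper-threads.
ecu_total = 88
hyperthreads = 2 * 8 * 2
ecu_per_hyperthread = ecu_total / hyperthreads  # 88 / 32 = 2.75

# Google defines 2.75 GCEU as one hyper-thread (one logical core) on the
# same 2.6 GHz Sandy Bridge platform.
gceu_per_hyperthread = 2.75

# The units line up one-to-one.
assert ecu_per_hyperthread == gceu_per_hyperthread
print(f"1 GCEU = {gceu_per_hyperthread / ecu_per_hyperthread:.0f} ECU")  # 1 GCEU = 1 ECU
```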

Assuming Some Things Equal

What everyone seems to care about is the distribution of CPU vs. RAM; ephemeral storage is just that: your instance goes down and everything in ephemeral storage is gone. It should be faster, but it can always be bolstered with cloud storage, either Amazon Simple Storage Service (S3) or Google Cloud Storage.


Thus, we have three key variables: price, CPU, and RAM. Google and Amazon instance categorizations reaffirm this as both offer "standard", "high memory", and "high CPU" instance types.

In the scatter plot of compute power and instance memory, one thing becomes obvious immediately.  Most instance types cluster in the lower left of the plot, having 25 GQ's and 60 GB of RAM or less.  However, Amazon (circles) sells access to a few outlier instances for more niche types of computing needs.

The outliers in the plot above (listed below) are all from Amazon, which has seen and attempted to fill market opportunities.

  • Tan circle, top middle = Cluster Compute Eight Extra Large Instance (60.5 GiB memory/88 EC2 Compute Units/3370 GB of instance storage)
  • Purple circle, far right = High Storage Instances (117 GiB memory/35 EC2 Compute Units/24 hard disk drives each with 2 TB of instance storage)
  • Light green circle, middle = High I/O Quadruple Extra Large Instance (60.5 GiB memory/35 EC2 Compute Units/2 SSD-based volumes each with 1024 GB of instance storage)

Excluded from this plot is Amazon's Cluster GPU Quadruple Extra Large Instance (22 GiB of memory/33.5 EC2 Compute Units/2 x NVIDIA Tesla “Fermi” M2050 GPUs/1690 GB of instance storage) as I have no good way of factoring in the general purpose GPU computing power (and neither does Amazon, or so it seems.)

When we eliminate Amazon's outlier instances listed above, we see a very interesting story. Google instances (triangles) are positioned to offer greater compute performance in most cases.

To counterattack, Amazon released its second generation "M3" instances on 10/31/12, offering the same amount of memory but more computational muscle (red/orange line). The M3 instances stand out by offering 3.25 EC2 Compute Units per virtual core, suggesting that the underlying hardware is faster than the Xeon E5-2670 Sandy Bridge @ 2.6GHz.

Interestingly, neither provider offers a single instance that gives you access to an entire virtualized processor. Such an instance serving up the entire 8 core Xeon E5-2670 would clock in at 44 compute units.

Looking at the ratio of CPU to RAM across instance classes from both Google and Amazon, we see some strong consistency:

  • Standard Instances range from 0.58 to 0.87
  • High Memory Instances range from 0.38 to 0.42
  • High CPU Instances range from 2.9 to 3.1
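
These ratios can be recomputed directly from instance specs. A minimal sketch, with the caveat that the per-instance figures below are my assumptions based on the providers' published instance tables of the time, not numbers quoted in this article:

```python
# CPU-to-RAM ratio (compute units per GB of RAM) for representative
# instances from each class. Specs are assumed from the era's published
# instance tables.
specs = {
    # name: (compute units, RAM in GB)
    "m1.small (standard)":        (1.0, 1.7),
    "n1-standard-1 (standard)":   (2.75, 3.75),
    "m2.xlarge (high memory)":    (6.5, 17.1),
    "n1-highmem-2 (high memory)": (5.5, 13.0),
    "c1.medium (high CPU)":       (5.0, 1.7),
    "n1-highcpu-2 (high CPU)":    (5.5, 1.8),
}

for name, (units, ram_gb) in specs.items():
    print(f"{name}: {units / ram_gb:.2f} units/GB")
```

Running this reproduces the class-by-class clustering: standard instances near the bottom of their band, high memory around 0.4, and high CPU around 3.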

What About Price?

Standard Instances

If we compare the M1 Medium in EC2 to the n1-standard-1-d, we get the same 3.75 GB of RAM but 37.5% more compute units for only 6% increased cost. This identical trend continues comparing the M1 Large to the n1-standard-2-d, and the M1 Extra Large to the n1-standard-4-d. I would not be surprised to see another price drop on first generation Amazon instance types.
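
That comparison can be sketched numerically. The compute-unit counts follow this article's derivation (m1.medium = 2 ECU, n1-standard-1-d = 2.75 GCEU), while the hourly prices are placeholder assumptions chosen to be consistent with the stated 6% cost difference, not quoted rates:

```python
# Percentage comparison of the M1 Medium vs. the n1-standard-1-d.
# Unit counts per the article's derivation; prices are assumed
# placeholders, not quoted rates.
def pct_more(new, old):
    return (new - old) / old * 100

m1_medium = {"units": 2.0, "price": 0.130}   # 2 ECU (assumed price)
n1_std_1d = {"units": 2.75, "price": 0.138}  # 2.75 GCEU (assumed price)

compute_gain = pct_more(n1_std_1d["units"], m1_medium["units"])  # 37.5
cost_gain = pct_more(n1_std_1d["price"], m1_medium["price"])     # ~6.2
print(f"{compute_gain:.1f}% more compute for {cost_gain:.1f}% more cost")
```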

Second Generation Instances

Things get interesting when we compare the second generation Amazon M3 images to GCE's. Comparing the M3 Extra Large to the n1-standard-4, we see the EC2 instance offers 18% more compute at 20% more cost. This exact pattern continues when examining the M3 Double Extra Large against the n1-standard-8.


High CPU Instances

Comparing the High-CPU Medium to the n1-highcpu-2-d, the GCE instance offers 5.8% more memory and 10% more compute for 3% more cost. The gap closes slightly for the High-CPU Extra Large versus the n1-highcpu-8-d, which offers 2.8% more memory and 10% more compute units for 3% more money.

Competition and What's Next

An arms race and/or price war in the cloud computing space will almost certainly benefit the end consumer. Per some reports, Google is targeting GCE not at the little guy or websites but at very large scale enterprise operations requiring many instances. It will be interesting to see how Google competes with the very rich ecosystem built atop Amazon's service, one that will take some time to replicate. In my next article, I plan to benchmark comparable instances to see if the assumptions made about the hardware were correct.

Some early benchmarks were posted mid-2012 but I suspect that they may no longer be valid.

A Note on Naming

Amazon's instance naming scheme is getting out of hand; just look at the "Cluster Compute Eight Extra Large Instance." If this doesn't stop, selecting an instance type from Amazon is going to sound a lot like ordering at Starbucks with your free drink card.



 works at Johns Hopkins University and co-organizes the Data Business DC and Mid Maryland Data Science meetups.