EC2

Instructions for deploying an Elasticsearch Cluster with Titan

Elasticsearch is an open source, distributed, real-time search engine for the cloud. It allows you to deploy a scalable, auto-discovering cluster of nodes; as search capacity grows, you simply add more nodes and the cluster reorganizes itself. Titan, a distributed graph engine by Aurelius, supports Elasticsearch as an option for indexing your vertices for fast lookup and retrieval. By default, Titan runs Elasticsearch in the same JVM and stores its data locally on the client, which is fine for embedded mode. However, once your Titan cluster starts growing, you have to respond by growing an Elasticsearch cluster side by side with the graph engine.

This tutorial shows how to quickly get an Elasticsearch cluster up and running on EC2, then configure Titan to use it for indexing. It assumes you already have an EC2/Titan cluster deployed. Note that these instructions were written for a particular deployment, so please ask about any specifics in the comments!

Step 1: Installation

NOTE: These instructions assume you've installed Java 6 or later.

By far the easiest way to install Elasticsearch on an Ubuntu EC2 instance is the Debian package provided as a download. This package installs an init script, places the configuration files in /etc/elasticsearch, and generally creates goodness that we don't have to deal with ourselves. You can find the .deb on the Elasticsearch download page.

$ cd /tmp
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb.sha1.txt
$ sha1sum elasticsearch-0.90.7.deb && cat elasticsearch-0.90.7.deb.sha1.txt 

Note that you may have to pass the --no-check-certificate flag to wget, or you can use curl instead; just make sure you use the correct filenames. Also verify that the checksums match (and, to be even more paranoid, check them against the Elasticsearch website).
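If you prefer curl, the equivalent downloads look like this (same URLs as above):

$ curl -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb
$ curl -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb.sha1.txt

Installation is simple: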

$ sudo dpkg -i elasticsearch-0.90.7.deb

Elasticsearch will now be running on your machine with the default configuration. To check this you can do the following:

$ sudo status elasticsearch
$ ps -ef | grep elasticsearch

But while we configure it, it doesn't really need to be running:

$ sudo service elasticsearch stop

In particular, the package installation does the following things you should be aware of (you can verify the layout yourself, as shown after the list):

  1. Creates the elasticsearch:elasticsearch user and group
  2. Installs the library into /usr/share/elasticsearch
  3. Creates the logging directory at /var/log/elasticsearch
  4. Creates the configuration directory at /etc/elasticsearch
  5. Creates a data directory at /var/lib/elasticsearch
  6. Creates a temp work directory at /tmp/elasticsearch
  7. Creates an init script at /etc/init.d/elasticsearch
  8. Creates a defaults file at /etc/default/elasticsearch
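To double-check where everything landed on your own machine, you can list the files the package installed with standard Debian tooling:

$ dpkg -L elasticsearch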

Because of our particular Titan deployment, this is not good enough for what we're trying to accomplish, so the next step is configuration.

Step 2: Configuration

The configuration we're looking for is an auto-discovering EC2 elastic cluster that is bound to the default ports and works with data on the attached volume rather than on the much smaller root disk. In order to auto-discover on EC2, we have to install an AWS plugin, which can be found on the elasticsearch-cloud-aws plugin GitHub page:

$ cd /usr/share/elasticsearch
$ bin/plugin -install elasticsearch/elasticsearch-cloud-aws/1.15.0

Elasticsearch is configured via a YAML file at /etc/elasticsearch/elasticsearch.yml, so open it in your editor and apply the configuration we used below:

path:
    conf: /etc/elasticsearch
    data: /raid0/elasticsearch
    work: /raid0/tmp/elasticsearch
    logs: /var/log/elasticsearch
cluster:
    name: DC2
cloud:
    aws:
        access_key: ${AWS_ACCESS_KEY_ID}
        secret_key: ${AWS_SECRET_ACCESS_KEY}
discovery:
    type: ec2

For us, the other defaults worked just fine. Let's walk through this a bit. First, for all the paths, make sure they exist, that you've created them, and that they have the correct permissions. The ${AWS_ACCESS_KEY_ID} and ${AWS_SECRET_ACCESS_KEY} values are placeholders for your actual AWS credentials. The /raid0 folder is where we have mounted an EBS volume that gives us enough non-ephemeral storage for our data services. Although EBS adds some network overhead, it prevents data loss when the instance is terminated. However, if you're not working with EBS, or you've mounted it in a different location, the defaults on the root volume are probably fine.

$ sudo mkdir /raid0/elasticsearch
$ sudo chown elasticsearch:elasticsearch /raid0/elasticsearch
$ sudo chmod 775 /raid0/elasticsearch
$ sudo mkdir -p /raid0/tmp/elasticsearch
$ sudo chmod 777 /raid0/tmp
$ sudo chown elasticsearch:elasticsearch /raid0/tmp/elasticsearch
$ sudo chmod 775 /raid0/tmp/elasticsearch

Editor's Note: I just discovered that you can actually set these options at package install time so that you don't have to do this manually. See the Elasticsearch "running as a service on Linux" guide for more.
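Concretely, the Debian package reads its settings from /etc/default/elasticsearch. A minimal sketch of what ours might look like follows; the exact variable names can differ between package versions, so check the file your package actually shipped:

# /etc/default/elasticsearch (illustrative values only; verify the
# variable names against the file installed by your package version)
ES_USER=elasticsearch
ES_GROUP=elasticsearch
DATA_DIR=/raid0/elasticsearch
WORK_DIR=/raid0/tmp/elasticsearch
LOG_DIR=/var/log/elasticsearch
CONF_DIR=/etc/elasticsearch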

The cluster name, in our case DC2, needs to be the same for every node in the cluster; this is especially vital on EC2, where the default name, elasticsearch, could make discovery more difficult. Also note that each node can be named individually; by default the name is selected randomly from a list of 3000 or so Marvel characters. The cloud and discovery options enable discovery through EC2.
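If you prefer stable, human-readable node names over random Marvel characters, you can set one per node in elasticsearch.yml. The name below is just a made-up example:

node:
    name: titan-es-1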

You should now be able to run the cluster:

$ sudo service elasticsearch start

Check the logs to make sure there are no errors, and that the cluster is running. If so, you should be able to navigate to the following URL:

http://localhost:9200/_cluster/health?pretty=true

By replacing localhost with the node's hostname, you can see the status of the cluster as well as the number of nodes. But wait, why aren't any more nodes being added? Don't keep waiting! The reason is that Titan has probably already been configured to use local Elasticsearch, and is blocking port 9300, the communication and control port for the ES cluster.
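For example, you can hit the health endpoint from the command line (substitute your node's hostname for localhost); the status and number_of_nodes fields are the ones to watch:

$ curl 'http://localhost:9200/_cluster/health?pretty=true'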

Step 3: Configuring Titan

Titan's own local Elasticsearch is blocking the cluster's Elasticsearch, and in any case we want Titan to use the Elasticsearch cluster! Let's reconfigure Titan. First, open your favorite editor and change the configuration in /opt/titan/config/yourgraph.properties to the following:

storage.backend=cassandra
storage.hostname=${LOCAL_IPADDR}

storage.index.search.backend=elasticsearch
storage.index.search.client-only=true
storage.index.search.hostname=${ES_ADDR},${ES_ADDR},${ES_ADDR}

Hopefully you don't have to change the storage.backend and storage.hostname configurations. Remove the storage.index.search.local-mode and storage.index.search.directory configurations, then add the index configurations shown above.

For storage.index.search.hostname, supply a comma-separated list of every node in the ES cluster (for now).
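For example, with three hypothetical ES nodes at 10.0.0.11, 10.0.0.12, and 10.0.0.13 (substitute your own addresses), the line would read:

storage.index.search.hostname=10.0.0.11,10.0.0.12,10.0.0.13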

That's it! Reload Titan, and you should soon see the cluster grow to include all the nodes you configured, as well as a speed-up in queries to the Titan graph!

A Revolution in Cloud Pricing: Minute By Minute Cloud Billing for Everyone

Google I/O wrapped up last week with a tremendous number of data-related announcements. Today's post focuses on Google Compute Engine (GCE), Google's answer to Amazon's Elastic Compute Cloud (EC2), which allows you to create and run virtual compute instances within Google's cloud. We have spent a good amount of time talking about GCE in the past, in particular benchmarking it against EC2 here, here, here, and here. The main GCE announcement at I/O was, of course, the fact that now **anyone** and **everyone** can try out and use GCE. Yes, GCE instances now support up to 10 terabytes per disk volume, which is a BIG deal. However, the fact that GCE will use minute-by-minute pricing, which might not seem incredibly significant on the surface, is an absolute game changer.

Let's say that I have a job that needs a thousand instances, each running a little over an hour (a total of just over a thousand "instance hours"). I launch my thousand instances, run the needed job, and then shut down my cloud 61 minutes later. Let's also assume that Amazon and Google both charge about the same amount, say $0.50 per instance per hour (a relatively safe assumption), and that Amazon's and Google's instances have the same computational horsepower (this is not true; see my benchmark results). As Amazon charges by the hour, it would bill me for two hours per instance, or $1000.00 total (1000 instances x $0.50 per instance per hour x 2 hours per instance), whereas Google would only charge me $508.33 (1000 instances x $0.50 per instance per hour x 61/60 hours per instance). In this circumstance, Amazon's hourly billing has almost doubled my costs, but the impact can be far worse.
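To generalize the arithmetic (assuming Amazon rounds each instance's runtime up to a whole hour, while GCE bills per minute with a 10-minute minimum, as discussed below):

cost_EC2 = instances x hourly_rate x ceil(minutes / 60)
cost_GCE = instances x hourly_rate x max(minutes, 10) / 60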

If I want to mitigate the overcharge, I can run the job with fewer instances for a longer time. One option would be to run 100 instances for just over 10 hours each. This setup would cost me $550 (100 instances x 11 hours per instance x $0.50 per instance per hour). If I am exceedingly price sensitive, I could run a single instance for 1001 hours and get the same job done at a total cost of $500.50. At this point, I am only getting overcharged $0.50, but if you are willing to wait 1000 hours for your results, why use the cloud at all?

OK, now let's say completing the task is incredibly important to you and time is of the essence. In this case, let's throw 5000 instances at the problem, which now takes just over 12 minutes to solve (let's call it 13 minutes). Running these 5000 instances in GCE would cost $541.67 (5000 instances x 13/60 hours per instance x $0.50 per hour per instance), whereas the same run in Amazon would cost $2500 (5000 instances x 1 hour per instance x $0.50 per hour per instance)!

With GCE, I don't have to worry about this overcharge until I hit the 10-minute minimum charge window. Thus, whenever I use GCE, I should simply throw as many instances as possible at the problem without thinking, as the price is going to wind up about the same in either case. Or, put another way: in the best case, GCE gets my job done in 13 minutes for about $540, whereas for the same amount of money ($550), Amazon completes the task in 10 hours. Which one would you choose?

This is the true beauty of the cloud. GCE's pricing scheme incentivizes users to take full advantage of the cloud (massive parallelization for bursty computation), whereas Amazon's does not. When using GCE, I will spin up as many instances as needed to get the job done as fast as possible. With Amazon, I will not, due to the billing overcharges. Even with all other things equal, GCE wins every time. Once users get used to getting immediate results, they won't go back.

My guess is that Amazon will change its hourly billing practices sooner rather than later.

Amazon EC2 versus Google Compute Engine, Part 4

It has been a while since we have talked about cloud computing benchmarks, and we wanted to bring a recent, relevant post to your attention. But before we do, let's summarize the story of EC2 versus Google Compute Engine so far. Our first article compared and contrasted GCE and EC2 instance positioning. The second article benchmarked various instance types across the two providers using a series of synthetic benchmarks, including the Phoronix Test Suite and SciMark.


And, less than a week after Data Community DC reported a 20% performance disparity, Amazon dropped the price of only those instances that directly compete with GCE by, you guessed it, 20%. Coincidence?

Our third article continued the performance evaluation, using the Java version of the NAS Parallel Benchmarks to explore single- and multi-threaded computational performance. Since then, GCE prices have dropped 4% across the board, to which Amazon quickly responded with a price drop of its own.

Whereas we looked at the number-crunching capability of the two services, this post and the reported set of benchmarks from Sebastian Stadil of Scalr compare and contrast many other aspects of the two services, including:

  • instance boot times
  • ephemeral storage read and write performance
  • inter-region network bandwidth and latency

My anecdotal experience with GCE confirms that GCE instance boot times are radically faster than EC2's and that GCE's API is far easier to use (very similar to MIT's StarCluster cluster-management package for EC2).

In general, this article complements our findings nicely, making a strong case to at least test out Google Compute Engine if you are considering EC2 or are a long-time user of Amazon Web Services.

You can find the article in question here: By the numbers: How Google Compute Engine stacks up to Amazon EC2 — Tech News and Analysis