Tutorials

High-Performance Computing in R Workshop

Data Community DC and District Data Labs are excited to be hosting a High-Performance Computing with R workshop on June 21st, 2014, taught by Yale professor and R package author Jay Emerson. If you're interested in learning about high-performance computing, including concepts such as memory management, algorithmic efficiency, parallel programming, handling larger-than-RAM matrices, and using shared memory, this is an awesome way to learn!

To reserve a spot, go to http://bit.ly/ddlhpcr.

Overview

This intermediate-level masterclass will introduce you to topics in high-performance computing with R. We will begin by examining a range of related topics including memory management and algorithmic efficiency. Next, we will quickly explore the new parallel package (containing snow and multicore). We will then concentrate on the elegant framework for parallel programming offered by packages foreach and the associated parallel backends. The R package management system, including the C/C++ interface and use of package Rcpp, will be covered. We will conclude with basic examples of handling larger-than-RAM numeric matrices and use of shared memory. Hands-on exercises will be used throughout.

What will I learn?

Different people approach statistical computing with R in different ways. It can be helpful to work on real data problems and learn something about R “on the fly” while trying to solve a problem. But it is also useful to have a more organized, formal presentation without the distraction of a complicated applied problem. This course offers four distinct modules which adopt both approaches and offer some overlap across the modules, helping to reinforce the key concepts. This is an active-learning class where attendees will benefit from working along with the instructor. Roughly, the modules include:

An intensive review of the core language syntax and data structures for working with and exploring data. Functions; conditionals; arguments; loops; subsetting; manipulating and cleaning data; efficiency considerations and best practices, including loops and vector operations, memory overhead, and optimizing performance.

Motivating parallel programming with an eye on programming efficiency: a case study. Processing, manipulating, and conducting a basic analysis of 100-200 MB of raw microarray data provides an excellent challenge on standard laptops. It is large enough to be mildly annoying, yet small enough that we can make progress and see the benefits of programming efficiency and parallel programming.

Topics in high-performance computing with R, including packages parallel and foreach. Hands-on examples will help reinforce key concepts and techniques.

Authoring R packages, including an introduction to the C/C++ interface and the use of Rcpp for high-performance computing. Participants will build a toy package including calls to C/C++ functions.

Is this class right for me?

This class will be a good fit for you if you are comfortable working in R and are familiar with R's core data structures (vectors, matrices, lists, and data frames). You are comfortable with for loops and preferably aware of R's apply-family of functions. Ideally you will have written a few functions on your own. You have some experience working with R, but are ready to take it to the next level. Or, you may have considerable experience with other programming languages but are interested in quickly getting up to speed in the areas covered by this masterclass.

After this workshop, what will I be able to do?

You will be in a better position to code efficiently with R, perhaps avoiding the need, in some cases, to resort to C/C++ or parallel programming. But you will be able to implement so-called embarrassingly parallel algorithms in R when the need arises, and you'll be ready to exploit R's C/C++ interface in several ways. You'll be in a position to author your own R package that can include C/C++ code.

All participants will receive electronic copies of all slides, data sets, exercises, and R scripts used in the course.

What do I need to bring?

You will need your laptop with the latest version of R. I recommend using the RStudio IDE, but it is not necessary. A few add-on packages will be used in the workshop, including Rcpp and foreach. As a complement to foreach you should also install doMC (Linux or MacOS only) and doSNOW (all platforms). If you want to work along with the C/C++ interface segment, some extra preparation will be required. Rcpp and use of the C/C++ interface require compilers and extra tools; the folks at RStudio have a nice page that summarizes the requirements. Please note that these requirements may not be trivial (particularly on Windows) and need to be completed prior to the workshop if you intend to compile C/C++ code and use Rcpp during the workshop.

Instructor

John W. Emerson (Jay) is Director of Graduate Studies in the Department of Statistics at Yale University. He teaches a range of graduate and undergraduate courses as well as workshops, tutorials, and short courses at all levels around the world. His interests are in computational statistics and graphics, and his applied work ranges from topics in sports statistics to bioinformatics, environmental statistics, and Big Data challenges.

He is the author of several R packages including bcp (for Bayesian change point analysis), bigmemory and sister packages (towards a scalable solution for statistical computing with massive data), and gpairs (for generalized pairs plots). His teaching style is engaging and his workshops are active, hands-on learning experiences.

You can reserve your spot by going to http://bit.ly/ddlhpcr.

Saving Money Using Data Science for Dummies

I'd like to tell you a story about how I made "data science" work for me without writing a single line of code, launching a single data analysis or visualization app, or even looking at a single digit of data.

"No, way!", you say? Way, my friends. Way.

Our company is developing VoteRaise.com, a new way for voters to fundraise for the candidates they'd like to have run for political office. We realized that, even here in DC, there is no active community of people exploring how innovations in technology & techniques impact political campaigning, so we decided to create one.

I started the process of creating RealPolitech, a new Meetup, last night. It's clear that Meetup makes use of data and, more importantly, thinks about how to use that data to help its users. For instance, when it came time to pick the topics that RealPolitech would cover, Meetup did a great job of making suggestions. All I had to do was type a few letters, and I was presented with a list of options to select.

I got the whole meetup configured quickly. I was impressed by the process. But, when it came time to make a payment, I went looking for a discount code. Couldn't find one in FoundersCard. Couldn't find it in Fosterly. Or anywhere else online. Seventy-two dollars for six months is a good deal, already. But, still, we're a bootstrapped startup, and every dollar saved is a dollar earned. So, I decided to try something.

I just... stopped. I figured, if Meetup is gathering and providing data during the configuration process, they must be doing the same during checkout. So, I figured I'd give them a day or two to see whether I'd receive an unexpected "special offer". Sure enough, by the time I woke up, I had a FIFTY PERCENT discount coupon, along with a short list of people who may be interested in RealPolitech, as an incentive to pay the dues and launch the Meetup.

RealPolitech is up and running. You can find it here; come join and share your ideas and expertise! We are putting together our kickoff event, and lining up speakers and sponsors.

Oh, and I saved 50% off the dues by leveraging my knowledge of data science to get the outcome I wanted by doing... absolutely nothing.

Backbone, The Primer: A Simple App

backbone

This is the fourth of four posts that guest author Greg MacWilliam has put together for Data Community DC: Backbone, The Primer. For more details about the API and an overview, please go back to the Introduction. For more on Models and Collections please review part two of this primer. A comprehensive look at Views and event bindings was in part three.

A simple REST application

Now it's time to put it all together. Let's break down a complete RESTful application that performs all CRUD methods with our API.

1. The DOM

The first step in setting up any small application is to establish a simple interface for managing the tasks you intend to perform. Here, we've established a "muppets-app" container element with a list (<ul>) for displaying all Muppet items, and a simple input form for defining new muppets.

Down below, a template is defined for rendering individual list items. Note that our list item template includes a "remove" button for clearing the item from the list.

Finally, we'll include our application's JavaScript as an external script. We can assume that all further example code will be included in muppet-app.js.

<div id="muppets-app">
    <ul class="muppets-list"></ul>

    <div class="muppet-create">
        <b>Add a Muppet</b>
        <fieldset>
            <label for="muppet-name">Name:</label>
            <input id="muppet-name" type="text">
        </fieldset>
        <fieldset>
            <label for="muppet-job">Job:</label>
            <input id="muppet-job" type="text">
        </fieldset>
        <button class="create">Create Muppet!</button>
    </div>
</div>

<script type="text/template" id="muppet-item-tmpl">
    <p><a href="/muppets/<%= id %>"><%= name %></a></p>
    <p>Job: <i><%= occupation %></i></p>
    <button class="remove">x</button>
</script>

<script src="muppet-app.js"></script>

2. The Model and Collection

Now in "muppet-app.js", the first structures we'll define are the Model class for individual list items, and the Collection class for managing a list of models. The Collection class is configured with the URL of our API endpoint.

// Model class for each Muppet item
var MuppetModel = Backbone.Model.extend({
    defaults: {
        id: null,
        name: null,
        occupation: null
    }
});

// Collection class for the Muppets list endpoint
var MuppetCollection = Backbone.Collection.extend({
    model: MuppetModel,
    url: '/muppets'
});

3. A List Item View

The first View class that we'll want to define is for individual list items. This class will generate its own <li> container element, and will render itself with our list item template. That template function is being generated once, and then stored as a member of the class. All instances of this class will utilize that one parsed template function.

This view also configures an event for mapping clicks on the "remove" button to its model's destroy method (which will remove the model from its parent collection, and then dispatch a DELETE request from the model to the API).

// View class for displaying each muppet list item
var MuppetsListItemView = Backbone.View.extend({
    tagName: 'li',
    className: 'muppet',
    template: _.template($('#muppet-item-tmpl').html()),

    render: function() {
        var html = this.template(this.model.toJSON());
        this.$el.html(html);
        return this;
    },

    events: {
        'click .remove': 'onRemove'
    },

    onRemove: function() {
        this.model.destroy();
    }
});

4. A List View

Now we need a view class for rendering the list of items, and for capturing input from the "create" form.

This view binds a listener to its collection that will trigger the view to render whenever the collection finishes syncing with the API. That will force our view to re-render when initial data is loaded, or when items are created or destroyed.

This view renders a list item for each model in its collection. It first finds and empties its list container ("ul.muppets-list"), and then loops through its collection, building a new list item view for each model in the collection.

Lastly, this view configures an event that maps clicks on the "create" button to collecting form input, and creating a new collection item based on the input data.

// View class for rendering the list of all muppets
var MuppetsListView = Backbone.View.extend({
    el: '#muppets-app',

    initialize: function() {
        this.listenTo(this.collection, 'sync', this.render);
    },

    render: function() {
        var $list = this.$('ul.muppets-list').empty();

        this.collection.each(function(model) {
            var item = new MuppetsListItemView({model: model});
            $list.append(item.render().$el);
        }, this);

        return this;
    },

    events: {
        'click .create': 'onCreate'
    },

    onCreate: function() {
        var $name = this.$('#muppet-name');
        var $job = this.$('#muppet-job');

        if ($name.val()) {
            this.collection.create({
                name: $name.val(),
                occupation: $job.val()
            });
            $name.val('');
            $job.val('');
        }
    }
});

5. Instantiation

Finally, we need to build instances of our components. We'll construct a collection instance to load data, and then construct a list view instance to display it. When our application components are all configured, all that's left to do is tell the collection to fetch its data!

// Create a new list collection, a list view, and then fetch list data:
var muppetsList = new MuppetCollection();
var muppetsView = new MuppetsListView({collection: muppetsList});

muppetsList.fetch();

Getting View Support

View management is by far the least regulated component of Backbone, and yet is, ironically, among the most uniquely disciplined roles in front-end engineering. While Backbone.View provides some very useful low-level utility features, it provides few high-level workflow features. As a result, major Backbone extensions including Marionette and LayoutManager have become popular. Also see ContainerView for a minimalist extension of core Backbone.View features.

Thanks for reading. That's Backbone in a nutshell.

About the Author

Greg MacWilliam is an RIT alumnus, currently working as a freelance web software engineer. He specializes in single-page application architecture and API design. He's worked for organizations including the National Park Service and NPR, and has taught JavaScript at the Art Institute of Washington. Greg has self-published several online games, and actively contributes to BackboneJS projects.

Greg's current side projects include the Backbone extensions EpoxyJS and ContainerView, as well as ConstellationJS for grid geometry operations.

You can follow him on Twitter and GitHub.

Building Data Apps with Python

Data Community DC and District Data Labs are excited to be offering a Building Data Apps with Python workshop on April 19th, 2014. Python is one of the most popular programming languages for data analysis. Therefore, it is important to have a basic working knowledge of the language in order to access more complex topics in data science and natural language processing. The purpose of this one-day course is to introduce the development process in Python using a project-based, hands-on approach.


This course is focused on Python development in a data context for those who aren’t familiar with Python. Other courses like Python Data Analysis focus on data analytics with Python, not on Python development itself.

The main workshop will run from 11am - 6pm with an hour break for lunch around 1pm. For those who are new to programming, there will be an optional introductory session from 9am - 11am aimed at getting you comfortable enough with Python development to follow along in the main session.

Introductory Session: Python for New Programmers (9am - 11am)

The morning session will teach the fundamentals of Python to those who are new to programming. Learners will be grouped with a TA to ensure their success in the second session. The goal of this session is to ensure that students can demonstrate basic concepts in a classroom environment through successful completion of hands-on exercises. This beginning session will cover the following basic topics and exercises:

Topics:

  • Variables
  • Expressions
  • Conditionality
  • Loops
  • Executing Programs
  • Object Oriented Programming
  • Functions
  • Classes

Exercises:

  • Write a function to determine if input is even or odd
  • Read data from a file
  • Count the words/lines in a file

At the end of this session, students should be familiar enough with programming concepts in Python to be able to follow along in the second session. They will have acquired a learning cohort in their classmates and instructors to help them learn Python more thoroughly in the future, and they will have observed Python development in action.

Main Session: Building a Python Application (11am - 6pm)

The afternoon session will focus on Python application development for those who already know how to program and are familiar with Python. In particular, we’ll build a data application from beginning to end in a workshop fashion. This course would be a prerequisite for all other DDL courses offered that use Python.

The following topics will be covered:

  • Basic project structure
  • virtualenv & virtualenvwrapper
  • Building requirements outside the stdlib
  • Testing with nose
  • Ingesting data with requests.py
  • Munging data into SQLite Databases
  • Some simple computations in Python
  • Reporting data with JSON
  • Data visualization with Jinja2 and Highcharts

We will build a Python application using the data science workflow: using Python to ingest, munge, compute, report, and even visualize. This is a basic, standard workflow that is repeatable and paves the way for more advanced courses using numerical and statistical packages in Python like Pandas and NumPy. In particular, we'll fetch data from Data.gov, transform it and store it in a SQLite database, then do some simple computation. Then we will use Python to push our analyses out in JSON format and provide a simple reporting technique with Jinja2 and charting using Highcharts.

For more information and to reserve a spot, go to http://bit.ly/1m0y5ws.

Hope to see you there!

Backbone, The Primer: Views

backbone

This is the third of four posts that guest author Greg MacWilliam has put together for Data Community DC: Backbone, The Primer. For more details about the API and an overview, please go back to the Introduction. For more on Models and Collections please review part two of this primer.

Using Views

Views create linkage between data sources (Models and Collections) and display elements. As a general rule, Views should map one-to-one with each data source present – meaning that one view controller is created for each collection and each model represented within the display.

Creating a view's container element

All Backbone views are attached to a container element, or an HTML document element into which all nested display and behavior is allocated. A common approach is to bind major views onto predefined elements within the HTML Document Object Model (henceforth referred to as the "DOM"). For example:

<ul id="muppets-list"></ul>

<script>
var MuppetsListView = Backbone.View.extend({
    el: '#muppets-list'
});
</script>

In the above example, a Backbone view class is configured to reference "#muppets-list" as its target el, or element. This element reference is a selector string that gets resolved into a DOM element reference.

Another common workflow is to have Backbone views create their own container elements. To do this, simply provide a tagName and an optional className for the created element:

var MuppetsListItemView = Backbone.View.extend({
    tagName: 'li',
    className: 'muppet'
});

These two container element patterns (selecting and creating) are commonly used together. For example, a collection may attach itself to a selected DOM element, and then create elements for each item in the collection.

Once a view class is defined, we'll next need to instance it:

var MuppetsListView = Backbone.View.extend({
    el: '#muppets-list'
});

// Create a new view instance:
var muppetsList = new MuppetsListView();

// Append content into the view's container element:
muppetsList.$el.append('<li>Hello World</li>');

When a view is instanced, Backbone will configure an $el property for us: this is a jQuery object wrapping the view's attached container element. This reference provides a convenient way to work with the container element using the jQuery API.

Backbone also encourages efficient DOM practices using jQuery. Rather than performing large and expensive operations across the entire HTML document, Backbone views provide a $ method that performs jQuery operations locally within the view's container element:

// Find all "li" tags locally within the view's container:
muppetsList.$('li');

Under the hood, using view.$('…') is synonymous with calling view.$el.find('…'). These localized queries greatly cut down on superfluous DOM operations.

Attaching a view's data source

A view is responsible for binding its document element to a model or a collection instance, provided to the view as a constructor option. For example:

var myModel = new MyModel();
var myView = new MyView({model: myModel});

// The provided model is attached directly onto the view:
console.log(myView.model === myModel); // << true

Attach a model to a view by providing a {model: …} constructor option:

var KermitModel = Backbone.Model.extend({
    url: '/muppets/1',
    defaults: { . . . }
});

var MuppetsListItemView = Backbone.View.extend({
    tagName: 'li',
    className: 'muppet',

    initialize: function() {
        console.log(this.model); // << KermitModel!!
    }
});

// Create Model and View instances:
var kermitModel = new KermitModel();
var kermitView = new MuppetsListItemView({model: kermitModel});

Attach a collection to a view by providing a {collection: …} constructor option:

var MuppetsModel = Backbone.Model.extend({ . . . });

var MuppetsCollection = Backbone.Collection.extend({
    model: MuppetsModel,
    url: '/muppets'
});

var MuppetsListView = Backbone.View.extend({
    el: '#muppets-list',

    initialize: function() {
        console.log(this.collection); // << MuppetsCollection!!
    }
});

// Create Collection and View instances:
var muppetsList = new MuppetsCollection();
var muppetsView = new MuppetsListView({collection: muppetsList});

In the above examples, the provided data sources are attached directly to their view instances, thus allowing the views to reference those sources as this.model or this.collection. It will be the view's job to render its data source into its DOM element, and pass user input data from the DOM back into its data source.

Also note that the above examples leverage Backbone's initialize method. initialize is called once per object instance at the time the object is created, and is therefore useful for configuring new objects. Any Backbone component may define an initialize method.

Rendering a View

A view's primary responsibility is to render data from its data source into its bound DOM element. Backbone is notoriously unopinionated about this task (for better or worse), and provides no fixtures for translating a data source into display-ready HTML. That's for us to define.

However, Backbone does prescribe a workflow for where and when rendering occurs:

  1. A view defines a render method. This method generates HTML from its data source, and installs that markup into the view's container element.
  2. A view binds event listeners to its model. Any changes to the model should trigger the view to re-render.

A simple implementation:

<div id="kermit-view"></div>

<script>
var KermitModel = Backbone.Model.extend({
    url: '/muppets/1',
    defaults: {
        name: '',
        occupation: ''
    }
});

var KermitView = Backbone.View.extend({
    el: '#kermit-view',

    initialize: function() {
        this.listenTo(this.model, 'sync change', this.render);
        this.model.fetch();
        this.render();
    },

    render: function() {
        var html = '<b>Name:</b> ' + this.model.get('name');
        html += ', occupation: ' + this.model.get('occupation');
        this.$el.html(html);
        return this;
    }
}); 

var kermit = new KermitModel();
var kermitView = new KermitView({model: kermit});
</script>

In the above example, a simple render cycle is formed:

  1. The view's render method translates its bound model into display-ready HTML. The rendered HTML is inserted into the view's container element. A render method normally returns a reference to the view for method-chaining purposes.
  2. The view's initialize method binds event listeners to the model for sync and change events. Either of those model events will trigger the view to re-render. The view then fetches (loads) its model, and renders its initial appearance.

At the core of this workflow is event-driven behavior. View rendering should NOT be a direct result of user interactions or application behaviors. Manually timing render calls is prone to errors and inconsistencies. Instead, rendering should be a simple union of data and display: when the data changes, the display updates.

Rendering with templates

To simplify the process of rendering model data into display-ready markup, parsed HTML templates are commonly used. An HTML template looks generally like this:

<p><a href="/muppets/<%= id %>"><%= name %></a></p>
<p>Job: <i><%= occupation %></i></p>

Look familiar? Template rendering on the front-end is very similar to server-side HTML rendering. We just need a JavaScript template utility to parse these template strings.

There are numerous JavaScript template libraries available. For some reason, Handlebars is incredibly popular among Backbone developers… personally, I find this odd considering that Underscore has a perfectly capable template renderer built in, and is thus omnipresent in all Backbone apps. For this primer, we'll be using the Underscore template renderer.

To implement a front-end template, we first need to define the raw-text markup. Here's a quick and easy trick for hiding raw template text within HTML documents: include the raw text in a <script> tag with a bogus script type. For example:

<script type="text/template" id="muppet-item-tmpl">
    <p><a href="/muppets/<%= id %>"><%= name %></a></p>
    <p>Job: <i><%= occupation %></i></p>
</script>

The above <script> tag defines a bogus type="text/template" attribute. This isn't a valid script type, so the script tag's contents are ignored by HTML parsers. However, we can still access that ignored script tag within the DOM, extract its raw text content, and parse that text into a template. To create a JavaScript template, we do this:

var tmplText = $('#muppet-item-tmpl').html();
var muppetTmpl = _.template(tmplText);

The Underscore template method parses our raw text into a reusable template function. This template function may be called repeatedly with different data sources, and will generate a parsed HTML string for each source. For example, let's quickly load and render Kermit:

var muppetTmpl = _.template( $('#muppet-item-tmpl').html() );
var kermit = new KermitModel();

kermit.fetch().then(function() {
    var html = muppetTmpl(kermit.toJSON());
});

// Resulting HTML string:
<p><a href="/muppets/1">Kermit</a></p>
<p>Job: <i>being green</i></p>

In the above example, a KermitModel is created and fetched, and then rendered to HTML after its data is loaded. To generate HTML markup, we simply invoke the template function and pass in a data source. The process is pretty straight-forward until we get to that mysterious toJSON call. What's that?

In order to render a Backbone Model using a generic template, we must first serialize the model into primitive data. Backbone provides a toJSON method on Models and Collections for precisely this reason; toJSON will serialize a plain object representation of these proprietary data structures.
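For instance (assuming Kermit's data has already been fetched), the serialized output is just a plain object of the model's attributes, which is exactly what a template function expects:

kermit.toJSON();
// >> {id: 1, name: "Kermit", occupation: "being green"}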

Let's revise the earlier rendering example to include a parsed template:

<div id="kermit-view"></div>

<script type="text/template" id="muppet-tmpl">
    <p><a href="/muppets/<%= id %>"><%= name %></a></p>
    <p>Job: <i><%= occupation %></i></p>
</script>

<script>
var KermitModel = Backbone.Model.extend({
    url: '/muppets/1',
    defaults: {
        name: '',
        occupation: ''
    }
});

var KermitView = Backbone.View.extend({
    el: '#kermit-view',
    template: _.template($('#muppet-tmpl').html()),

    initialize: function() {
        this.listenTo(this.model, 'sync change', this.render);
        this.model.fetch();
        this.render();
    },

    render: function() {
        var html = this.template(this.model.toJSON());
        this.$el.html(html);
        return this;
    }
}); 

var kermit = new KermitModel();
var kermitView = new KermitView({model: kermit});
</script>

Using a parsed template greatly simplifies the render method, especially as the size and complexity of the rendering increases. Also note that our template function is generated once and cached as a member of the view class. Generating template functions is slow, so it's best to retain a template function that will be used repeatedly.

Binding DOM events

Next up, a view must capture user input events, whether that's an element click, text input, or changes in keyboard focus. Backbone Views provide a convenient way of declaring user interface events using an events object. The events object defines a mapping of DOM event triggers to handler methods on the view.

<div id="kermit-view">
    <label>Name:</label> <input type="text" name="name" class="name">
    <button class="save">Save</button>
</div>

<script>
var KermitView = Backbone.View.extend({
    el: '#kermit-view',

    events: {
        'change .name': 'onChangeName',
        'click .save': 'onSave'
    },

    onChangeName: function(evt) {
        this.model.set('name', evt.currentTarget.value);
    },

    onSave: function(evt) {
        this.model.save();
    }
});

var kermitView = new KermitView({model: new KermitModel()});
</script>

To summarize the structure of the above events object:

  • Event triggers are declared as keys on the events object, formatted as "[event_type] [selector]".
  • Event handlers are declared as string values on the events object; each handler name cites a method available in the view.

Be mindful that event handler methods should be kept fairly simple, and remain focused on how each DOM event trigger relates to a behavior of the underlying model.
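For instance (a hypothetical view, not part of the running example), a lean handler simply relays the interaction to the model and lets model events drive any re-rendering:

var ToggleView = Backbone.View.extend({
    events: {
        'click .toggle': 'onToggleActive'
    },

    onToggleActive: function() {
        // Keep the handler thin: just pass the interaction along to the model;
        // the model's 'change' event can then trigger a re-render elsewhere.
        this.model.set('active', !this.model.get('active'));
    }
});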

About the Author

Greg MacWilliam is an RIT alumnus, currently working as a freelance web software engineer. He specializes in single-page application architecture and API design. He's worked for organizations including the National Park Service and NPR, and has taught JavaScript at the Art Institute of Washington. Greg has self-published several online games, and actively contributes to BackboneJS projects.

Greg's current side projects include the Backbone extensions EpoxyJS and ContainerView, as well as ConstellationJS for grid geometry operations.

You can follow him on Twitter and GitHub.

Backbone, The Primer: Models and Collections

backbone

This is the second of four posts that guest author Greg MacWilliam has put together for Data Community DC: Backbone, The Primer. For more details about the API and an overview, please go back to the Introduction.

Using Models

First, let's build a single model that manages data for Kermit. We know that Kermit's REST endpoint is "/muppets/1" (ie: /muppets/:id, and Kermit's id is "1"). Configured as a Backbone model, Kermit looks like this:

var KermitModel = Backbone.Model.extend({
    url: '/muppets/1',
    defaults: {
        id: null,
        name: null,
        occupation: null
    }
});

Kermit's model does two things:

  • It defines a RESTful URL for his model to sync with, and…
  • It defines default attributes for his model. Default attributes are useful for representing API data composition within your front-end code. Also, these defaults guarantee that your model is always fully formed, even before loading its data from the server.

However, what IS that KermitModel object? When you extend a Backbone component, you always get back a constructor function. That means we need to create an instance of the model before using it:

var kermit = new KermitModel();

kermit.fetch().then(function() {
    kermit.get('name'); // >> "Kermit"
    kermit.get('occupation'); // >> "being green"
    kermit.set('occupation', 'muppet leader');
    kermit.save();
});

After creating a new instance of our Kermit model, we call fetch to have it load data from its REST endpoint. Calling fetch returns a promise object, onto which we can chain success and error callbacks. In the above example, we perform some basic actions on our model after loading it. Commonly used Model methods include:

  • fetch: fetches the model's data from its REST service using a GET request.
  • get: gets a named attribute from the model.
  • set: sets values for named model attributes (without saving to the server).
  • save: sets attributes, and then saves the model data to the server using a PUT request.
  • destroy: decommissions the model, and removes it from the server using a DELETE request.

Using Collections

Collections handle the loading and management of a list of models. We must first define a Model class for the list's items, and then attach that model class to a managing collection:

var MuppetModel = Backbone.Model.extend({
    defaults: {
        id: null,
        name: null,
        occupation: null
    }
});

var MuppetCollection = Backbone.Collection.extend({
    url: '/muppets',
    model: MuppetModel
});

In the above example, MuppetCollection will load data from the "/muppets" list endpoint. It will then construct the loaded data into a list of MuppetModel instances.

To load our collection of Muppet models, we build a collection instance and then call fetch:

var muppets = new MuppetCollection();

muppets.fetch().then(function() {
    console.log(muppets.length); // >> length: 1
});

Easy, right? However, there's a problem here: our collection only created a single model. We were supposed to get back a list of two items. Let's review again what the GET /muppets/ service returns…

{
    "total": 2,
    "page": 1,
    "perPage": 10,
    "muppets": [
        {
            "id": 1,
            "name": "Kermit",
            "occupation": "being green"
        },
        {
            "id": 2,
            "name": "Gonzo",
            "occupation": "plumber"
        }
    ]
}

We can see that this list data does indeed contain two records; however, our collection only created one model instance. Why? The reason is that Collections are derived from Arrays, while Models are derived from Objects. In this case, our root data structure is an Object (not an Array), so our collection tried to parse the returned data directly into a model.

What we really want is for our collection to populate its list from the "muppets" array property of the returned data object. To address this, we simply add a parse method onto our collection:

var MuppetCollection = Backbone.Collection.extend({
    url: '/muppets',
    model: MuppetModel,

    parse: function(data) {
        return data.muppets;
    }
});

A Collection's parse method receives raw data loaded from REST services, and may return a specific portion of that data to be loaded into the collection. In Backbone, both Models and Collections support the definition of a parse method. Using parse is very useful for reconciling minor differences between your API design and your front-end application architecture (which often won't map one-to-one, and that's okay!).

With the parse method in place, the following now happens upon fetching the collection:

var muppets = new MuppetCollection();

muppets.fetch().then(function() {
    console.log(muppets.length); // >> length: 2
});

muppets.get(1); // >> Returns the "Kermit" model, by id reference
muppets.get(2); // >> Returns the "Gonzo" model, by id reference
muppets.at(0); // >> Returns the "Kermit" model, by index
muppets.findWhere({name: 'Gonzo'}); // >> returns the "Gonzo" model

Success! The returned list of Muppets was parsed as expected into a collection of MuppetModel instances, and the Collection provided some basic methods for querying them. Commonly used Collection methods include (a short usage sketch follows the list):

  • fetch: fetches the collection's data from its REST service using a GET request.
  • create: adds a new model into the collection, and creates it at the API via POST.
  • add: adds a new model into the collection without telling the API.
  • remove: removes a model from the collection without telling the API.
  • get: gets a model from the collection by id reference.
  • at: gets a model from the collection by index.
  • findWhere: finds the first model matching a set of attribute values.
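For instance (a hypothetical sketch using the muppets collection from above), the non-syncing add and remove methods only change the local list:

var rowlf = new MuppetModel({id: 5, name: 'Rowlf', occupation: 'pianist'});

muppets.add(rowlf);    // added locally; no request is sent to the API
muppets.remove(rowlf); // removed locally; the API record (if any) is untouched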

Backbone CRUD

Create, Read, Update, and Destroy are the four major data interactions that an application must manage. Backbone Models and Collections work closely together to delegate these roles. In fact, the relationship of Models and Collections (not so coincidentally) mirrors the design of a RESTful API. To review:

var MuppetModel = Backbone.Model.extend({
    defaults: {
        id: null,
        name: null,
        occupation: null
    }
});

var MuppetCollection = Backbone.Collection.extend({
    url: '/muppets',
    model: MuppetModel
});

Notice above that the model class does NOT define a url endpoint to sync with. This is because models within a collection will automatically construct their url reference as "[collection.url]/[model.id]". This means that after fetching the collection, our Kermit model (with an id of "1") will automatically be configured to sync with a url of "/muppets/1".
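For instance (a quick hypothetical check, assuming the collection has already fetched its data), a model's url method reflects this convention:

var kermit = muppets.get(1);

kermit.url(); // >> "/muppets/1" (the collection's url plus the model's id)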

Create

Use Collection.create to POST new data to a list endpoint. The API should return complete data for the new database record, including its assigned id. The new model is created immediately within the front-end collection.

muppetsList.create({name: 'Piggy', occupation: 'fashionista'});

Read

Use Collection.fetch or Model.fetch to load data via GET. For models, you'll generally only need to call fetch for models without a parent collection.

kermit.fetch();
muppetsList.fetch();

Update

Use Model.save to PUT updated data for a model. The model's complete data is sent to the API.

kermit.save('occupation', 'being awesome');

Destroy

Use Model.destroy to DELETE a model instance. The model will remove itself from any parent collection, and issue a DELETE request to the API.

kermit.destroy();

Patch

Some API designs may also support using the PATCH method to perform partial model updates (where only modified data attributes are sent to the API). This design is less common, but it can be achieved in Backbone by calling Model.save and passing a {patch: true} option.

kermit.save('occupation', 'being awesome', {patch: true});

Binding Events

Backbone also provides a best-of-class Events framework. The major differentiator of Backbone Events is the support for context passing, or specifying what this refers to when an event handler is triggered:

target.on(event, handler, context)
target.off(event, handler, context)
target.trigger(event)

The other key feature of the Backbone Events framework is its inversion-of-control event binders:

this.listenTo(target, event, handler)
this.stopListening()

These reverse event binders make life easier by quickly releasing all of an object's bound events, thus aiding memory management. As a general rule, objects with a shorter lifespan should listen to objects with a longer lifespan, and clean up their own event references when deprecated.
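As a minimal sketch of that rule (the class and method names here are hypothetical), a short-lived view listens to its longer-lived model and releases every binding with a single call. Backbone's built-in View.remove also calls stopListening for you when a view is discarded:

var TickerView = Backbone.View.extend({
    initialize: function() {
        // Bound through the view, so the view tracks this listener:
        this.listenTo(this.model, 'change', this.render);
    },

    render: function() {
        // ...redraw from this.model...
        return this;
    },

    cleanup: function() {
        // Releases every handler this view registered via listenTo:
        this.stopListening();
    }
});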

Model & Collection Events

You may bind event handlers onto any model or collection (optionally passing in a handler context):

kermit.on('change', function() {
    // do stuff...
}, this);

Commonly tracked Model events include:

  • "change": triggered when the value of any model attribute changes.
  • "change:[attribute]": triggered when the value of the named attribute changes.
  • "sync": triggered when the model completes a data exchange with the API.

Commonly tracked Collection events include:

  • "add": triggered when a model is added to the collection.
  • "remove": triggered when a model is removed from the collection.
  • "reset": triggered when the collection is purged with a hard reset.
  • "sync": triggered when the collection completes a data exchange with the API.
  • [model event]: all child model events are proxied by their parent collection.

Review Backbone's catalog of built-in events for all available event triggers.
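For example (a brief sketch using the kermit and muppets instances from earlier), attribute-specific and collection-level bindings look like this:

kermit.on('change:occupation', function(model, value) {
    console.log('Occupation is now: ' + value);
});

muppets.on('add remove', function() {
    console.log('The collection now holds ' + muppets.length + ' models.');
});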

About the Author

Greg MacWilliam is an RIT alumnus, currently working as a freelance web software engineer. He specializes in single-page application architecture and API design. He's worked for organizations including the National Park Service and NPR, and has taught JavaScript at the Art Institute of Washington. Greg has self-published several online games, and actively contributes to BackboneJS projects.

Greg's current side projects include the Backbone extensions EpoxyJS and ContainerView, as well as ConstellationJS for grid geometry operations.

You can follow him on Twitter and GitHub.

Backbone, The Primer

backbone

This is the first of four posts that guest author Greg MacWilliam has put together for Data Community DC: Backbone, The Primer.

Editor's Note: Backbone.js is an MVC framework for front-end developers. It allows rapid development and prototyping of rich AJAX web applications in the same way that Django or Rails does for the backend. Part of the Data Science pipeline is reporting and visualization, and rich HTML reports that use JavaScript libraries like D3 or Highcharts are already part of the Data Scientist's toolkit. As you'll see from this primer, leveraging Backbone will equip Data Scientists with the means to quickly deploy to the front-end without having to wade through many of the impediments that typically plague browser development.

A common sentiment I hear from developers coming to Backbone is that they don't know where to start with it. Unlike full-featured frameworks with prescribed workflows (ie: Angular or Ember), Backbone is a lightweight library with few opinions. At its worst, some would say that Backbone has TOO few opinions. At its best, though, Backbone is a flexible component library designed to provide a baseline solution for common application design patterns.

The core of Backbone provides a comprehensive RESTful service package. This primer assumes that you have a basic understanding of REST services, and will focus on the interactions of Backbone components with RESTful data services.

What's In The Box?

The first thing to familiarize yourself with is the set of basic components provided by Backbone. There are three foundational components that make up Backbone applications:

  • Backbone.Model: Models store application data, and sync with REST services. A model may predefine its default attributes, and will emit events when any of its managed data attributes change.
  • Backbone.Collection: Collections manage a list of models, and sync with REST services. A collection provides basic search methods for querying its managed models, and emits events when the composition of its models changes.
  • Backbone.View: A view connects a model to its visual representation in the HTML Document Object Model (or, the "DOM"). Views render their associated model's data into the DOM, and capture user input from the DOM to send back into the model.
  • Bonus… Underscore! Backbone has two JavaScript library dependencies: jQuery and UnderscoreJS. Chances are good that you're familiar with jQuery. If you don't know Underscore, review its documentation. Underscore provides common functional programming utilities for working with data structures. When setting up Backbone, you get the full capabilities of Underscore as part of the package!

While Backbone does include additional useful features, we'll be focusing on these core components in this primer.
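As a quick preview of how these pieces are declared (a minimal sketch; each component is covered in detail in the following posts), every component is defined by extending its Backbone base class:

// Each Backbone component is subclassed via extend():
var MuppetModel = Backbone.Model.extend({});
var MuppetCollection = Backbone.Collection.extend({ model: MuppetModel });
var MuppetsListView = Backbone.View.extend({});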

Got REST Services?

For this primer, let's assume the following RESTful Muppets data service exists:

GET /muppets/

Gets a list of all Muppets within the application. Returns an array of all Muppet models (with some additional meta data):

{
    "total": 2,
    "page": 1,
    "perPage": 10,
    "muppets": [
        {
            "id": 1,
            "name": "Kermit",
            "occupation": "being green"
        },
        {
            "id": 2,
            "name": "Gonzo",
            "occupation": "plumber"
        }
    ]
}

POST /muppets/

Creates a new Muppet model based on the posted data. Returns the newly created model:

{
    "id": 3,
    "name": "Animal",
    "occupation": "drummer"
}

GET /muppets/:id

PUT /muppets/:id

DEL /muppets/:id

Gets, modifies, and/or deletes a specific Muppet model. All actions return the requested/modified model:

{
    "id": 1,
    "name": "Kermit",
    "occupation": "being green"
}

What's Next?

In the next three posts, we'll discuss in more detail Models and Collections, Views, Event Binding, and walk through the creation of a simple app based on the REST API described above.

  • Part One: Introduction
  • Part Two: Models and Collections
  • Part Three: Views
  • Part Four: A Simple App

About the Author

Greg MacWilliam is an RIT alumnus, currently working as a freelance web software engineer. He specializes in single-page application architecture and API design. He's worked for organizations including the National Park Service and NPR, and has taught JavaScript at the Art Institute of Washington. Greg has self-published several online games, and actively contributes to BackboneJS projects.

Greg's current side projects include the Backbone extensions EpoxyJS and ContainerView, as well as ConstellationJS for grid geometry operations.

You can follow him on Twitter and GitHub.

Setting up a Titan Cluster on Cassandra and ElasticSearch on AWS EC2

Guest Post by Jenny Kim

The purpose of this post is to provide a walkthrough of a Titan cluster setup and highlight some key gotchas I've learned along the way. This walkthrough will utilize the following versions of each software package:

Versions

  • Titan 0.4.1 (server distribution, which bundles Rexster Server)
  • Cassandra, via the DataStax Auto-Clustering Community AMI
  • Elasticsearch 0.90.7

The cluster in this walkthrough will utilize 2 M1.Large instances, which mirrors our current Staging cluster setup. A typical production graph cluster utilizes 4 M1.XLarge instances.

NOTE: While the DataStax Community AMI requires, at minimum, M1.Large instances, the exact instance type and cluster size should depend on your expected graph size, concurrent requests, and replication and consistency needs.

Part 1: Setup a Cassandra cluster

I followed Titan's EC2 instructions for standing up Titan on a Cassandra cluster using the Datastax Auto-Clustering AMI:

Step 1: Setting up Security Group

Navigate to the EC2 Console Dashboard, then click on Security Groups under Network & Security.

Create a new security group. Click Inbound. Set the “Create a new rule” dropdown menu to “Custom TCP rule”.

Add a rule for port 22 from source 0.0.0.0/0.

Add a rule for ports 1024-65535 from the security group members. If you don’t want to open all unprivileged ports among security group members, then at least open 7000, 7199, and 9160 among security group members.

Tip: the “Source” dropdown will autocomplete security group identifiers once “sg” is typed in the box, so you needn’t have the exact value ready beforehand.

Step 2: Launch DataStax Cassandra AMI

Launch the DataStax AMI in your desired zone

On the Instance Details page of the Request Instances Wizard, set “Number of Instances” to your desired number of Cassandra nodes (i.e. - 2)

Set “Instance Type” to at least m1.large.

On the Advanced Instance Options page of the Request Instances Wizard, select the “as text” radio button under “User Data”, then fill the following into the text box.

--clustername [cassandra-cluster-name]
--totalnodes [number-of-instances]
--version community 
--opscenter no

[number-of-instances] in this configuration must match the number of EC2 instances configured on the previous wizard page (i.e. - 2). [cassandra-cluster-name] can be any string used for identification. For example:

--clustername titan-staging
--totalnodes 2
--version community 
--opscenter no

On the Tags page of the Request Instances Wizard you can apply any desired tags. These tags exist only at the EC2 administrative level and have no effect on the Cassandra daemons’ configuration or operation.

  • It is useful here to set a tag for ElasticSearch to discover this node when identifying its cluster nodes. We will revisit this tag in the ElasticSearch section.

On the Create Key Pair page of the Request Instances Wizard, either select an existing key pair or create a new one. The PEM file containing the private half of the selected key pair will be required to connect to these instances.

On the Configure Firewall page of the Request Instances Wizard, select the security group created earlier.

Review and launch instances on the final wizard page. The AMI will take a few minutes to load.

Step 3: Verify Successful Instance Launch

SSH into any Cassandra instance node:

ssh -i [your-private-key].pem ubuntu@[public-dns-name-of-any-cassandra-instance]

Run the Cassandra nodetool command nodetool -h 127.0.0.1 ring to inspect the state of the Cassandra token ring.

  • You should see as many nodes in this command’s output as instances launched in the previous steps. Status should say UP for all rows.
  • Note that the AMI takes a few minutes to configure each instance. A shell prompt will appear upon successful configuration when you SSH into the instance.

If, upon shelling in, Cassandra still appears to be loading, press Ctrl-C to quit and restart Cassandra with:

sudo service cassandra restart

Part 2: Install Titan

Titan can be embedded within each Cassandra node-instance, or installed remotely from the cluster. I installed Titan on each Cassandra instance, but do not run it in embedded mode.

Step 1: Download Titan

SSH into a Cassandra instance node and within the ubuntu home directory, download the Titan 0.4.1 server distribution ZIP:

wget http://s3.thinkaurelius.com/downloads/titan/titan-server-0.4.1.zip

Unzip the Titan distribution and move it to /opt/:

unzip titan-server-0.4.1.zip
sudo mv titan-server-0.4.1 /opt/

cd to /opt/ and create a symlink from /opt/titan to /opt/titan-server-0.4.1

cd /opt/
sudo ln -s titan-server-0.4.1 titan

Step 2: Configure Titan

We need to create a specific Titan configuration file that can be used when we run the Gremlin shell. This configuration will include our storage settings, cache settings, and search index settings.

Create a new properties file (i.e. - mygraph.properties) within the /opt/titan/conf folder:

vim /opt/titan/conf/mygraph.properties

The storage settings should specify the backend as Cassandra and include the private IP of one of the Cassandra nodes. Additional Cassandra configurations are listed here: https://github.com/thinkaurelius/titan/wiki/Using-Cassandra#cassandra-specific-configuration

storage.backend=cassandra
storage.hostname=172.12.191.2

The database cache settings should be enabled in a Production environment. Full documentation is found here: https://github.com/thinkaurelius/titan/wiki/Database-Cache. For our purposes, we will just enable the db-cache, set the clean time (milliseconds to wait to clean cache), cache-time (max milliseconds to hold items in cache), and cache-size (percentage of total heap space available to the JVM that Titan runs in).

cache.db-cache = true
cache.db-cache-clean-wait = 50
cache.db-cache-time = 10000
cache.db-cache-size = 0.25

The search index settings should specify "elasticsearch" as the external search backend, and configure it for the remote ElasticSearch cluster.

  • First you must set up an ElasticSearch cluster, which we have done using the same nodes as the Cassandra cluster.
  • Refer to the Deploying an ElasticSearch Cluster post for instructions on how to do this.
  • Make note of the cluster name and host IPs for all nodes in the ES cluster.

Based on the above ES cluster settings, add the following to your properties file, replacing the hostnames and cluster name with your specific settings:

storage.index.search.backend=elasticsearch
storage.index.search.hostname=[es-node-ips]
storage.index.search.cluster-name=[es-cluster-name]
storage.index.search.index-name=titan
storage.index.search.client-only=true
storage.index.search.sniff=false
storage.index.search.local-mode=false

Save the file, and test in Gremlin:

bin/gremlin.sh
gremlin> g = TitanFactory.open('conf/mygraph.properties')

You should see Gremlin connect to the Cassandra cluster and return a blank Gremlin prompt. Success! Keep Gremlin open for the next Step.

Step 3: Run Indices

Now before we add any data to our graph, we need to do a one-time setup of any Titan and ElasticSearch property and label indices. This must be done with caution because in Titan, indexes cannot be modified, dropped, or added on existing properties (Titan Limitations).

Note that we created a script for our indices and tracked them in Github to quickly adapt our indices when updating and reloading a new Graph. Also keep in mind that Titan 0.4.1 has a new index syntax that is different and not backwards compatible with the old Titan 0.3.2 syntax.

In the Gremlin shell, copy and paste the indices script. If all indices run successfully, commit, shut down, and exit:

gremlin> g.commit()
gremlin> g.shutdown()
gremlin> exit

(Optional) Part 3: Load GraphSON

If you are doing a bulk load of GraphSON into Titan, you can do so via Faunus or Gremlin. The GraphSON format for each method is unique, so you will need to ensure that your GraphSON format adheres to the expected rules. This walkthrough will focus on the Gremlin GraphSON load.

Save the graphSON file (i.e. - gremlin.graphson.json) to the root of the Titan directory.

Edit the bin/gremlin.sh script file to increase the JVM heap size max to 4GB (ensuring enough memory on the machine), to avoid "GC overhead limit exceeded" errors.

(bin/gremlin.sh)
(Line 25)  JAVA_OPTIONS="-Xms32m -Xmx4096m"

Create a new groovy script to load the GraphSON and auto-commit:

(loader.groovy)
g = TitanFactory.open('conf/mygraph.properties')
g.loadGraphSON('gremlin.graphson.json')
g.commit()

Run the Groovy script through Gremlin

bin/gremlin.sh -e loader.groovy

This will take a while... a 500 MB GraphSON file generally takes about 1.5 hours to finish loading (assuming no errors).

Part 4: Configure Rexster

Titan 0.4.1 Server now ships with Rexster Server, so you can run a Titan-configured Rexster server by running the rexster.sh script from Titan's bin directory.

Step 1: Rexster Configuration

To run Rexster, you will need to create a Rexster configuration XML which Rexster by default expects to be under $TITAN_HOME/rexhome/config/rexster.xml (i.e. - /opt/titan/rexhome/config/rexster.xml).

Under /opt/titan/rexhome create a config directory

 mkdir /opt/titan/rexhome/config

Create a rexster.xml document under the config directory. Alternatively, copy the /opt/titan/conf/rexster-cassandra-es.xml file to /opt/titan/rexhome/config/rexster.xml.

The Rexster configuration needs a few changes to properly connect to the Cassandra cluster and ElasticSearch cluster. Here is an example configuration: https://gist.github.com/spaztic1215/7e4303b75184098e64fc

Update the base-uri property on Line 5 to the current instance's DNS:

http://ec2-54-193-46-179.us-west-1.compute.amazonaws.com

Update the graph properties to connect to Cassandra and ES:

    
The <graph> entry in the example configuration names the graph ("graph"), sets the graph type to com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration, and carries the same Titan settings used in mygraph.properties: the cassandra storage backend and hostname, the database cache options, and the elasticsearch index settings (search hostnames, cluster name, and index name; "secondaryOne,secondaryTwo", "fife", and "titan" in the linked example). It also allows the tp:gremlin Rexster extension.

Step 2: Test Rexster

Now we should test Rexster Server to make sure it is connecting properly to Titan and ES:

Change directory to $TITAN_HOME (i.e. - /opt/titan) and start Rexster manually:

bin/rexster.sh -s

You should see a bunch of Rexster console messages that indicate that it has connected to the Cassandra cluster and ES cluster. You can verify that Rexster is running in a browser by going to:

http://[public-dns-of-instance]:8182

Verify that the Titan graph is found by Rexster:

http://[public-dns-of-instance]:8182/graphs/graph/

Step 3: Create a Rexster Upstart Script

Once you've confirmed that Rexster can successfully start, we will create an upstart script to manage the Rexster start/stop process.

Under /etc/init, create a configuration called rexster.conf

A sample rexster upstart configuration can be found here: https://gist.github.com/spaztic1215/5bfc2ee2d370b933c8ca

Note that the above configuration assumes that you are using the Datastax AMI which includes a raid0 directory, and thus logs to the /raid0/log/rexster directory. You must create this directory before starting the script:

mkdir /raid0/log/rexster

Save the file, and start Rexster with upstart:

sudo start rexster

Check the log to ensure successful startup:

cd /raid0/log/rexster
tail rexster.log

Success!

You now have a fully configured Cassandra, ElasticSearch, Titan, Rexster setup on a single node. Once you've applied this configuration to all your nodes, you can start Rexster Server on the entire cluster, set up an ELB that contains your Rexster instances binding port 80 to 8182, and start accepting Rexster requests from the ELB's domain.


Jenny Kim

Jenny Kim is a senior software engineer at Cobrain, where she works with the data science team. Jenny graduated from the University of Maryland with a B.S. in Computer Science and a B.A. in American Studies. She earned her master's in Information Systems Technology from The George Washington University in December 2013. In her free time, Jenny enjoys volunteering at local film festivals, obsessive vacuuming, and relaxing with the family Shih Tzu.

Instructions for deploying an Elasticsearch Cluster with Titan

Elasticsearch

Elasticsearch is an open source distributed real-time search engine for the cloud. It allows you to deploy a scalable, auto-discovered cluster of nodes, and as search capacity grows, you simply need to add more nodes and the cluster will reorganize itself. Titan, a distributed graph engine by Aurelius, supports Elasticsearch as an option to index your vertices for fast lookup and retrieval. By default, Titan supports Elasticsearch running in the same JVM and storing data locally on the client, which is fine for embedded mode. However, once your Titan cluster starts growing, you have to respond by growing an Elasticsearch cluster side by side with the graph engine.

This tutorial shows how to quickly get an Elasticsearch cluster up and running on EC2, and then how to configure Titan to use it for indexing. It assumes you already have an EC2/Titan cluster deployed. Note that these instructions were written for a particular deployment, so please post any questions about specifics in the comments!

Step 1: Installation

NOTE: These instructions assume you've installed Java 6 or later.

By far the easiest way to install Elasticsearch on an Ubuntu EC2 instance is the Debian package that is provided as a download. This package installs an init.d script, places the configuration files in /etc/elasticsearch, and generally takes care of housekeeping we would otherwise have to do ourselves. You can find the .deb on the Elasticsearch download page.

$ cd /tmp
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb.sha1.txt
$ sha1sum elasticsearch-0.90.7.deb && cat elasticsearch-0.90.7.deb.sha1.txt 
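
If you'd rather use curl, the equivalent download looks like this; the two hashes printed at the end should be identical:

$ curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb
$ curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb.sha1.txt
$ sha1sum elasticsearch-0.90.7.deb | awk '{print $1}'
$ cat elasticsearch-0.90.7.deb.sha1.txt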

Note that you may have to use the --no-check-certificate flag with wget, or use curl as sketched above; just make sure you use the correct filenames. Also ensure that the checksums match (and, to be even more paranoid, check them against the Elasticsearch website). Installation is simple:

$ sudo dpkg -i elasticsearch-0.90.7.deb

Elasticsearch will now be running on your machine with the default configuration. To check this you can do the following:

$ sudo status elasticsearch
$ ps -ef | grep elasticsearch

But while we configure it, it doesn't really need to be running:

$ sudo service elasticsearch stop

In particular, installing the package does the following things you should be aware of (you can confirm the layout yourself with dpkg, as shown after the list):

  1. Creates the elasticsearch:elasticsearch user and group
  2. Installs the library into /usr/share/elasticsearch
  3. Creates the logging directory at /var/log/elasticsearch
  4. Creates the configuration directory at /etc/elasticsearch
  5. Creates a data directory at /var/lib/elasticsearch
  6. Creates a temp work directory at /tmp/elasticsearch
  7. Creates an init script at /etc/init.d/elasticsearch
  8. Creates a defaults file at /etc/default/elasticsearch
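
You can verify the installed layout and the defaults file directly:

$ dpkg -L elasticsearch | head -n 20
$ cat /etc/default/elasticsearch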

Because of our particular Titan deployment, this is not good enough for what we're trying to accomplish, so the next step is configuration.

Step 2: Configuration

The configuration we're looking for is an auto-discovered EC2 elastic cluster that is bound to the default ports and works with data on the attached volume rather than on the much smaller root disk. In order to autodiscover on EC2 we have to install an AWS plugin, which can be found on the cloud-aws plugin GitHub page:

$ cd /usr/share/elasticsearch
$ bin/plugin -install elasticsearch/elasticsearch-cloud-aws/1.15.0

Elasticsearch is configured via a YAML file in /etc/elasticsearch/elasticsearch.yml, so open up your editor and use the configuration below as a guide:

path:
    conf: /etc/elasticsearch
    data: /raid0/elasticsearch
    work: /raid0/tmp/elasticsearch
    logs: /var/log/elasticsearch
cluster:
    name: DC2
cloud:
    aws:
        access_key: ${AWS_ACCESS_KEY_ID}
        secret_key: ${AWS_SECRET_ACCESS_KEY}
discovery:
    type: ec2

For us, the other defaults worked just fine. So let's go through this a bit. First off, for all of the paths, make sure that they exist (create them if needed) and that they have the correct permissions. The raid0 folder is where we have mounted an EBS volume that contains enough non-ephemeral storage for our data services. Although this does add some network overhead, it prevents data loss when the instance is terminated. However, if you're not working with EBS, or you've mounted in a different location, using the root directory defaults is probably fine.

$ sudo mkdir /raid0/elasticsearch
$ sudo chown elasticsearch:elasticsearch /raid0/elasticsearch
$ sudo chmod 775 /raid0/elasticsearch
$ sudo mkdir -p /raid0/tmp/elasticsearch
$ sudo chmod 777 /raid0/tmp
$ sudo chown elasticsearch:elasticsearch /raid0/tmp/elasticsearch
$ sudo chmod 775 /raid0/tmp/elasticsearch

Editor's Note: I just discovered that you can actually set these options with the dpkg command so that you don't have to do it manually. See the elasticsearch as a service on linux guide for more.

The cluster name, in our case DC2, needs to be the same for every node in the cluster; this is especially important on EC2, where the default name, elasticsearch, could make discovery more difficult. Also note that each node can be named separately, but by default the name is selected randomly from a list of 3,000 or so Marvel characters. The cloud and discovery options allow discovery through EC2.

You should now be able to run the cluster:

$ sudo service elasticsearch start

Check the logs to make sure there are no errors, and that the cluster is running. If so, you should be able to navigate to the following URL:

http://localhost:9200/_cluster/health?pretty=true

By replacing localhost with the node's hostname, you can see the status of the cluster, as well as the number of nodes. But wait, why aren't more nodes being added? Don't keep waiting! The reason is that Titan has probably already been configured to use local Elasticsearch and is blocking port 9300, the communication and control port for the ES cluster.

Configuring Titan

Titan's own local Elasticsearch is blocking the cluster, and in any case we want Titan to use the Elasticsearch cluster! Let's reconfigure Titan. First, open up your favorite editor and change the configuration in /opt/titan/config/yourgraph.properties to the following:

storage.backend=cassandra
storage.hostname=${LOCAL_IPADDR}

storage.index.search.backend=elasticsearch
storage.index.search.client-only=true
storage.index.search.hostname=${ES_ADDR},${ES_ADDR},${ES_ADDR}

Hopefully you won't need to change the storage.backend and storage.hostname configurations. Remove the storage.index.search.local-mode and storage.index.search.directory configurations, and add the index configurations shown above.

For storage.index.search.hostname, add a comma-separated list of every node in the ES cluster (for now).

That's it! Reload Titan, and you should soon see the cluster grow to include all the nodes you configured, as well as a speed up in queries to the Titan graph!
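
A quick way to watch the nodes join from any machine in the cluster (replace localhost with a node's hostname if you're checking remotely):

$ curl -s 'http://localhost:9200/_cluster/health?pretty=true' | grep number_of_nodes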

Analyzing Social Media Networks using NodeXL

This is a guest post from Marc Smith, Chief Social Scientist at Connected Action Consulting Group and a developer of NodeXL, an Excel-based system for (social) network analysis. Marc will be leading a workshop on NodeXL, offered through Data Community DC, on Wednesday, November 13th. If the below piques your fancy, please register. Parts of this post appeared first on connectedaction.net.

I am excited to have the opportunity to present a NodeXL workshop with Data Community DC on November 13th at 6pm in Washington, D.C.

In this session I will describe the ways NodeXL can simplify the process of collecting, storing, analyzing, visualizing and publishing reports about connected structures. NodeXL supports the exploration of social media with import features that pull data from personal email indexes on the desktop, Twitter, Flickr, Youtube, Facebook and WWW hyperlinks. NodeXL allows non-programmers to quickly generate useful network statistics and metrics and create visualizations of network graphs.  Filtering and display attributes can be used to highlight important structures in the network.  Innovative automated layouts make creating quality network visualizations simple and quick.

The 5 steps for Social Media Network Analysis

For example, this map of the connections among the people who recently tweeted about the DataCommunityDC Twitter account was created with just a few clicks and no coding:

DataCommunityDC Twitter NodeXL SNA Map and Report for Tuesday, 05 November 2013 at 15:15 UTC

This graph represents a network of 67 Twitter users whose recent tweets contained “DataCommunityDC", taken from a data set limited to a maximum of 10,000 tweets. The network was obtained from Twitter on Tuesday, 05 November 2013 at 15:15 UTC. The tweets in the network were tweeted over the 7-day, 16-hour, 4-minute period from Monday, 28 October 2013 at 22:38 UTC to Tuesday, 05 November 2013 at 14:42 UTC. There is an edge for each “replies-to” relationship in a tweet. There is an edge for each “mentions” relationship in a tweet. There is a self-loop edge for each tweet that is not a “replies-to” or “mentions”.

The network has been segmented into groups (“G1, G2, G3…”) and each group is labeled with the words most frequently used in the tweets from the people in that group. The size of each Twitter user’s profile picture represents the log scaled value of their follower count.

Analysis of the network location of each participant reveals the people in key locations in the network, people at the “center” of the graph.

For more examples, please see the NodeXL Graph Gallery at: http://nodexlgraphgallery.org/Pages/Default.aspx