Resources

Problems with the p-value -- References

On December 11th, Prof. Regina Nuzzo from Gallaudet University spoke at Data Science DC about problems with the p-value. The event was well-received; if you missed it, the slides and audio are available. Here we provide Dr. Nuzzo's references and links from the talk, which are on their own a great resource for those thinking about how to communicate statistical reliability. (Note that the five topics she covered used examples from highly-publicized studies of sexual behavior.)

Announcing the Publication of Practical Data Science Cookbook

Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. 

Announcing Discussion Lists! First up: Deep Learning

Data Community DC is pleased to announce a new service to the area data community: topic-specific discussion lists! In this way we hope to extend the successes of our Meetups and workshops by providing a way for groups of local people with similar interests to maintain contact and have ongoing discussions. Our first discussion list will be on the topic of Deep Learning. Below is a guest post from John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup. A while back, there was this blog post about Deep Learning. At the end, we asked readers about their interest in hands-on Deep Learning tutorials.

ELEVEN

The results are in, and the survey went to 11. And as in all data science, context matters--and this eleven is decidedly less inspiring than Nigel Tufnel's eleven. That said, ten out of eleven respondents wanted a hands-on Deep Learning tutorial, and eight respondents said they would register for a tutorial even if it required hardware approval or enrollment in a hardware tutorial. But interest in practical hands-on Deep Learning workshops appears to be highly nonuniform. One respondent said they'd drive from hundreds of miles away for these workshops, but of the 3,000+ data scientists in DC's data and analytics community, most of them presumably local, only eleven total responded with interest.

In short, the survey was a bust.

So it’s still not clear what the area data community wants out of Deep Learning, if anything, but since April I’ve gotten plenty of questions from plenty of people about Deep Learning on everything from hardware to parameter tuning, so I know there’s more interest than what we got back on the survey. Since a lot of these questions are probably shared, a discussion list might help us figure out how we can best help the most members get started in Deep Learning.

So how about a Deep Learning discussion list? If you’re a local and want to talk about Deep Learning, sign up here:

https://groups.google.com/a/datacommunitydc.org/d/forum/deeplearning

For the record, this discussion list was Harlan's original suggestion. If you're looking to take away any rules of thumb here, a simple one is "just agree with whatever Harlan says." Tommy Jones and I will run this discussion list for now. To be clear, this list caters to the specific Deep Learning interests of data enthusiasts in the DC area. For a bigger community, there's always deeplearning.net, the Deep Learning Google+ page, and individual mailing lists and Git repos for specific Deep Learning codebases, like Caffe, pylearn2, and Torch7.

In the meantime, I was happy to see some Deep Learning interest at DC NLP’s Open Mic night by Christo Kirov. And NLP data scientists need not watch Deep Learning developments from the sidelines anymore; some recent motivating results in the NLP space have been summarized in a tutorial by Richard Socher. I’m not qualified to say whether these are the kind of historic breakthroughs we’ve recently seen in speech recognition and object recognition, but it’s worth taking a look at what's happening out there.

High-Performance Computing in R Workshop

Data Community DC and District Data Labs are excited to be hosting a High-Performance Computing with R workshop on June 21st, 2014, taught by Yale professor and R package author Jay Emerson. If you're interested in learning about high-performance computing, including concepts such as memory management, algorithmic efficiency, parallel programming, handling larger-than-RAM matrices, and using shared memory, this is an awesome way to learn!

To reserve a spot, go to http://bit.ly/ddlhpcr.

Overview This intermediate-level masterclass will introduce you to topics in high-performance computing with R. We will begin by examining a range of related topics including memory management and algorithmic efficiency. Next, we will quickly explore the new parallel package (containing snow and multicore). We will then concentrate on the elegant framework for parallel programming offered by the foreach package and its associated parallel backends. The R package management system, including the C/C++ interface and use of the Rcpp package, will be covered. We will conclude with basic examples of handling larger-than-RAM numeric matrices and use of shared memory. Hands-on exercises will be used throughout.

What will I learn? Different people approach statistical computing with R in different ways. It can be helpful to work on real data problems and learn something about R “on the fly” while trying to solve a problem. But it is also useful to have a more organized, formal presentation without the distraction of a complicated applied problem. This course offers four distinct modules which adopt both approaches and offer some overlap across the modules, helping to reinforce the key concepts. This is an active-learning class where attendees will benefit from working along with the instructor. Roughly, the modules include:

An intensive review of the core language syntax and data structures for working with and exploring data: functions; conditionals; arguments; loops; subsetting; manipulating and cleaning data; and efficiency considerations and best practices, including loops and vector operations, memory overhead, and optimizing performance.

Motivating parallel programming with an eye on programming efficiency: a case study. Processing, manipulating, and conducting a basic analysis of 100-200 MB of raw microarray data provides an excellent challenge on standard laptops. It is large enough to be mildly annoying, yet small enough that we can make progress and see the benefits of programming efficiency and parallel programming.

Topics in high-performance computing with R, including packages parallel and foreach. Hands-on examples will help reinforce key concepts and techniques.

Authoring R packages, including an introduction to the C/C++ interface and the use of Rcpp for high-performance computing. Participants will build a toy package including calls to C/C++ functions.

Is this class right for me? This class will be a good fit for you if you are comfortable working in R and are familiar with R's core data structures (vectors, matrices, lists, and data frames). You are comfortable with for loops and preferably aware of R's apply-family of functions. Ideally you will have written a few functions on your own. You have some experience working with R, but are ready to take it to the next level. Or, you may have considerable experience with other programming languages but are interested in quickly getting up to speed in the areas covered by this masterclass.

After this workshop, what will I be able to do? You will be in a better position to code efficiently with R, perhaps avoiding the need, in some cases, to resort to C/C++ or parallel programming. But you will be able to implement so-called embarrassingly parallel algorithms in R when the need arises, and you'll be ready to exploit R's C/C++ interface in several ways. You'll be in a position to author your own R package that can include C/C++ code.

All participants will receive electronic copies of all slides, data sets, exercises, and R scripts used in the course.

What do I need to bring? You will need your laptop with the latest version of R. I recommend using the RStudio IDE, but it is not necessary. A few add-on packages will be used in the workshop, including Rcpp and foreach. As a complement to foreach you should also install doMC (Linux or MacOS only) and doSNOW (all platforms). If you want to work along with the C/C++ interface segment, some extra preparation will be required: Rcpp and the C/C++ interface require compilers and extra tools, and the folks at RStudio have a nice page that summarizes the requirements. Please note that these requirements may not be trivial (particularly on Windows) and need to be completed prior to the workshop if you intend to compile C/C++ code and use Rcpp during the workshop.

Instructor John W. Emerson (Jay) is Director of Graduate Studies in the Department of Statistics at Yale University. He teaches a range of graduate and undergraduate courses as well as workshops, tutorials, and short courses at all levels around the world. His interests are in computational statistics and graphics, and his applied work ranges from topics in sports statistics to bioinformatics, environmental statistics, and Big Data challenges.

He is the author of several R packages including bcp (for Bayesian change point analysis), bigmemory and sister packages (towards a scalable solution for statistical computing with massive data), and gpairs (for generalized pairs plots). His teaching style is engaging and his workshops are active, hands-on learning experiences.

You can reserve your spot by going to http://bit.ly/ddlhpcr.

Facility Location Analysis Resources Incorporating Travel Time

This is a guest blog post by Alan Briggs. Alan is an operations researcher and data scientist at Elder Research. Alan and Harlan Harris (DC2 President and Data Science DC co-organizer) have co-presented a project on location analysis and Meetup location optimization at the Statistical Programming DC Meetup and an INFORMS-MD chapter meeting. Recently, Harlan presented a version of this work at the New York Statistical Programming Meetup. There was some great feedback on the Meetup page asking for some additional resources. This post by Alan is in response to that request.

If you're looking for a good text resource to learn some of the basics about facility location, I highly recommend grabbing a chapter of Dr. Michael Kay's e-book (PDF) available for free from his logistics engineering website. He gives an excellent overview of some of the basics of facility location, including single facility location, multi-facility location, facility location-allocation, etc. At ~20 pages, it's entirely approachable, but technical enough to pique the interest of the more technically-minded analyst. For a deeper dive into some of the more advanced research in this space, I'd recommend using some of the subject headings in his book as seeds for a simple search on Google Scholar. It's nothing super fancy, but there are plenty of articles in the public domain that relate to minisum/minimax optimization and all of their narrowly tailored implementations.

One of the really nice things about the data science community is that it is full of people with an inveterate dedication to critical thinking. There is nearly always some great [constructive] criticism about what didn't quite work or what could have been a little bit better. One of the responses to the presentation recommended optimizing with respect to travel time instead of distance. Obviously, in this work, we're using Euclidean distance as a proxy for time. Harlan cites laziness as the primary motivation for this shortcut, and I'll certainly echo his response. However, a lot of modeling boils down to cutting your losses, getting a good enough solution, or trying to find the balance among your own limited resources (e.g., time, data, technique). For the problem at hand, clearly the goal is to make Meetup attendance convenient for the greatest number of people; we want to get people to Meetups. But, for our purposes, a rough order of magnitude was sufficient. Harlan humorously points out that the true optimal location for a version of our original analysis in DC was an abandoned warehouse, if it really was an actual physical location at all. So, when you really just need a good solution, and the precision associated with true optimality goes unappreciated, distance can be a pretty good proxy for time.

A lot of times good enough works, but there are some obvious limitations. In statistics, the law of large numbers holds that independent measurements of a random quantity tend toward the theoretical average of that quantity. For this reason, logistics problems can be more accurately estimated when they involve a large number of entities (e.g. travelers, shipments, etc.). For the problem at hand, if we were optimizing facility location for 1,000 or 10,000 respondents, again, using distance as a proxy for time, we would feel much better about the optimality of the location. I would add that similar to large quantities, introducing greater degrees of variability can also serve to improve performance. Thus, optimizing facility location across DC or New York may be a little too homogeneous. If instead, however, your analysis uses data across a larger, more heterogeneous area, like say an entire state where you have urban, suburban and rural areas, you will again get better performance on optimality.

Let's say you've weighed the pros and cons and you really want to dive deeper into optimizing based on travel time; there are a couple of different options you can consider. First, applying a circuity factor to the Euclidean distance can account for non-direct travel between points. So, to go from point A to point B may actually take 1.2 units as opposed to the straight-line distance of 1 unit. That may not be as helpful for a really complex urban space where feasible routing is not always intuitive and travel times can vary wildly. However, it can give some really good approximations, again, over larger, more heterogeneous spaces. An extension to a single circuity factor would be to introduce a gradient circuity factor that is proportional to population density. There are some really good zip code data available that can be used to estimate population.

Increasing the circuity factor in higher-density locales and decreasing it in lower-density ones can help by providing a more realistic assessment of how far off the straight-line distance the average person would have to commute. For the really enthusiastic modeler who has some good data skills and is looking for even more integrity in their travel time analysis, there are a number of websites that provide road network information. They list roads across the United States by functional class (interstate, expressway and principal arterial, urban principal arterial, collector, etc.) and even provide shapefiles and the like for GIS. I've done basic speed limit estimation by functional class, but you could also introduce a speed gradient with respect to population density (as we mentioned above for the circuity factor). You could also derive some type of inverse speed distribution that slows traffic at certain times of the day based on rush hour travel in or near urban centers.
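
To make the idea concrete, here is a minimal Python sketch of the circuity-factor approach; the function names, the factor range of 1.2 to 1.6, and the density cap are made-up values for illustration, not calibrated parameters:

import math

def circuity_factor(pop_density, low=1.2, high=1.6, density_cap=10000):
    # Interpolate between a rural factor (low) and a dense-urban factor (high).
    # The bounds and the density cap are assumptions for this sketch.
    frac = min(pop_density, density_cap) / density_cap
    return low + frac * (high - low)

def estimated_travel_distance(a, b, pop_density):
    # Approximate road distance between points a and b (x, y coordinates in miles).
    euclidean = math.hypot(a[0] - b[0], a[1] - b[1])
    return euclidean * circuity_factor(pop_density)

# The same 5-mile straight-line trip in a rural vs. a dense urban area:
print(estimated_travel_distance((0, 0), (3, 4), pop_density=500))    # about 6.1 miles
print(estimated_travel_distance((0, 0), (3, 4), pop_density=12000))  # about 8.0 miles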

In the words of George E. P. Box, “all models are wrong, but some models are useful.” If you start with a basic location analysis and find it to be wrong but useful, you may have done enough. If not, however, perhaps you can make it useful by increasing complexity in one of the ways I have mentioned above.

Flask Mega Meta Tutorial for Data Scientists

Introduction

Data science isn't all statistical modeling, machine learning, and data frames. Eventually, your hard work pays off and you need to give back the data and the results of your analysis; those blinding insights that you and your team uncovered need to be operationalized in the final stage of the data science pipeline as a scalable data product. Fortunately for you, the web provides an equitable platform to do so and Python isn't all NumPy, Scipy, and Pandas. So, which Python web application framework should you jump into?


There are numerous web application frameworks for Python, with the 800-lb gorilla being Django, "the web framework for perfectionists with deadlines." Django has been around since 2005 and, as the community likes to say, is definitely "batteries included," meaning that Django comes with a large number of bundled components (i.e., decisions that have already been made for you). Django ships with its chosen object-relational mapper (ORM) to make database access simple, a beautiful admin interface, a template system, a cache system for improving site performance, internationalization support, and much more. As a result, Django has a lot of moving parts and can feel large, especially for beginners. This behind-the-scenes magic can obscure what actually happens and complicates matters when something goes wrong. From a learning perspective, this can make gaining a deeper understanding of web development more challenging.

Enter Flask, the micro web framework written in Python by Armin Ronacher. Purportedly, it came out as an April Fool's joke but proved popular enough not to go quietly into the night. The incredible thing about Flask is that you can have a web application running in about 10 lines of code and a few minutes of effort. Despite this, Flask can and does run real production sites.
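
To make that claim concrete, here is roughly what the canonical minimal Flask application looks like (essentially the hello-world example from the official docs, with a DC2 greeting swapped in):

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # A view function simply returns the response body.
    return "Hello, Data Community DC!"

if __name__ == "__main__":
    # Runs Flask's built-in development server on http://127.0.0.1:5000/
    app.run(debug=True)

Save that as a single file, run it with Python, and you have a working web application.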

As you need additional functionality, you can add it through the numerous extensions available. Need an admin interface? Try Flask-Admin. Need user session management and the ability for visitors to log in and out? Check out Flask-Login.

While Django advocates may argue that by the time you add all of the "needed" extensions to Flask, you get Django, others would argue that a micro-framework is the ideal platform to learn web development. Not only does the small size make it easier for you to get your head around it at first, but the fact that you need to add each component yourself requires you to understand fully the moving parts involved. Finally, and very relevant to data scientists, Flask is quite popular for building RESTful web services and creating APIs.
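
To give a flavor of that RESTful use case, a sketch of a tiny JSON endpoint might look like the following; the route, the in-memory data, and the payload are invented for illustration, but jsonify is Flask's standard helper for returning JSON responses:

from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory "model output"; a real data product would load a
# trained model or query a database here instead.
SCORES = {"alice": 0.87, "bob": 0.42}

@app.route("/api/v1/score/<name>")
def score(name):
    if name not in SCORES:
        return jsonify(error="unknown name"), 404
    return jsonify(name=name, score=SCORES[name])

if __name__ == "__main__":
    app.run(debug=True)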

As Flask documentation isn't quite as extensive as the numerous Django articles on the web, I wanted to assemble the materials that I have been looking at as I explore the world of Flask development.

For Beginners - Getting Started

The Official Flask Docs

The official Flask home page has a great set of documentation, list of extensions to add functionality, and even a tutorial to build a micro blogging site. Check it out here.

Blog post on Learning Flask

This tutorial is really a hidden gem designed for those who are new to web programming and Flask. It contains lots of small insights into the process via examples that really help connect the dots.

Full Stack Python

While not exclusively focused on Flask, Fullstackpython.com is a fantastic page that offers a great amount of information and links about the full web stack, from servers, operating systems, and databases to configuration management, source control, and web analytics, all from the perspective of Python. A must-read for those new to web development.

Starting A Python Project the Right Way

Jeff Knupp again walks the beginner Python developer through how he starts a new Python project.


For Intermediate Developers

The Flask Mega Tutorial

The Flask Mega Tutorial is aptly named as it has eighteen separate parts walking you through the numerous aspects of building a website with Flask and various extensions. I think this tutorial is pretty good but have heard some comments that it can be a bit much if you don't have a strong background in web development.

Miguel Grinberg's tutorial has been so popular that it is even being used as the basis of an O'Reilly book, Flask Web Development, that is due out on May 25, 2014. An early release, so-called "Raw & Unedited," is currently available here.

A snippet about the book is below:

If you have Python experience, and want to learn advanced web development techniques with the Flask micro-framework, this practical book takes you step-by-step through the development of a real-world project. It’s an ideal way to learn Flask from the ground up without the steep learning curve. The author starts with installation and brings you to more complicated topics such as database migrations, caching, and complex database relationships.

Each chapter in this book focuses on a specific aspect of the project, first by exploring background on the topic itself, and then by walking you through a hands-on implementation. Once you understand the basics of Flask development, you can refer back to individual chapters to reinforce your grasp of the framework.

Flask Web Development

Beating O'Reilly to the punch, Packt Publishing started offering the book "Instant Flask Web Development" by Ron DuPlain in August 2013.

Flask - Blog: Building a Flask Blog: Part 1

This blog contains two different tutorials in separate posts. This one tackles the familiar task of building a blog using Flask-SQLAlchemy, WTForms, Flask-WTF, Flask-Migrate, WebHelpers, and PostgreSQL. The second one shows the creation of a music streaming app.

Flask with uWSGI + Nginx

This short tutorial and GitHub repository shows you how to set up a simple Flask app with uWSGI and the Nginx web server.

Python Web Applications with Flask

This extensive three-part blog post from Real Python works its way through the development of a mid-sized web analytics application. The tutorial was last updated November 17th, 2013, and has the added bonus of demonstrating the use of virtualenv and GitHub as well.

Python and Flask are Ridiculously Powerful

Jeff Knupp, author of Idiomatic Python, is fed up with online credit card payment processors, so he builds his own with Flask in a few hours.

More Advanced Content

The Blog of Erik Taubeneck

This is the blog of Erik Taubeneck, "a mathematician/economist/statistician by schooling, a data scientist by trade, and a python hacker by night." He is a Hacker School alum in NYC and a contributor to a number of Flask extensions (and an all-around good guy).

How to Structure Flask Applications

Here is a great post, recommended by Erik, discussing how one seasoned developer structures his Flask applications.

Modern Web Applications with Flask and Backbone.js - Presentation

Here is a presentation about building a "modern web application" using Flask for the backend and Backbone.js in the browser. Slide 2 shows a great timeline of the last 20 years in the history of web applications, and slide 42 is a great overview of the division of labor between Flask extensions in the backend and JavaScript in the frontend.

Modern Web Application Framework: Python, SQL Alchemy, Jinja2 & Flask - Presentation

This 68-slide presentation on Flask from Devert Alexandre is an excellent resource and tutorial discussing Flask, SQLAlchemy (the very popular and practically default ORM for Flask), and the Jinja2 template language.

Diving Into Flask - Presentation

A 66-slide presentation on Flask by Andrii V. Mishkovskyi from EuroPython 2012 that also discusses Flask-Cache, Flask-Celery, and blueprints for structuring larger applications.

Ensemble Learning Reading List

Tuesday's Data Science DC Meetup features GMU graduate student Jay Hyer's introduction to Ensemble Learning, a core set of Machine Learning techniques. Here are Jay's suggestions for readings and resources related to the topic. Attend the Meetup, and follow Jay on Twitter at @aDataHead! Also note that all images contain Amazon Affiliate links and will result in DC2 getting a small percentage of the proceeds should you purchase the book. Thanks for the support!

L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, Boca Raton, FL, 1984.

This book does not cover ensemble methods, but is the book that introduced classification and regression trees (CART), which is the basis of Random Forests. Classification trees are also the basis of the AdaBoost algorithm. CART methods are an important tool for a data scientist to have in their skill set.

L. Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.

This is the article that started it all.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd ed. Springer, New York, NY, 2009.

This book is light on application and heavy on theory. Nevertheless, chapters 10, 15 & 16 give very thorough coverage to boosting, Random Forests and ensemble learning, respectively. A free PDF version of the book is available on Tibshirani’s website.

G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, New York, NY, 2013.

As the name and co-authors imply, this is an introductory version of the previous book in this list. Chapter 8 covers bagging, Random Forests, and boosting.

Y. Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.

This is the article that introduced the AdaBoost algorithm.

G. Seni, and J. Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool Publishers, USA, 2010.

This is a good book with great illustrations and graphs. There is a lot of R code, too!

Z.H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 2012.

This is an excellent book that covers ensemble learning from A to Z and is well suited for anyone from an eager beginner to a critical expert.

Instructions for deploying an Elasticsearch Cluster with Titan

Elasticsearch is an open source distributed real-time search engine for the cloud. It allows you to deploy a scalable, auto-discovering cluster of nodes, and as search capacity grows, you simply need to add more nodes and the cluster will reorganize itself. Titan, a distributed graph engine by Aurelius, supports Elasticsearch as an option to index your vertices for fast lookup and retrieval. By default, Titan supports Elasticsearch running in the same JVM and storing data locally on the client, which is fine for embedded mode. However, once your Titan cluster starts growing, you have to respond by growing an Elasticsearch cluster side by side with the graph engine.

This tutorial shows how to quickly get an Elasticsearch cluster up and running on EC2 and then configure Titan to use it for indexing. It assumes you already have an EC2/Titan cluster deployed. Note that these instructions were written for a particular deployment, so please ask any questions about specifics in the comments!

Step 1: Installation

NOTE: These instructions assume you've installed Java 6 or later.

By far the best way to install Elasticsearch on an Ubuntu EC2 instance is the Debian package that is provided as a download. This package installs an init.d script, places the configuration files in /etc/elasticsearch, and generally creates goodness that we don't have to deal with. You can find the .deb on the Elasticsearch download page.

$ cd /tmp
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb.sha1.txt
$ sha1sum elasticsearch-0.90.7.deb && cat elasticsearch-0.90.7.deb.sha1.txt 

Note that you may have to use the --no-check-certificate flag, or you could use curl; just ensure that you use the correct filenames. Also ensure that the checksums match (and be even more paranoid and check the Elasticsearch website). Installation is simple:

$ sudo dpkg -i elasticsearch-0.90.7.deb

Elasticsearch will now be running on your machine with the default configuration. To check this you can do the following:

$ sudo status elasticsearch
$ ps -ef | grep elasticsearch

But while we configure it, it doesn't really need to be running:

$ sudo service elasticsearch stop

In particular, the package installation does the following things you should be aware of:

  1. Creates the elasticsearch:elasticsearch user and group
  2. Installs the library into /usr/share/elasticsearch
  3. Creates the logging directory at /var/log/elasticsearch
  4. Creates the configuration directory at /etc/elasticsearch
  5. Creates a data directory at /var/lib/elasticsearch
  6. Creates a temp work directory at /tmp/elasticsearch
  7. Creates an init script at /etc/init.d/elasticsearch
  8. Creates a defaults configuration file at /etc/default/elasticsearch

Because of our particular Titan deployment, this is not good enough for what we're trying to accomplish, so the next step is configuration.

Step 2: Configuration

The configuration we're looking for is an auto-discovering Elasticsearch cluster on EC2 that is bound to the default ports and works with data on the attached volume rather than on the much smaller root disk. In order to auto-discover on EC2 we have to install an AWS plugin, which can be found on the cloud-aws plugin GitHub page:

$ cd /usr/share/elasticsearch
$ bin/plugin -install elasticsearch/elasticsearch-cloud-aws/1.15.0

Elasticsearch is configured via a YAML file in /etc/elasticsearch/elasticsearch.yml so open up your editor, and use the configurations as we added them below:

path:
    conf: /etc/elasticsearch
    data: /raid0/elasticsearch
    work: /raid0/tmp/elasticsearch
    logs: /var/log/elasticsearch
cluster:
    name: DC2
cloud:
    aws:
        access_key: ${AWS_ACCESS_KEY_ID}
        secret_key: ${AWS_SECRET_ACCESS_KEY}
discovery:
    type: ec2

For us, the other defaults worked just fine. So let's go through this a bit. First off, make sure that all the paths exist and have the correct permissions. The raid0 folder is where we have mounted an EBS volume that contains enough non-ephemeral storage for our data services. Although this does add some network overhead, it prevents data loss when the instance is terminated. However, if you're not working with EBS or you've mounted it in a different location, using the root directory defaults is probably fine.

$ sudo mkdir /raid0/elasticsearch
$ sudo chown elasticsearch:elasticsearch /raid0/elasticsearch
$ sudo chmod 775 /raid0/elasticsearch
$ sudo mkdir -p /raid0/tmp/elasticsearch
$ sudo chmod 777 /raid0/tmp
$ sudo chown elasticsearch:elasticsearch /raid0/tmp/elasticsearch
$ sudo chmod 775 /raid0/tmp/elasticsearch

Editor's Note: I just discovered that you can actually set these options with the dpkg command so that you don't have to do it manually. See the "Elasticsearch as a service on Linux" guide for more.

The cluster name, in our case DC2, needs to be the same for every node in the cluster; this is especially vital on EC2, where the default name, elasticsearch, could make discovery more difficult. Also note that each node can be named separately, but by default a name is selected randomly from a list of 3,000 or so Marvel characters. The cloud and discovery options enable discovery through EC2.

You should now be able to run the cluster:

$ sudo service elasticsearch start

Check the logs to make sure there are no errors, and that the cluster is running. If so, you should be able to navigate to the following URL:

http://localhost:9200/_cluster/health?pretty=true

By replacing localhost with the hostname of a node, you can see the status of the cluster, as well as the number of nodes. But wait, why are no more nodes being added? Don't keep waiting! The reason is that Titan has probably already been configured to use a local Elasticsearch instance and is blocking port 9300, the communication and control port for the ES cluster.
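
If you would rather check the cluster from a script than a browser, a quick sketch with Python's requests library (point it at whichever node you like) looks like this:

import requests

# Replace localhost with the hostname of any node in the cluster.
health = requests.get("http://localhost:9200/_cluster/health").json()

print(health["cluster_name"])      # should be DC2 with the configuration above
print(health["status"])            # green / yellow / red
print(health["number_of_nodes"])   # should grow as nodes discover each other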

Configuring Titan

Titan is blocking the cluster with its own local Elasticsearch instance, and in any case, we want Titan to use the Elasticsearch cluster! Let's reconfigure Titan. First, open up your favorite editor and change the configuration of /opt/titan/config/yourgraph.properties to the following:

storage.backend=cassandra
storage.hostname=${LOCAL_IPADDR}

storage.index.search.backend=elasticsearch
storage.index.search.client-only=true
storage.index.search.hostname=${ES_ADDR},${ES_ADDR},${ES_ADDR}

Hopefully you won't have to change the storage.backend and storage.hostname configurations. Remove the storage.index.search.local-mode configuration as well as the storage.index.search.directory configuration, and add the configurations shown above.

For storage.index.search.hostname, add a comma separated list of every node in the ES cluster (for now).

That's it! Reload Titan, and you should soon see the cluster grow to include all the nodes you configured, as well as a speedup in queries to the Titan graph!

General Assembly & DC2 Scholarship

The DC2 mission statement emphasizes that "Data Community DC is an organization committed to connecting and promoting the work of data professionals..." Ultimately, we see DC2 becoming a hub for data scientists interested in exploring new material, advancing their skills, collaborating, starting a business with data, mentoring others, teaching classes, changing careers, and more. Education is clearly a large part of any of these interests, and while DC2 has held a few workshops and is sponsored by organizations like Statistics.com, we knew we could do more. So we partnered with General Assembly and created a GA & DC2 scholarship specifically for members of Data Community DC.

For our first scholarship we landed on Front End Web Development and User Experience, which we naturally announced first at Data Viz DC. How does this relate to data science? As I was happy to argue in our DC2 blog post rebutting Mr. Gelman, sometimes I would love to have a little sandbox where I get to play with algorithms all day. But then again, that is exactly what I ran away from in 2013 when I became an independent data science consultant; I don't want a business plan I'm not a part of dictating what I can play with. Enter web development and UX. As Harlan Harris, organizer of DSDC, shows in his Venn diagram of what makes a data scientist, which Tony Ojeda later emphasizes, programming is a natural and necessary part of being a data scientist. In other words, there's this thing called the interwebs that has more data than you can shake a stick at, and if you can't operate in that environment, then as a data scientist you're asking someone else to do that heavy lifting for you.

Over the next month we'll be choosing the winners of the GA DC2 Scholarship, and if you'd like to see any other scholarships in the future please leave your thoughts in the comments below or tweet us.

Happy Thanksgiving!

Python for Data Analysis: The Landscape of Tutorials

Python has long been one of the premier general-purpose scripting languages and a major web development language. Numerical analysis, data analysis, and scientific programming developed through the packages Numpy and Scipy, which, along with the visualization package Matplotlib, formed the basis for an open-source alternative to Matlab. Numpy provides array objects, cross-language integration, linear algebra, and other functionality. Scipy adds optimization, statistics, and basic image analysis capabilities. Matplotlib provides sophisticated 2-D and basic 3-D graphics capabilities with Matlab-like syntax.
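
As a tiny illustration of how these pieces fit together (a minimal sketch, not tied to any particular tutorial below), here is Numpy generating data and Matplotlib plotting it:

import numpy as np
import matplotlib.pyplot as plt

# Vectorized computation with Numpy: no explicit Python loop needed.
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.1 * np.random.randn(x.size)  # a noisy sine wave

plt.plot(x, y, label="noisy sin(x)")
plt.plot(x, np.sin(x), label="sin(x)")
plt.legend()
plt.show()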


Further recent development has resulted in a rather complete stack for data manipulation and analysis that includes Sympy for symbolic mathematics, pandas for data structures and analysis, and IPython as an enhanced console and HTML notebook that also facilitates parallel computation.
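
And as a quick taste of pandas (again a minimal sketch, with made-up numbers rather than a real dataset):

import pandas as pd

# A small invented dataset of Meetup attendance.
df = pd.DataFrame({
    "meetup": ["DSDC", "DSDC", "DVDC", "DVDC"],
    "attendees": [120, 150, 80, 95],
})

# Group-by and aggregation, the bread and butter of data analysis in pandas.
print(df.groupby("meetup")["attendees"].mean())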

An even richer data analysis ecosystem is quickly evolving in Python, led by Enthought and Continuum Analytics and several other independent and associated efforts. We have described this ecosystem here.

This week, as part of Data Community DC meetups, Peter Wang from Continuum Analytics is presenting on the PyData ecosystem at Statistical Programming DC, and Jonathan Street from NIH is presenting on Scientific Computing in Python at Data Science MD.

How do you get started??!!!

This ecosystem is evolving and exists today, but how do you get started using these tools? Fortunately there are several tutorials available both in video and as presentations that you can use. Hopefully this will put you on the path. This listing is of course incomplete, and may not include your favorite tool. Tell us about it in the comments!!

PyData Workshop 2012

The PyData Workshop 2012 was organized in NYC last October to bring together data scientists, scientists and engineers. It focused on "techniques and tools for management, analytics, and visualization of data of different types and sizes with particular emphasis on big data". It was primarily sponsored by Continuum Analytics.  The videos for this workshop are aggregated here.

PyData Silicon Valley

The follow-up PyData workshop was held alongside PyCon 2013 in Santa Clara, CA. The videos for the presentations are available here. The topics at the workshop included tutorials for pandas, matplotlib, PySpark (for cluster computing), scikit-learn, Wise.io, Disco (a MapReduce implementation), Naive Bayes, Nodebox, machine learning in Python, and IPython.

PyData Workshop 2013

The next PyData Workshop will be held in Cambridge, MA, July 27-28, 2013.

Tutorials for Particular Tools

Python for Data Analysis

  1. Getting started from Kaggle.com.

IPython

IPython notebooks have become the de facto standard for presenting Python analyses, as evidenced by the recent Scipy conference. There are several tutorials for learning IPython.

  1. The IPython tutorial
  2. Fernando Perez's talk on IPython (and video)
  3. PyCon 2012 tutorial
  4. Interesting IPython notebooks
  5. IPython notebook examples

Python Data Analysis Library (pandas)

  1. The 10-minute introduction to pandas
  2. The pandas cookbook
  3. 2012 PyData Workshop
  4. The pandas documentation
  5. Randal Olson's tutorial
  6. Wes McKinney's tutorials 1 and 2 on Kaggle.
  7. Hernan Rojas' tutorial
  8. Tutorials on financial data and time series using pandas

Scikit-learn

  1. 2012 PyData Workshop
  2. Official scikit-learn tutorial
  3. Jacob VanderPlas' tutorial
  4. PyCon 2013 tutorial on advanced machine learning with scikit-learn
  5. More scikit-learn tutorials.

Matplotlib

  1. Official tutorial
  2. N.P. Rougier's tutorial from EuroSciPy 2012
  3. Jake VanderPlas' tutorial from PyData NYC 2012
  4. John Hunter's Advanced Matplotlib Tutorial from PyData 2012
  5. A tutorial from Scigraph.

Sympy

  1. Official tutorial
  2. SciPy 2013 presentations

Numpy and Scipy

  1. The Guide to Numpy
  2. M. Scott Shell's Introduction to Numpy and Scipy

Databases from Python

  1. SQLite
  2. MySQL
  3. PostgreSQL

Books

  1. Python for Data Analysis by Wes McKinney
  2. Learning IPython for Interactive Computing and Data Visualization by Cyrille Rossant

If you have any additional suggestions, please leave them in the comments section below!

Editor's Note: The book images link out to Amazon, of which we are an affiliate. Thus, if you click the link and buy the book, we get a single-digit percentage cut of the purchase. So please, click and buy the books ;)