Methods

Announcing the Publication of Practical Data Science Cookbook

Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. 

Simulation and Predictive Analytics

This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary. A front-page article over the weekend in the Wall Street Journal indicated that the number one profession of interest to tech firms is the data scientist, someone who combines analytic, computing, and domain skills to detect signals in data and use them to advantage. Although the terms are squishy, the push today is for "big data" skills and "predictive analytics" skills, which allow firms to leverage the deluge of data that is now accessible.

I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.

The two traditional aspects of the scientific method, namely theory and experimentation, have been enhanced with computation being added as a third leg. Sitting at the center of computation is simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.

The questions that I will address here are how Monte Carlo simulation and discrete-event simulation differ and how they fit into the general framework of predictive analytics.

First, how do Monte Carlo and discrete-event simulation differ? Monte Carlo simulation is appropriate when the passage of time does not play a significant role. Probability calculations involving problems associated with playing cards, dice, and coins, for example, can be solved by Monte Carlo.
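
For concreteness, here is a minimal Monte Carlo sketch in Python (standard library only, with an illustrative number of trials) that estimates the probability of rolling at least one six in four throws of a fair die; the exact answer is 1 - (5/6)^4, or about 0.5177.

import random

def estimate_at_least_one_six(trials=100000, rolls=4):
    """Monte Carlo estimate of P(at least one six in `rolls` throws of a fair die)."""
    hits = 0
    for _ in range(trials):
        if any(random.randint(1, 6) == 6 for _ in range(rolls)):
            hits += 1
    return hits / trials

print(estimate_at_least_one_six())   # should hover near 0.5177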

Discrete-event simulation, on the other hand, has the passage of time as an integral part of the model. The classic application areas in which discrete-event simulation has been applied are queuing, inventory, and reliability. As an illustration, a mathematical model for a queue with a single server might consist of (a) a probability distribution for the time between arrivals to the queue, (b) a probability distribution for the service time at the queue, and (c) an algorithm for placing entities in the queue (first-come-first served is the usual default). Discrete-event simulation can be coded into any algorithmic language, although the coding is tedious. Because of the complexities of coding a discrete-event simulation, dozens of languages have been developed to ease implementation of a model. 
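
As a rough sketch of items (a) through (c), the following Python fragment simulates a first-come-first-served single-server queue with exponentially distributed interarrival and service times; the rates and customer count are illustrative assumptions, not values from any particular study.

import random

def simulate_single_server_queue(num_customers=10000, arrival_rate=0.9, service_rate=1.0, seed=42):
    """Discrete-event simulation of a first-come-first-served single-server queue.
    Returns the average time a customer spends waiting in line."""
    rng = random.Random(seed)
    arrival = 0.0            # arrival time of the current customer
    prev_departure = 0.0     # departure time of the previous customer
    total_wait = 0.0
    for _ in range(num_customers):
        arrival += rng.expovariate(arrival_rate)       # (a) time between arrivals
        service = rng.expovariate(service_rate)        # (b) service time
        start_service = max(arrival, prev_departure)   # (c) first-come-first-served
        total_wait += start_service - arrival          # time spent waiting in the queue
        prev_departure = start_service + service
    return total_wait / num_customers

print(simulate_single_server_queue())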

The field of predictive analytics leans heavily on the tools from data mining in order to identify patterns and trends in a data set. Once an appropriate question has been posed, these patterns and trends in explanatory variables (often called covariates) are used to predict future behavior of variables of interest. There is both an art and a science in predictive analytics. The science side includes the standard tools of mathematics, computation, probability, and statistics. The art side consists mainly of making appropriate assumptions about the mathematical model constructed for predicting future outcomes. Simulation is used primarily for verification and validation of the mathematical models associated with a predictive analytics model. It can be used to determine whether the probabilistic models are reasonable and appropriate for a particular problem.

Two sources for further training in simulation are a workshop in Catonsville, Maryland on September 12-13 by Barry Lawson (University of Richmond) and me or the Winter Simulation Conference (December 7-10, 2014) in Savannah.

Saving Money Using Data Science for Dummies

I'd like to tell you a story about how I made "data science" work for me without writing a single line of code, launching a single data analysis or visualization app, or even looking at a single digit of data.

"No, way!", you say? Way, my friends. Way.

Our company is developing VoteRaise.com, a new way for voters to fundraise for the candidates they'd like to have run for political office. We realized that, even here in DC, there is no active community of people exploring how innovations in technology & techniques impact political campaigning, so we decided to create one.

I started the process of creating RealPolitech, a new Meetup, last night. It's clear that Meetup makes use of data and, more importantly, thinks about how to use that data to help its users. For instance, when it came time to pick the topics that RealPolitech would cover, Meetup did a great job of making suggestions. All I had to do was type a few letters, and I was presented with a list of options to select.

I got the whole meetup configured quickly. I was impressed by the process. But, when it came time to make a payment, I went looking for a discount code. Couldn't find one in FoundersCard. Couldn't find it in Fosterly. Or anywhere else online. Seventy-two dollars for six months is already a good deal. But, still, we're a bootstrapped startup, and every dollar saved is a dollar earned. So, I decided to try something.

I just... stopped. I figured, if Meetup is gathering and providing data during the configuration process, they must be doing the same during checkout. So, I figured I'll give them a day or two to see if I receive an unexpected "special offer". Sure enough, by the time I woke up, I had a FIFTY PERCENT discount coupon, along with a short list of people who may be interested in RealPolitech, as incentive to pay the dues and launch the Meetup.

RealPolitech is up and running. You can find it here, come join and share your ideas and expertise! We are putting together our kickoff event, and lining up speakers and sponsors.

Oh, and I saved 50% off the dues by leveraging my knowledge of data science to get the outcome I wanted by doing... absolutely nothing.

Facility Location Analysis Resources Incorporating Travel Time

This is a guest blog post by Alan Briggs. Alan is an operations researcher and data scientist at Elder Research. Alan and Harlan Harris (DC2 President and Data Science DC co-organizer) have co-presented a project on location analysis and Meetup location optimization at the Statistical Programming DC Meetup and an INFORMS-MD chapter meeting. Recently, Harlan presented a version of this work at the New York Statistical Programming Meetup. There was some great feedback on the Meetup page asking for additional resources. This post by Alan is in response to that request.

If you’re looking for a good text resource to learn some of the basics about facility location, I highly recommend grabbing a chapter of Dr. Michael Kay’s e-book (pdf) available for free from his logistics engineering website. He gives an excellent overview of some of the basics of facility location, including single facility location, multi-facility location, facility location-allocation, etc. At ~20 pages, it’s entirely approachable, but technical enough to pique the interest of the more technically-minded analyst. For a deeper dive into some of the more advanced research in this space, I’d recommend using some of the subject headings in his book as seeds for a simple search on Google Scholar. It’s nothing super fancy, but there are plenty of articles in the public domain that relate to minisum/minimax optimization and all of their narrowly tailored implementations.

One of the really nice things about the data science community is that it is full of people with an inveterate dedication to critical thinking. There is nearly always some great [constructive] criticism about what didn’t quite work or what could have been a little bit better. One of the responses to the presentation recommended optimizing with respect to travel time instead of distance. Obviously, in this work, we’re using Euclidean distance as a proxy for time. Harlan cites laziness as the primary motivation for this shortcut, and I’ll certainly echo his response. However, a lot of modeling boils down to cutting your losses, getting a good enough solution or trying to find the balance among your own limited resources (e.g. time, data, technique, etc.). For the problem at hand, clearly the goal is to make Meetup attendance convenient for the greatest number of people; we want to get people to Meetups. But, for our purposes, a rough order of magnitude was sufficient. Harlan humorously points out that the true optimal location for a version of our original analysis in DC was an abandoned warehouse—if it really was an actual physical location at all. So, when you really just need a good solution, and the precision associated with true optimality is unappreciated, distance can be a pretty good proxy for time.

A lot of times good enough works, but there are some obvious limitations. In statistics, the law of large numbers holds that independent measurements of a random quantity tend toward the theoretical average of that quantity. For this reason, logistics problems can be more accurately estimated when they involve a large number of entities (e.g. travelers, shipments, etc.). For the problem at hand, if we were optimizing facility location for 1,000 or 10,000 respondents, again, using distance as a proxy for time, we would feel much better about the optimality of the location. I would add that similar to large quantities, introducing greater degrees of variability can also serve to improve performance. Thus, optimizing facility location across DC or New York may be a little too homogeneous. If instead, however, your analysis uses data across a larger, more heterogeneous area, like say an entire state where you have urban, suburban and rural areas, you will again get better performance on optimality.

Let’s say you’ve weighed the pros and cons and you really want to dive deeper into optimizing based on travel time. There are a couple of different options you can consider. First, applying a circuity factor to the Euclidean distance can account for non-direct travel between points. So, to go from point A to point B actually may take 1.2 units as opposed to the straight-line distance of 1 unit. That may not be as helpful for a really complex urban space where feasible routing is not always intuitive and travel times can vary wildly. However, it can give some really good approximations, again, over larger, more heterogeneous spaces. An extension to a singular circuity factor would be to introduce some gradient circuity factor that is proportional to population density. There are some really good zip code data available that can be used to estimate population.
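
A minimal sketch of the basic circuity-factor idea in Python, before adding the density gradient discussed next; the coordinates and the factor of 1.2 are purely illustrative:

import math

def euclidean_distance(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def travel_distance(a, b, circuity_factor=1.2):
    """Approximate road distance by inflating the straight-line distance
    with a circuity factor (1.2 is the illustrative value from the text)."""
    return circuity_factor * euclidean_distance(a, b)

point_a, point_b = (0.0, 0.0), (3.0, 4.0)    # hypothetical coordinates
print(travel_distance(point_a, point_b))     # 1.2 * 5.0 = 6.0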

Increasing the circuity factor in higher-density locales and decreasing it in lower-density ones can help by providing a more realistic assessment of how far off of the straight-line distance the average person would have to commute. For the really enthusiastic modeler who has some good data skills and is looking for even more integrity in their travel time analysis, there are a number of websites that provide road network information. They list roads across the United States by functional class (interstate, expressway & principal arterial, urban principal arterial, collector, etc.) and even provide shapefiles for GIS. I've done basic speed limit estimation by functional class, but you could also do something like introduce a speed gradient with respect to population density (as we mentioned above for the circuity factor). You could also derive some type of an inverse speed distribution that slows traffic at certain times of the day based on rush hour travel in or near urban centers.
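
Extending the sketch above, one hypothetical way to turn a circuity-adjusted distance into a travel-time estimate is to let both the circuity factor and the assumed speed vary with population density; every threshold and speed below is invented purely for illustration.

def circuity_for_density(pop_density):
    """Hypothetical gradient: denser areas get a larger circuity factor."""
    if pop_density > 5000:       # people per square mile, illustrative thresholds
        return 1.4
    elif pop_density > 1000:
        return 1.25
    return 1.1

def assumed_speed_for_density(pop_density):
    """Hypothetical average speeds (mph) that fall as density rises."""
    if pop_density > 5000:
        return 25.0
    elif pop_density > 1000:
        return 40.0
    return 55.0

def estimated_travel_time_hours(straight_line_miles, pop_density):
    """Travel time = circuity-adjusted distance divided by an assumed speed."""
    distance = straight_line_miles * circuity_for_density(pop_density)
    return distance / assumed_speed_for_density(pop_density)

print(estimated_travel_time_hours(10.0, pop_density=6000))    # dense urban example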

In the words of George E. P. Box, “all models are wrong, but some models are useful.” If you start with a basic location analysis and find it to be wrong but useful, you may have done enough. If not, however, perhaps you can make it useful by increasing complexity in one of the ways I have mentioned above.

Flask Mega Meta Tutorial for Data Scientists

Introduction

Data science isn't all statistical modeling, machine learning, and data frames. Eventually, your hard work pays off and you need to give back the data and the results of your analysis; those blinding insights that you and your team uncovered need to be operationalized in the final stage of the data science pipeline as a scalable data product. Fortunately for you, the web provides an equitable platform to do so and Python isn't all NumPy, Scipy, and Pandas. So, which Python web application framework should you jump into?


There are numerous web application frameworks for Python, with the 800-lb gorilla being Django, "the web framework for perfectionists with deadlines." Django has been around since 2005 and, as the community likes to say, is definitely "batteries included," meaning that Django comes with a large number of bundled components (i.e., decisions that have already been made for you). Django ships with its chosen object-relational mapper (ORM) to make database access simple, a beautiful admin interface, a template system, a cache system for improving site performance, internationalization support, and much more. As a result, Django has a lot of moving parts and can feel large, especially for beginners. This behind-the-scenes magic can obscure what actually happens and complicates matters when something goes wrong. From a learning perspective, this can make gaining a deeper understanding of web development more challenging.

Enter Flask, the micro web framework written in Python by Armin Ronacher. Purportedly, it came out as an April Fool's joke but proved popular enough not to go quietly into the night. The incredible thing about Flask is that you can have a web application running in about 10 lines of code and a few minutes of effort. Despite this, Flask can and does run real production sites.
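
As a minimal sketch of that claim (assuming Flask is installed, e.g. via pip install flask), the following is a complete, runnable application:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from Flask!"

if __name__ == "__main__":
    app.run(debug=True)   # development server only; use a real WSGI server in production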

As you need additional functionality, you can add it through the numerous extensions available. Need an admin interface? Try Flask-Admin. Need user session management and the ability for visitors to log in and out? Check out Flask-Login.

While Django advocates may argue that by the time you add all of the "needed" extensions to Flask, you get Django, others would argue that a micro-framework is the ideal platform to learn web development. Not only does the small size make it easier for you to get your head around it at first, but the fact that you need to add each component yourself requires you to understand fully the moving parts involved. Finally, and very relevant to data scientists, Flask is quite popular for building RESTful web services and creating APIs.
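
As a hedged illustration of that last point, here is a toy JSON endpoint of the kind a data scientist might expose; the score function is a stand-in for a real model, and the route name and payload format are made up:

from flask import Flask, jsonify, request

app = Flask(__name__)

def score(features):
    """Stand-in for a real model; returns a toy 'prediction'."""
    return sum(features) / max(len(features), 1)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)           # e.g. {"features": [1.0, 2.0, 3.0]}
    prediction = score(payload.get("features", []))
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(debug=True)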

As Flask documentation isn't quite as extensive as the numerous Django articles on the web, I wanted to assemble the materials that I have been looking at as I explore the world of Flask development.

For Beginners - Getting Started

The Official Flask Docs

The official Flask home page has a great set of documentation, list of extensions to add functionality, and even a tutorial to build a micro blogging site. Check it out here.

Blog post on Learning Flask

This tutorial is really a hidden gem designed for those who are new to web programming and Flask. It contains lots of small insights into the process via examples that really help connect the dots.

Full Stack Python

While not exclusively focused on Flask, Fullstackpython.com is a fantastic page that offers a great amount of information and links about the full web stack, from servers, operating systems, and databases to configuration management, source control, and web analytics, all from the perspective of Python. A must read through for those new to web development.

Starting A Python Project the Right Way

Jeff Knupp again walks the beginner Python developer through how he starts a new Python project.


For Intermediate Developers

The Flask Mega Tutorial

The Flask Mega Tutorial is aptly named as it has eighteen separate parts walking you through the numerous aspects of building a website with Flask and various extensions. I think this tutorial is pretty good but have heard some comment that it can be a bit much if you don't have a strong background in web development.

Miguel Grinberg's tutorial has been so popular that it is even being used as the basis of an O'Reilly book, Flask Web Development, that is due out on May 25, 2014. An early release is currently available, so called "Raw & Unedited," here.

A snippet about the book is below:

If you have Python experience, and want to learn advanced web development techniques with the Flask micro-framework, this practical book takes you step-by-step through the development of a real-world project. It’s an ideal way to learn Flask from the ground up without the steep learning curve. The author starts with installation and brings you to more complicated topics such as database migrations, caching, and complex database relationships.

Each chapter in this book focuses on a specific aspect of the project—first by exploring background on the topic itself, and then by walking you through a hands-on implementation. Once you understand the basics of Flask development, you can refer back to individual chapters to reinforce your grasp of the framework.

Instant Flask Web Development

Beating O'Reilly to the punch, Packt Publishing started offering the book, "Instant Flask Web Development" from Ron DuPlain in August 2013.

Flask - Blog: Building a Flask Blog: Part 1

This blog contains two different tutorials in separate posts. This one tackles the familiar task of building a blog using Flask-SQLAlchemy, WTForms, Flask-WTF, Flask-Migrate, WebHelpers, and PostgreSQL. The second one shows the creation of a music streaming app.

Flask with uWSGI + Nginx

This short tutorial and GitHub repository shows you how to set up a simple Flask app with uWSGI and the web server Nginx.

Python Web Applications with Flask

This extensive 3-part blog post from Real Python works its way through the development of a mid-sized web analytics application. The tutorial was last updated November 17, 2013, and has the added bonus of demonstrating the use of virtualenv and GitHub as well.

Python and Flask are Ridiculously Powerful

Jeff Knupp, author of Idiomatic Python, is fed up with online credit card payment processors, so he builds his own with Flask in a few hours.

More Advanced Content

The Blog of Erik Taubeneck

This is the blog of Erik Taubeneck, "a mathematician/economist/statistician by schooling, a data scientist by trade, and a python hacker by night." He is a Hacker School alum in NYC and a contributor to a number of Flask extensions (and an all-around good guy).

How to Structure Flask Applications

Here is a great post, recommended by Erik, discussing how one seasoned developer structures his Flask applications.

Modern Web Applications with Flask and Backbone.js - Presentation

Here is a presentation about building a "modern web application" using Flask for the backend and Backbone.js in the browser. Slide 2 shows a great timeline of the last 20 years in the history of web applications and slide 42 is a great overview of the distribution of Flask extensions in the backend and JavaScript in the frontend.

Modern Web Application Framework: Python, SQL Alchemy, Jinja2 & Flask - Presentation

This 68-slide presentation on Flask from Devert Alexandre is an excellent resource and tutorial discussing Flask, SQLAlchemy (the very popular and practically default Flask ORM), and the Jinja2 template language.

Diving Into Flask - Presentation

This 66-slide presentation on Flask by Andrii V. Mishkovskyi from EuroPython 2012 also discusses Flask-Cache, Flask-Celery, and blueprints for structuring larger applications.

A Tutorial for Deploying a Django Application that Uses Numpy and Scipy to Google Compute Engine Using Apache2 and modwsgi

by Sean Patrick Murphy

Introduction

This longer-than-initially planned article walks one through the process of deploying a non-standard Django application on a virtual instance provisioned not from Amazon Web Services but from Google Compute Engine. This means we will be creating our own virtual machine in the cloud and installing all necessary software to have it serve content, run the Django application, and handle the database all in one. Clearly, I do not expect an overwhelming amount of traffic to this site. Also, note that Google Compute Engine is very different from Google App Engine.

What makes this app "non-standard" is its use of both the Numpy and Scipy packages to perform fast computations. Numpy and Scipy are based on C and Fortran respectively and both have complicated compilation dependencies. Binaries may be available in some cases but are not always available for your preferred deployment environment. Most importantly, these two libraries prevented me from deploying my app to either Google App Engine (GAE) or to Heroku. I'm not saying that it is impossible to deploy Numpy- or Scipy-dependent apps on either service. However, neither service supports apps dependent on both Scipy and Numpy out-of-the-box although a limited amount of Googling suggests it should be possible.

In fact, GAE could have been an ideal solution if I had re-architected the app, separating the Django application from the computational code. I could have run the Django application on GAE and allowed it to spin up a GCE instance as needed to perform the computations. One concern with this idea is the latency involved in spinning up the virtual instance for computation. Google Compute Engine instances spring to life quickly but not instantaneously. Maybe I'll go down this path for version 2.0 if there is a need.

Just in case you are wondering, the Django app in question is here https://github.com/murphsp1/ppi-css.com and the live site is here www.ppi-css.com.

If you have any questions or comments or suggestions, please leave them in the comments section below.

Google Compute Engine (GCE)

I am a giant fan of Google Compute Engine and love the fact that Amazon's EC2 finally has a strong competitor. With that said, GCE definitely does not have the same number of tutorials or help content available online.

I will assume that you can provision your own instance in GCE either using gcutil at the command line or through the cloud services web interface provided by Google.

Once you have your project up and running, you will need to configure the firewall settings for your project. You can do this at the command line of your local machine using the command line below:

gcutil addfirewall http2 --description="Incoming http allowed." --allowed="tcp:http" --project="XXXXXXXXXXXXXX"

Update the Instance and Install Tools

Next, boot the instance and ssh into it from your local machine. The command line parameters required to ssh in can be daunting but fortunately Google gives you a simple way to copy and paste the command from the web-based cloud console. The command line could look something like this:

gcutil --service_version="v1beta16" --project="XXXXXXXXXX" ssh  --zone="GEOGRAPHIC_ZONE" "INSTANCE_NAME"

Next, we need to update the default packages installed on the GCE instance:

sudo apt-get update
sudo apt-get upgrade

and install some needed development tools:

sudo apt-get --yes install make
sudo apt-get --yes install wget
sudo apt-get --yes install git

and install some basic Python-related tools:

sudo apt-get --yes install python-setuptools
sudo easy_install pip
sudo pip install virtualenv

Note that in many of my sudo apt-get commands I include --yes. This flag just prevents me from having to type "Y" to agree to the file download.

Install Numpy and Scipy (SciPy requires Fortran compiler)

To install SciPy, Python's general purpose scientific computing library from which my app needs a single function, we need the Fortran compiler:

sudo apt-get --yes install gfortran

and then we need Numpy and Scipy and everything else:

sudo apt-get --yes install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

Finally, we need to add ProDy, a protein dynamics and sequence analysis package for Python.

sudo pip install prody

Install and Configure the Database (MySQL)

The Django application needs a database and there are many to choose from, most likely either Postgres or MySQL. Here, I went with MySQL for the simple reason that it took fewer steps to get the MySQL server up and running on the GCE instance than the Postgres server did. I actually run Postgres on my development machine.

sudo apt-get --yes install mysql-server

The installation process should prompt you to create a root password. Please do so for security purposes.

Next, we are going to execute a script to secure the MySQL installation:

mysql_secure_installation

You already have a root password from the installation process but otherwise answer "Y" to every question.

With the DB installed, we now need to create our database for Django (mine is creatively called django_test). Please note that there must not be a space between "--password=" and your password on the command line.

mysql --user=root --password=INSERT_YOUR_PASSWORD
mysql> create database django_test;
mysql> quit;

Finally for this step we need the MySQL database connector for Python which will be used by our Django app:

sudo apt-get install python-mysqldb

Install the Web Server (Apache2)

You have two main choices for your web server, either the tried and true Apache (now up to version 2+) or nginx. Nginx is supposed to be the new sexy when it comes to web servers but this newness comes at the price of less documentation/tutorials online. Thus, let's play it safe and go with Apache2.

First Attempt

First things first, we need to install apache2 and mod_wsgi. Mod_wsgi is an Apache HTTP server module that provides a WSGI compliant interface for web applications developed in Python.

sudo apt-get --yes install apache2 libapache2-mod-wsgi

This seems to be causing a good number of problems. In my Django error logs I see:

[Mon Nov 25 13:25:27 2013] [error] [client 108.21.2.20] Premature end of script headers: wsgi.py

and in:

cat /var/log/apache2/error.log

I see things like:

[Sun Nov 24 16:15:02 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Sun Nov 24 16:15:02 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.

with the occasional segfault:

[Mon Nov 25 00:02:55 2013] [notice] child pid 12532 exit signal Segmentation fault (11)
[Mon Nov 25 00:02:55 2013] [notice] seg fault or similar nasty error detected in the parent process

which is a strong indicator that something isn't quite working correctly.

Second Attempt

A little bit of Googling suggests that this could be the result of a number of issues with a prebuilt mod_wsgi. The solution seems to be to grab the source code and compile it on my GCE instance. To do that, I:

sudo apt-get install apache2-prefork-dev

Now, we need to grab mod_wsgi while ssh'ed into the GCE instance:

wget https://modwsgi.googlecode.com/files/mod_wsgi-3.4.tar.gz
tar -zxvf mod_wsgi-3.4.tar.gz
cd mod_wsgi-3.4
./configure
make
sudo make install

Once mod_wsgi is installed, the apache server needs to be told about it. On Apache 2, this is done by adding the load declaration and any configuration directives to the /etc/apache2/mods-available/ directory.

The load declaration for the module needs to go on a file named wsgi.load (in the /etc/apache2/mods-available/ directory), which contains only this:

LoadModule wsgi_module /usr/lib/apache2/modules/mod_wsgi.so

Then you have to activate the wsgi module with:

a2enmod wsgi

Note: a2enmod stands for "apache2 enable mod"; this executable creates the symlinks for you. Actually, a2enmod wsgi is equivalent to:

cd /etc/apache2/mods-enabled
ln -s ../mods-available/wsgi.load
ln -s ../mods-available/wsgi.conf # if it exists

Now we need to update the virtual hosts settings on the server. For Debian, this is here:

/etc/apache2/sites-enabled/000-default 

Restart the service:

sudo service apache2 restart 

and also change the owner of the directory on the GCE instance that will contain the files to be served by apache:

sudo chown -R www-data:www-data /var/www

Now that we have gone through all of that, it is nice to see things working. By default, the following page is served by the install:

/usr/share/apache2/default-site/index.html

If you go to the URL of the server (obtainable from the Google Cloud console), you should see a very simple example html page.

Setup the Overall Django Directory Structure on the Remote Server

I have seen many conflicting recommendations in online tutorials about how to best lay out the directory structure of a Django application in development. It would appear that after you have built your first dozen or so Django projects, you start formulating your own opinions and create a standard project structure for yourself.

Obviously, this experiential knowledge is not available to someone building and deploying one of their first sites. And, your directory structure directly impacts your app's routing and the daunting-at-first settings.py file. If you move around a few directories, things tend to stop working and the resulting error messages aren't necessarily the most helpful.

The picture gets even murkier when you go from development to production and I have found much less discussion on best practices here. Luckily, I could ping my friend Ben Bengfort and tap into his devops knowledge. The directory structure on the remote server, as recommended by Mr. Bengfort, looks like this:

/var/www/ppi-css.com
/var/www/ppi-css.com/htdocs/static
/var/www/ppi-css.com/htdocs/media
/var/www/ppi-css.com/django
/var/www/ppi-css.com/logs

Apache will see the htdocs directory as the main directory from which to serve files.

/static will contain the collected set of static files (images, css, javascript, and more) and media will contain uploaded documents.

/logs will contain relevant apache log files.

/django will contain the cloned copy of the Django project from GitHub.

The following shell commands get things setup correctly:

sudo mkdir /var/www/ppi-css.com
sudo mkdir /var/www/ppi-css.com/htdocs
sudo mkdir /var/www/ppi-css.com/htdocs/static
sudo mkdir /var/www/ppi-css.com/htdocs/media
sudo mkdir /var/www/ppi-css.com/django
sudo mkdir /var/www/ppi-css.com/logs
cd /var/www/ppi-css.com/django
sudo git clone https://github.com/murphsp1/ppi-css.com.git

Configuring Apache for Our Django Project

With the directory structure of our Django application sorted, let's continue configuring apache.

First, let's disable the default virtual host for apache:

sudo a2dissite default

There will be aliases in the virtual host configuration file that let the apache server know about this structure. Fortunately, I have included the ppi-css.conf file in the repository and it must be moved into position:

sudo cp /var/www/ppi-css.com/django/ppi-css.com/ppi-css.conf /etc/apache2/sites-available/ppi-css.com

Next, we must enable the site using the following command:

sudo a2ensite ppi-css.com

and we must reload the apache2 service (remember this command, as you will probably be using it a lot):

sudo service apache2 reload

Now, when I restarted or reloaded the apache2 service, I got the following error message:

ERROR MESSAGES:
[....] Restarting web server: apache2apache2: apr_sockaddr_info_get() failed for (none)
apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
 ... waiting apache2: apr_sockaddr_info_get() failed for (none)
apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
. ok 
Setting up ssl-cert (1.0.32) ...
hostname: Name or service not known

To remove this, I simply added the following line:

ServerName localhost

to the /etc/apache2/apache2.conf file using vi. A quick

sudo service apache2 reload

shows that the error message has been banished.

Install a Few More Python Packages

The Django application contains a few more dependencies that were captured in the requirements file included in the repository. Note that since the installation of Numpy and Scipy has already been taken care of, those lines in the requirements.txt file can be removed.

sudo pip install -r /var/www/ppi-css.com/django/ppi-css.com/requirements.txt

Database Migrations

Before we can perform the needed database migrations, we need to update the database section of settings.py. It should look like below:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'django_test',
        'USER': 'root',
        'PASSWORD': 'INSERT_YOUR_PASSWORD',
        'HOST': '',         # Set to empty string for localhost.
        'PORT': '',         # Set to empty string for default.
    }
}

From the GCE instance, issue the following commands:

python /var/www/ppi-css.com/django/ppi-css.com/manage.py syncdb
python /var/www/ppi-css.com/django/ppi-css.com/manage.py migrate

Deploying Your Static Files

Static files, your css, javascript, images, and other unchanging files, can be problematic for new Django developers. When developing, Django is more than happy to serve your static files for you via its local development server. However, this does not work for production settings.

The key to this is your settings.py file. In this file, we see:

# Absolute path to the directory static files should be collected to.
# Don't put anything in this directory yourself; store your static files
# in apps' "static/" subdirectories and in STATICFILES_DIRS.
# Example: "/home/media/media.lawrence.com/static/"
STATIC_ROOT = ''      #os.path.join(PROJECT_ROOT, 'static')

# URL prefix for static files.
# Example: "http://media.lawrence.com/static/"
STATIC_URL = '/static/'

# Additional locations of static files
STATICFILES_DIRS = (
    #os.path.join(PROJECT_ROOT,'static'),
    # Put strings here, like "/home/html/static" or "C:/www/django/static".
    # Always use forward slashes, even on Windows.
    # Don't forget to use absolute paths, not relative paths.
)      

For production, STATIC_ROOT must contain the directory where Apache2 will serve static content from. In this case, it should look like this:

STATIC_ROOT = '/var/www/ppi-css.com/htdocs/static'

For development, STATIC_ROOT looked like:

STATIC_ROOT = ''

Next, Django comes with a handy mechanism to round up all of your static files (in the case that they are spread out in separate app directories if you have a number of apps in a single project) and push them to a single parent directory when you go into production.

./manage.py collectstatic

Be very careful when going into production. If any of the directories listed in the STATICFILES_DIRS variable do not exist on your production server, collectstatic will fail and will not do so gracefully. The official Django documentation has a pretty good description of the entire process.

More Settings.py Updates

We aren't quite done with the settings.py file and need to update the MEDIA_ROOT variable with the appropriate directory on the server:

MEDIA_ROOT = "/var/www/ppi-css.com/htdocs/media/" #os.path.join(PROJECT_ROOT, 'media')

Next, the ALLOWED_HOSTS variable must be set as shown below when the Django application is run in production mode and not in debug mode:

ALLOWED_HOSTS = [
        '.ppi-css.com', 
        'ppi-css.com',
]

And finally, check to make sure that the paths listed in wsgi.py reflect the actual paths on the GCE instance.
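
For reference, a typical mod_wsgi-friendly wsgi.py looks roughly like the sketch below; this is not the project's actual file, and the path and settings module name are placeholders based on the directory layout above, so adjust them to match your own project.

# wsgi.py -- generic sketch; adjust paths and the settings module to your project
import os
import sys

# Make sure the cloned project directory is importable on the GCE instance.
sys.path.append('/var/www/ppi-css.com/django/ppi-css.com')

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')   # placeholder module name

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()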

A Very Nasty Bug

After having gone through all of that work, I found a strange bug where the website would work fine but then become unresponsive. After extensive Googling, I found the error, best explained below:

Some third-party packages for Python which use C extension modules, and this includes scipy and numpy, will only work in the Python main interpreter and cannot be used in the sub-interpreters that mod_wsgi uses by default. The result can be thread deadlock, incorrect behaviour, or process crashes. This is detailed here.

The workaround is to force the WSGI application to run in the main interpreter of the process using:

WSGIApplicationGroup %{GLOBAL}

If you are running multiple WSGI applications on the same server, you will want to start investigating daemon mode, because some frameworks don't allow multiple instances to run in the same interpreter. This is the case with Django. Thus, use daemon mode so that each application runs in its own process, and force each to run in the main interpreter of its respective daemon-mode process group.

The ppi-css.conf file with the required changes is now part of the repository.

Some Debugging Hints

Inevitably, things won't work on your remote server. Obviously leaving your application in Debug mode is ok for only the briefest time while you are trying to deploy but there are other things to check as well.

Is the web server running?

sudo service apache2 status

If it isn't or you need to restart the server:

sudo service apache2 restart

What do the apache error logs say?

cat /var/log/apache2/error.log
cat /var/www/ppi-css.com/logs/error.log 

Also, it is never a bad idea to log into MySQL and take a look at the django_test database.

Virtual Environment - Where Did It Go?

If you noticed, I did have a requirements.txt file in my project. When I started doing local development on my trusty MacBook Air, I used virtualenv, an amazing tool. However, I had some difficulties getting Numpy and Scipy properly compiled and included in the virtualenv on my server whereas it was pretty simple to get them up and running in the system's default Python installation. Conversing with some of my more Django-experienced friends, they reassured me that while this wasn't a best practice, it wasn't a mortal sin either.

Getting to Know Git and GitHub

Git or another code versioning tool is a fact of life for any developer. While the learning curve for the novice may be steep (or vertical), it is essential to climb this mountain as quickly as possible.

As powerful as Git can be, I found myself using only a few commands on this small project.

First, I used git add with several different flags to stage files before committing. To stage all new and modified files (but not deleted files), use:

git add .

To stage all modified and deleted files (but not new files), use:

git add -u

Or, if you want to be lazy and want to stage everything every time (new, modified, and deleted files), use:

git add -A

Next, the staged files must be committed and then pushed to GitHub.

git commit -m "insert great string for documentation here"
git push -u origin master

Commands in Local Development Environment

While Django isn't the most lightweight web framework in Python (hello Flask and others), "launching" the site in the local development environment is pretty simple. Compare the command line commands needed below to the rest of the blog. (Note that I am running OS X 10.9 Mavericks on a MacBook Air with 8 GB of 1600 MHz DDR3.)

First, start the local postgres server:

postgres -D /usr/local/var/postgres 

Next, start the local development web server using django-command-extensions, which enables debugging of the site in the browser.

python manage.py runserver_plus   

Once a model has changed, we need to make a migration using South and then apply it with the two commands below:

./manage.py schemamigration DJANGO_APP_NAME --auto
./manage.py migrate DJANGO_APP_NAME   

References

There are a ton of different tutorials out there to help you with all aspects of deployment. Of course, piecing together the relevant parts may take some time, and this tutorial was assembled from many different sources.

November Data Science DC Event Review: Identifying Smugglers: Local Outlier Detection in Big Geospatial Data

This is a guest post from Data Science DC member and quantitative political scientist David J. Elkind.

At the November Data Science DC Meetup, Nathan Danneman, an Emory University PhD and analytics engineer at Data Tactics, presented an approach to detecting unusual units within a geospatial data set. For me, the most enjoyable feature of Dr. Danneman’s talk was his engaging presentation. I suspect that other data consultants have also spent quite some time reading statistical articles and lost quite a few hours attempting to trace back the authors’ incoherent prose. Nathan approached his talk in a way that placed a minimal quantitative demand on the audience, instead focusing on the three essential components of his analysis: his analytical task, the outline of his approach, and the presentation of his findings. I’ll address each of these in turn.

Analytical Task

Nathan was presented with the problem of locating maritime vessels in the Strait of Hormuz engaged in smuggling activities: sanctions against Iran have made it very difficult for Iran to engage in international commerce, so improving detection of smugglers crossing the Strait from Iran to Qatar and the United Arab Emirates would improve the effectiveness of the sanctions regime and increase pressure on the regime. (I’ve written about issues related to Iranian sanctions for CSIS’s Project on Nuclear Issues Blog.)

Having collected publicly accessible satellite positioning data of maritime vessels, Nathan had four fields for each craft at several time intervals within some period: speed, heading, latitude and longitude.

But what do smugglers look like? Unfortunately, Nathan’s data set did not itself include any examples of watercraft which had been unambiguously identified by, e.g., the US Navy, as smugglers, so he could not rely on historical examples of smuggling as a starting point for his analysis. Instead, he had to puzzle out how to leverage information about a craft’s spatial location.

I’ve encountered a few applied quantitative researchers who, when faced with a lack of historical examples, would be entirely stymied in their progress, declaring the problem too hard. Instead of throwing up his hands, Dr. Danneman dug into the topic of maritime smuggling and found that many smuggling scenarios involve ship-to-ship transfers of contraband which take place outside of ordinary shipping lanes. This qualitatively-informed understanding transforms the project from mere speculation about what smugglers might look like into the problem of discovering maritime vessels which deviate too far from ordinary traffic patterns.

Importantly, framing the research in this way entirely premises the validity of inferences on the notion that unusual ships are smugglers and smugglers are unusual ships. But in reality, there are many reasons that ships might not conform to ordinary traffic patterns – for example, pleasure craft and fishing boats might have irregular movement patterns that don’t coincide with shipping lanes, and so look similar to the hypothesized smugglers.

Outline of Approach

The basic approach can be split into three sections: partitioning the strait into many grids, generating fake boats to compare against the real boats, and then training a logistic regression to use the four data fields (speed, heading, latitude, and longitude) to differentiate the real boats from the fake ones.

Partitioning the strait into grids helps emphasize the local character of ships’ movements in that region. For example, a grid square partially containing a shipping channel will have many ships located in the channel and on headings taking them along that channel. Fake boats, generated from a bivariate uniform distribution within the grid square, will tend not to fall in the path of ordinary traffic, just like the hypothesized behavior of smugglers. The same goes for the uniformly-distributed timestamps and otherwise randomly-assigned boat attributes for the comparison sets: these will all tend to stand apart from ordinary traffic. Therefore, training a model to differentiate between these two classes of behaviors will advance the goal of differentiating between smugglers and ordinary traffic.

Dr. Danneman described this procedure as unsupervised-as-supervised learning – a novel term for me, so forgive me if I’m loose with the particulars – but in this case it refers to the notion that there are two classes of data points, one i.i.d. from some unknown density and another simulated via Monte Carlo methods from some known density.  Pooling both samples gives one a mixture of the two densities; this problem then becomes one of comparing the relative densities of the two classes of data points – that is, this problem is actually a restatement of the problem of logistic regression! Additional details can be found in Elements of Statistical Learning (2nd edition, section 14.2.4, p. 495).
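
To make the unsupervised-as-supervised idea concrete, here is a stripped-down sketch that ignores the grid partitioning; it assumes NumPy and scikit-learn, labels real observations 1 and uniformly generated fakes 0, and uses a logistic regression to score how "real" each observation looks. The column layout (speed, heading, latitude, longitude) is only a convention for the example.

import numpy as np
from sklearn.linear_model import LogisticRegression

def outlier_scores(real_points, seed=0):
    """real_points: array of shape (n, d), e.g. columns for speed, heading, lat, lon.
    Returns the fitted probability that each real point is 'real'; low values flag
    observations that resemble the uniform background noise."""
    rng = np.random.default_rng(seed)
    n, d = real_points.shape
    lo, hi = real_points.min(axis=0), real_points.max(axis=0)
    fake_points = rng.uniform(lo, hi, size=(n, d))      # one fake point per real point

    X = np.vstack([real_points, fake_points])
    y = np.concatenate([np.ones(n), np.zeros(n)])       # 1 = real, 0 = simulated

    model = LogisticRegression(max_iter=1000).fit(X, y)
    return model.predict_proba(real_points)[:, 1]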

Presentation of Findings

After fitting the model, we can examine which of the real boats the model rated as having low odds of being real – that is, boats which looked so similar to the randomly-generated boats that the model had difficulty differentiating the two. These are the boats that we might call “outliers,” and, given the premise that ship-to-ship smuggling likely takes place aboard boats with unusual behavior, are more likely to be engaged in smuggling.

I will repeat here a slight criticism that I noted elsewhere and point out that the model output cannot be interpreted as a true probability, contrary to the results displayed in slide 39. In this research design, Dr. Danneman did not randomly sample from the population of all shipping traffic in the Strait of Hormuz to assemble a collection of smuggling craft and ordinary traffic in proportion roughly equal to their occurrence in nature. Rather, he generated one fake boat for each real boat. This is a case-control research design, so the intercept term of the logistic regression model is fixed to reflect the ratio of positive cases to negative cases in the data set. All of the terms in the model, including the intercept, are still MLE estimators, and all of the non-intercept terms are perfectly valid for comparing the odds of an observation being in one class or another. But to establish probabilities, one would have to adjust the intercept term using knowledge of the overall ratio of positives to negatives in the population.
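
For readers who want the mechanics, the standard prior-correction for case-control logistic regression shifts the model's log-odds by the log of the sampling ratio. The sketch below is the generic textbook correction, not code from the talk; here the sample positive rate would be 0.5 (one fake boat per real boat), while the population rate would have to come from outside knowledge.

import numpy as np

def prior_corrected_probability(model_probability, sample_rate=0.5, population_rate=0.01):
    """Convert a case-control logistic regression score into an approximate true probability.
    sample_rate: fraction of positives in the training data (0.5 in the 1:1 design above).
    population_rate: assumed true fraction of positives; the 0.01 default is purely illustrative."""
    log_odds = np.log(model_probability / (1.0 - model_probability))
    offset = np.log((sample_rate / (1.0 - sample_rate)) *
                    ((1.0 - population_rate) / population_rate))
    return 1.0 / (1.0 + np.exp(-(log_odds - offset)))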

In the question-and-answer session, some in the audience pushed back against the limited data set, noting that one could improve the results by incorporating other information specific to each of the ships (its flag, its shipping line, the type of craft, or other pieces of information). First, I believe that all applications would leverage this information – were it available – and model it appropriately; however, as befits a pedagogical talk on geospatial outlier detection, this talk focused on leveraging geospatial data for outlier detection.

Second, it should be intuitive that including more information in a model might improve the results: the more we know about the boats, the more we can differentiate between them. Collecting more data is, perhaps, the lowest-hanging fruit of model improvement. I think it’s more worthwhile to note that Nathan’s highly parsimonious model achieved very clean separation between fake and real boats despite the limited amount of information collected for each boat-time unit.

The presentation and code may be found on Nathan Danneman's web site. The audio for the presentation is also available.

Summer research project overview: “Supporting Social Data Research with NoSQL Databases: Comparison of HBase and Riak”

This post is a guest post by an aspiring young member of the Data Community DC who also happens to be looking for an internship. Please see his contact info at the bottom of the post if interested and also check out his blog, http://hex00sells.com/.

This summer, I was accepted into an NSF-funded (National Science Foundation), two-month research project at Indiana University in Bloomington. Sponsored by the School of Informatics & Computing (SOIC), the program is designed to expose undergraduate students majoring in STEM fields to research taking place at the graduate level, in the hopes that they will elect to continue their education past a BS level. These programs are also often called REUs (Research Experiences for Undergraduates) and are offered nationwide at various times throughout the year.

Seeing as how I’ve been interested in Big Data/Data Science for some time now, and have wanted to learn more about the field, being accepted to this program was a dream come true! I could only hope that it would deliver on everything it promised, which it definitely did.  The program was designed to immerse students as much as possible in normal IU life—for two months, it was as if we were enrolled there.  We lived in IU on-campus dorms, had IU student IDs, access to all facilities, and meal cards that we could use at the dining halls.  The program coordinators also planned a great series of events designed to expose us to campus facilities, resources, and local attractions.  We also were able to tour the IU Data Center, which houses Big Red II (one of the two fastest university-owned supercomputers in the US).


But on to the project! Entitled "Supporting Social Data Research with NoSQL Databases: Comparison of HBase and Riak," its aim was to compare the performance of two open-source, NoSQL databases: Apache's Hadoop/HBase and Basho's Riak platforms. We wished to compare them using a subset of the dataset of the Truthy Project—a project which gathered approximately 10% of all Twitter traffic (using the Twitter Streaming API) over a two-year period, in order to create a repository (i.e., gigantic data set) which social research scientists could conduct research with. Several interesting research papers have come out of the Truthy Project, such as papers investigating the flow and dissemination of information during Occupy Wall Street and the Arab Spring uprisings. In addition, the project researchers have also made very interesting images that portray information diffusion through users' Twitter networks:

[Image: information diffusion through a user's Twitter network]

Our project’s goal (link to project page) was to first set up and then configure both Hadoop/HBase and Riak, in keeping with the data set we were working with and the types of queries we’d be performing.  I came in on the second half of the project—the first half, setting up and configuring Hadoop/HBase, and then loading data onto the nodes and running queries on it, had already been completed, so my portion dealt nearly exclusively with the Riak side of things. We had reserved eight nodes on IU’s FutureGrid cluster, a shared academic cloud computing resource that researchers can use for their projects.  Our nodes were running CentOS (an enterprise Linux distro).  Some of the configuration details included creating schemas for our data and managing the data-loading onto our nodes.

Next, we went on to perform some of the same search queries on our two databases that the Truthy Project researchers performed. These queries might be for a particular Twitter meme over various periods of time (a few hours, a day, five days, 10 days, etc.), the number of posts a given user made over a certain period of time, or the number of user mentions a user received, just to name a few example queries. Due to the size of the data set (one month's worth of tweets—350 GB), these queries might easily return results on the order of tens of thousands to hundreds of thousands of hits! As these are open-source platforms, there are a plethora of configuration options and possibilities, so part of our job was to optimize each platform to deliver the fastest results for the types of queries we were executing. For Riak, there are multiple ways to execute searches, some of which involved using MapReduce, so that added another layer of complexity—as we had to learn how Riak executed MapReduce jobs and then optimize for them accordingly.

Our results were very interesting, and seemed to highlight some of the differences between the two platforms (at least at the time of this project). We found Riak to be faster for queries returning < 100k results over shorter time periods, but Hadoop/HBase was faster for queries returning > 100k results over longer lengths of time. For Riak, some of this difference seemed to be attributed to how it handled returning results—they seemed to be gathered from all the nodes unsorted/ungrouped and streamed through just one reducer on one node exclusively, thus possibly creating the "bottlenecking" that appeared once result sets reached a certain size. Hadoop/HBase seemed to provide a more robust implementation of MapReduce, allowing for much more individual MapReduce node configuration and a design that seems to scale very well for large data sets.

Below is a graph which illustrates the results of one meme search conducted across different time ranges on both Riak and HBase:

[Histogram: query times for the "#ff" meme search on Riak and HBase]

This histogram details the results for all tweets containing the meme "#ff" over periods of five, 10, 15, and 20 days. The number of results returned is next to each search period (for example, the search over five days for "#ff" on Riak returned 353,468 tweets containing the search term in 126 seconds). It is clear that, in this case, Riak is faster over a shorter period of time, but as both the search periods and result sizes increase, HBase scales to be far faster, exhibiting a much smaller, linear rate of growth than Riak. If anyone would like to view my research poster for this project, here is a link.

In conclusion, this was a wonderful experience and only served to whet my appetite for future learning. Just some of the languages, technologies, and skills I learned or was exposed to were Python, JavaScript, shell scripting, MapReduce, NoSQL databases, and cloud computing. I leave you with a shot of the beautiful campus:

[Photo: the Indiana University Bloomington campus]

Evan Roth is pursuing a Computer Science degree full-time at the University of the District of Columbia in Washington, DC.  He also works part-time for 3Pillar Global in Fairfax, VA as an iOS Developer. His LinkedIn profile is here.

What is a Hackathon?

Most people have probably heard of the term hackathon in today's technology landscape. But just because you have heard of it doesn't necessarily mean you know what it is. Wikipedia defines a hackathon as an event in which computer programmers and others involved in software development collaborate intensively on software projects. They can last for just a few hours or go as long as a week, and they come in all shapes and sizes. There are national hackathons, state-level hackathons, internal corporate hackathons, open data-focused hackathons, and hackathons meant to flesh out APIs.

Hackathons can also operate on local levels. A couple of examples of local-level hackathons are the Baltimore Neighborhood Indicators Alliance-Jacob France Institute (BNIA-JFI) hackathon, which just completed its event in mid July, and the Jersey Shore Comeback-a-thon hackathon this past April hosted by Marathon Data Systems.

The BNIA-JFI hackathon was paired with a Big Data Day event and lasted for 8 hours one Friday. The goal of this hackathon was to come up with solutions that help out two local Baltimore neighborhoods: Old Goucher and Pigtown. Each neighborhood had a very clear goal established at the beginning of the hackathon, but the overall aim was simple: help improve the local community. The data was provided to participants by the hackathon sponsors.

The Jersey Shore Comeback-a-thon hackathon lasted 24 hours straight, taking some participants back to their college days of pulling all-nighters. This hackathon also differed from the Baltimore hackathon in that it did not seem to provide data but instead focused on the pure technology of an application and how it can be used to achieve the goal of alerting locals and tourists that the Jersey shore is open for business.

Both of these events are great examples of local hackathons that look to give back to the community. However, if you want to learn more about federal level hackathons, open government data, or state level hackathons, please join Data Science MD at their next event titled Open Data Breakfast this Friday, August 9th, 8AM at Unallocated, a local Maryland hackerspace.

Why You Should Not Build a Recommendation Engine

Recommendation engines are arguably one of the trendiest uses of data science in startups today. How many new apps have you heard of that claim to "learn your tastes"? However, recommendation engines are widely misunderstood, both in terms of what is involved in building one and in terms of what problems they actually solve. A true recommender system involves some fairly hefty data science -- it's not something you can build by simply installing a plugin without writing code. With the exception of very rare cases, it is not the killer feature of your minimum viable product (MVP) that will make users flock to you -- especially since there are so many fake and poorly performing recommender systems out there.

A recommendation engine is a feature (not a product) that filters items by predicting how a user might rate them. It solves the problem of connecting your existing users with the right items in your massive inventory (i.e., tens of thousands to millions) of products or content. This means that if you don't have existing users and a massive inventory, a recommendation engine does not truly solve a problem for you. If I can view the entire inventory of your e-commerce store in just a few pages, I really don't need a recommendation system to help me discover products! And if your e-commerce store has no customers, who are you building a recommendation system for? It works for Netflix and Amazon because they have untold millions of titles and products and a large existing user base who are already there to stream movies or buy products. Presenting users with recommended movies and products increases usage and sales, but doesn't create either to begin with.

There are two basic approaches to building a recommendation system: the collaborative filtering method and the content-based approach. Collaborative filtering algorithms take user ratings or other user behavior and make recommendations based on what users with similar behavior liked or purchased. For example, a widely used technique in the Netflix Prize was to use machine learning to build a model that predicts how a user would rate a film based solely on the giant sparse matrix of how 480,000 users rated 18,000 films (100 million data points in all). This approach has the advantage of not requiring an understanding of the content itself, but it does require a significant amount of data on user behavior, ideally millions of data points or more. The more data the better. With little or no data, you won't be able to make recommendations at all -- a pitfall of this approach known as the cold-start problem. This is why you cannot use this approach in a brand new MVP.
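To make the collaborative filtering idea concrete, here is a minimal item-item similarity sketch in R; it is not the Netflix Prize approach itself, and the ratings matrix, user names, and item names are invented for illustration.

# Toy user-by-item ratings matrix (NA = not rated); the numbers are invented
ratings <- matrix(
  c(5, 4, NA, 1,
    4, NA, 1, 2,
    1, 2, 5, NA,
    NA, 1, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4), paste0("item", 1:4)))

# Cosine similarity between two item columns, using only co-rated users
item_sim <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  if (!any(ok)) return(0)
  sum(a[ok] * b[ok]) / (sqrt(sum(a[ok]^2)) * sqrt(sum(b[ok]^2)))
}

# Predict user1's rating for item3 as a similarity-weighted average of the items user1 has rated
rated <- names(which(!is.na(ratings["user1", ])))
sims  <- sapply(rated, function(j) item_sim(ratings[, "item3"], ratings[, j]))
sum(sims * ratings["user1", rated]) / sum(abs(sims))

Real systems replace this brute-force similarity with matrix factorization or other learned models, but the data requirement is the same: the prediction is only as good as the volume of prior ratings you already have.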

The content-based approach requires deep knowledge of your massive inventory of products. Each item must be profiled based on its characteristics. For a very large inventory (the only type of inventory you need a recommender system for), this process must be automatic, which can prove difficult depending on the nature of the items. A user's tastes are then deduced from their ratings, their behavior, or information they enter directly about their preferences. The pitfalls of this approach are, first, that an automated classification system could require significant algorithmic development and is likely not available as a commodity technical solution, and second, that, as with the collaborative filtering approach, the user needs to input information on their personal tastes, though not on the same scale. One advantage of the content-based approach is that it doesn't suffer from the cold-start problem -- even the first user can gain useful recommendations if the content is classified well. But the benefit that recommendations offer to the user must justify the effort required to offer input on personal tastes. That is, the recommendations must be excellent, and the effort required to enter personal preferences must be minimal and ideally baked into general usage. (Note that if your offering is an e-commerce store, this data entry amounts to adding a step to your funnel and could hurt sales more than it helps.) One product that has been successful with this approach is Pandora. Based on naming a single song or artist, Pandora can recommend songs that you will likely enjoy. This is because a single song title offers hundreds of points of data via the Music Genome Project. The effort required to classify every song in the Music Genome Project cannot be overstated -- it took 5 years to develop the algorithm and classify the inventory of music offered in the first launch of Pandora. Once again, this is not something you can do with a brand new MVP.
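For contrast, here is an equally minimal content-based sketch in R: each item carries a hand-built feature vector (a toy stand-in for Music Genome-style attributes; the songs and attributes are invented), a taste profile is averaged from the items a user liked, and the rest of the inventory is ranked by similarity to that profile.

# Invented item features (rows = songs, columns = attributes on a 0-1 scale)
features <- rbind(
  songA = c(tempo = 0.9, acoustic = 0.1, vocals = 0.7),
  songB = c(tempo = 0.8, acoustic = 0.2, vocals = 0.6),
  songC = c(tempo = 0.2, acoustic = 0.9, vocals = 0.3))

cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Build a taste profile from the liked items, then rank the remaining inventory by similarity
liked      <- c("songA")
profile    <- colMeans(features[liked, , drop = FALSE])
candidates <- setdiff(rownames(features), liked)
sort(sapply(candidates, function(s) cosine(features[s, ], profile)), decreasing = TRUE)

Note that all the real work, classifying every item into a useful feature vector, happens before this snippet can even run, which is exactly the multi-year effort the Music Genome Project represents.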

Pandora may be the only example of a successful business where the recommendation engine itself is the core product, not a feature layered onto a different core product. Unless you have the domain expertise, algorithm development skill, massive inventory, and frictionless user data entry design to build your vision of the Pandora for socks / cat toys / nail polish / etc., your recommendation system will not be the milkshake that brings all the boys to the yard. Instead, you should focus on building your core product, optimizing your e-commerce funnel, growing your user base, developing user loyalty, and growing your inventory. Then, maybe one day, when you are the next Netflix or Amazon, it will be worth it to add a recommendation system to increase your existing usage and sales. In the meantime, you can drive serendipitous discovery simply by offering users a selection of most popular content or editor's picks.

Data Visualization: From Excel to ???

So you're an Excel wizard: you make the best graphs and charts Microsoft's classic product has to offer, and you expertly integrate them into your business operations.  Lately you've studied up on all the latest uses of data visualization and dashboards for taking your business to the next level, and you've tried to emulate them with Excel and maybe some help from the Microsoft cloud, but it just doesn't work the way you'd like.  How do you transition your business away from the stalwart of the late 20th century?

If you believe you can transition your business operations to incorporate data visualization, you're likely already gathering raw data, maintaining basic information, and making projections, all of which eventually feed an analysis of alternatives and a final decision for internal and external clients.  In addition, it's not just about using the latest tools and techniques: your operational upgrades must actually make it easier for you and your colleagues to execute daily; otherwise it's just an academic exercise.

Google Docs

There are several advantages to using Google Docs over desktop Excel: it lives in the cloud, it has built-in sharing capabilities, and it offers a wider selection of visualization options.  My favorite, though, is that you can reference and integrate multiple sheets from multiple users to create a multi-user network of spreadsheets.  If you have a good JavaScript programmer on hand you can even define custom functions, which is handy for particularly lengthy calculations, since spreadsheet formulas tend to be cumbersome.  A step further, you could use Google Docs as a database feeding R, which can then power dashboards for the team through a Shiny Server, as sketched below.  Bottom line: Google makes it flexible, allowing you to pivot when necessary, but it can take time to master.
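As a rough sketch of that last step, the spreadsheet-to-R handoff might look like the following, assuming the googlesheets4 package and a hypothetical sheet ID and column names:

library(googlesheets4)   # reads Google Sheets into R data frames

sheet_id <- "YOUR-SHEET-ID"       # hypothetical; replace with your spreadsheet's ID
sales    <- read_sheet(sheet_id)  # returns a data frame (tibble)

# The data frame can then feed a Shiny dashboard, for example:
# output$salesPlot <- renderPlot(plot(sales$week, sales$revenue, type = "l"))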

Tableau Server

Tableau Server is a great option if you want to share information across all users in your organization, access a plethora of visualization tools, use your mobile device, set up dashboards, and keep your information secure.  The question is, how big is your organization?  Tableau Server will cost you $1,000 per user, with a minimum of 10 users, plus 20% yearly maintenance.  If you're a small shop, your internal operations are likely straightforward enough to be outlined to someone new in a good presentation; in that case Tableau is like grabbing the whole toolbox to hang a picture, and may be more than necessary.  If you're a larger organization, Tableau may accelerate your business in ways you never thought of before.

Central Database

There are a number of database options, including Amazon Relational Database Service (RDS) and Google App Engine.  There are plenty of open-source solutions that work with either, and a database will take more time to set up, but with these approaches you're committing to a future.  As you gain more clients and gather more data, you may want to query it to discover insights you know are there from your experience gathering that data.  This is a simple function call from R, and results you like can be turned into a dashboard using a number of different languages.  You may expand your services and hire new employees, but still want easy access to your historical data to set up new dashboards for daily operations.  Even old dashboards may need an overhaul, and being able to pull the data from a standard system, as opposed to coordinating a myriad of spreadsheets, makes pivoting much easier.
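That "simple function call from R" might look like the sketch below, assuming the DBI and RPostgres packages and a hypothetical orders table; the host, credentials, and column names are placeholders.

library(DBI)

# Connect to the central database (connection details are placeholders)
con <- dbConnect(RPostgres::Postgres(),
                 host = "db.example.com", dbname = "operations",
                 user = "analyst", password = Sys.getenv("DB_PASSWORD"))

# Pull the slice of historical data a new dashboard needs
orders <- dbGetQuery(con, "SELECT order_date, region, revenue FROM orders")

dbDisconnect(con)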

Centralized vs. Distributed

Google Docs is very much a distributed system where different users have different permissions, whereas setting up a centralized database restricts most people to using your operational system according to your prescription.  So when do you consolidate into a single system, and when do you give people the flexibility to use their data as they see fit?  It depends, of course.  It depends on the shelf life of that data: if the data is no good next week, be flexible; if it is your company's gold, make sure it lives in a safe, organized, centralized place.  You may want to allow employees to access that gold for their daily purposes, and classic spreadsheets may be all they need for that, but when you've made considerable effort to get the unique data you have, keep it in a safe place and use a database system you know you can easily come back to when necessary.

Data Visualization: The Data Industry

In any industry you either provide a service or a product, and data science is no exception.  Although the people who constitute the data science workforce are in many cases rebranded from statistician, physicist, algorithm developer, computer scientist, biologist, or any other profession that systematically encodes meaning from information as its product, data scientists differ from these earlier professions in that they operate across verticals rather than diving ever deeper down a single rabbit hole.

What defines a business in the Data Science Industry?

There are a lot of companies doing very cool work that revolves around information, but data science has a specific meaning.  Science is the intellectual and practical activity encompassing the systematic study of structure and behavior, and data science focuses on the structure and behavior of whatever dataset you choose.  The line between a business in the data science industry and everyone else is whether it is searching for axioms in your data: irreducible truths about the information at hand.  Other businesses may utilize known truths in a product or service, but just as a welder is not a chemist, a chip manufacturer is not a physicist, a farmer is not a geneticist, a civil engineer is not a materials scientist, a gamifier is not a psychologist, and a politician is not a sociologist, being able to wield the results of science does not make you a master or philosopher of that science.

In organizing data science events I've seen a trend in the demographics of data science (call it the data demographic): people are either new to it, contributing to it, standardizing it, consulting for it, or trading it.  We can use this characterization to review products and services, and ultimately to understand who would use what's being sold, data scientists or otherwise, and whether the business has a chance.

Data Demographics

Let's focus on one product or service for each demographic.

New Recruits

People new to data science, or who like to stay in touch with new developments, want summary information; they are sniffing around the edges and aren't ready to go too far down any one rabbit hole.  Blogs are a bridge: a megaphone for those who've been able to materialize their thoughts, and a communication medium for those who want to understand them.  Flowing Data focuses on data visualization and keeps a ranked list of the best visualizations it has found.  'Success' in a blog can be hard to measure; there are plenty of writers who gain value simply from expressing their thoughts and engaging with a like-minded audience, which can blend from offline to online.  Nathan Yau may be profitable, depending on blog traffic, in that he uses his blog to draw attention to his books Data Points and Visualize This, as well as a tasteful CPM or affiliate model with Tableau and a few others.  I've highlighted this blog because of the excellent Twitter visualizations he features.

Contributors

Contributors are practitioners.  They are a special group because they understand the details of the underlying systems and can recreate them themselves if necessary; they are actors dug deep within the data science industry.  Contributors need and create tools and techniques that help them execute on a day-to-day basis, and whatever expedites this process is as good as gold.  Concurrent has a great, buzzy summary of its primary product, Cascading; I see it as an infrastructure and environment that lets data scientists focus on analytics and applications without having to become Java, Hadoop, and web deployment experts in the process.  Data scientists are interested in the analyses at their fingertips, not the infrastructure that enables them, just as a driver is interested in wielding performance, not in becoming a mechanic; in that same vein, they may know about the infrastructure, but it's not the focus.


Distillers

Distillers take a good idea, break it down, streamline it, and package it.  Distillers are looking for automation, real-time performance, mathematical approximations, simplification: anything that helps get the job done and no more.  Infegy has successfully simplified sentiment analysis in its flagship product Social Radar, which enables real-time monitoring of social media for market analysis.

Want to know whether that press release is being discussed in a positive light, or whether that VP's gaffe is as bad as you're imagining?  This tool can help.  These are not themselves data science applications, but there is certainly data science and data visualization under the hood of Social Radar, and it is a great example of what products are possible with the right architecture of data science techniques.

Consultants

Consulting, from the client's point of view, is another term for "exploring options."  Cubetto Mind is an iOS application based on the mind-mapping diagram technique.  The app is useful for highlighting the relationships within complex groups, and which relationships, existing or not, might make a difference in group dynamics.

So imagine consulting for a new client's organization: there are a number of different players, they all have different priorities, and they may not see how your newfangled data science ideas are ultimately in their interest.  Mind mapping can show the interdependencies within an organization and how your proposals affect those dependencies.

Traders

Valuing and investing in the data science industry is about awareness of supply and demand, and in data science, as in most industries, both are best provided by people.  To generate good leads and connect with the right people you need to filter the signal from the noise: a way to analyze alternatives and come to a conclusion.  Social Action, developed by the UMD Human-Computer Interaction Lab, is designed to explore human networks by filtering relationships based on common attributes and displaying the results as force-graph data visualizations.

Compiling the initial data might be difficult and time-consuming, but for people who can realistically value entrepreneurs in data science this should come naturally, or they should have the right resources at their disposal.  Once the data is loaded, recognizing the next lead is about knowing which metrics you believe are important for measuring data scientists and their work.

Data Visualization: Reactive Functions in Shiny

We now know that Shiny for R is a powerful tool for data scientists to display their work quickly and easily to a broad audience, so let's get into some of the nitty-gritty of what it takes to create Shiny visualizations.  We're not going to get deep into syntax (unless I want to scare everyone off); instead, let's focus on Shiny's basic structures and why they come naturally to those of us who aren't web programmers by trade.

All Shiny applications are built from two basic scripts, ui.R and server.R, and it is the relationship between these two scripts that determines which functions you use.

Static Pages

A static page is the most basic architecture to start with and can be written using only these basic functions:

  • ui.R
    • shinyUI()
    • pageWithSidebar()
    • sidebarPanel()
    • mainPanel()
  • server.R
    • shinyServer()
    • output$variable <- reactive()
    • output$image <- reactivePlot()

All we've set up is a control panel on the left-hand side and the plot outputs on the right-hand side.  You can prompt the user to upload a file with "fileInput()", used within "sidebarPanel()", but to begin with it may be easier to use a file you're familiar with that won't throw you any curve balls.

You can simply display the uploaded table of values using "tableOutput()", but the goal here is to use the side panel to explore your data, and visualizing the table is much more effective than just displaying it.  You can prompt the user for basic yes/no information through "radioButtons()", ask them to choose specific columns with "selectInput()", or let them select multiple values using "checkboxGroupInput()".

If you're not a web programmer by trade, and especially if you're used to procedural, top-to-bottom programming, it is important to note that functions in ui.R are always linked to something in server.R by the "input" and "output" variables, literally.  Shiny reserves these two variables for passing information to server.R ("input") and for passing processed results back to ui.R ("output").  One quirk: server.R uses the full reference "input$inVariable", but ui.R only needs the "outVariable" portion of "output$outVariable".  So, for example, you may create a plot in server.R:

output$imagePlot <- reactivePlot(function() {
  # plotting code goes here
})

But ui.R only uses:

div(class="span6", plotOutput("imagePlot")),

I threw the div() function in there just to show the HTML relationship.
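Putting all of those pieces together, a minimal static page might look like the sketch below.  It uses the function names from this post; current Shiny releases have since replaced reactivePlot() with renderPlot(), and pageWithSidebar() also expects a headerPanel().  The uploaded file is assumed to be a CSV with a numeric first column.

# ui.R
shinyUI(pageWithSidebar(
  headerPanel("Explore a CSV"),
  sidebarPanel(
    fileInput("dataFile", "Upload a CSV file"),
    radioButtons("logScale", "Log scale?", c("No" = "no", "Yes" = "yes"))
  ),
  mainPanel(
    plotOutput("imagePlot")
  )
))

# server.R
shinyServer(function(input, output) {
  output$imagePlot <- reactivePlot(function() {
    if (is.null(input$dataFile)) return(NULL)   # nothing uploaded yet
    df <- read.csv(input$dataFile$datapath)
    x  <- df[[1]]                               # first column, assumed numeric
    if (input$logScale == "yes") x <- log(x)
    hist(x, main = names(df)[1])
  })
})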

Dynamic Pages

There are a couple of ways we can make the pages more dynamic.  We have already given the user control over what data is used in the output graphs, but we can also dynamically choose which options we present to the user.  Based on initial choices (such as the type of data), we can change the range on control functions (such as slider bars), and we can change the type of graph produced based on the input data and configuration.

To have the configuration panel change we use:

output$variable <- reactiveUI(function() {
  # code that builds and returns a UI element goes here
})

Shiny keeps track of that output variable, and if it is used in ui.R, this reactiveUI function will be called first; I have seen race conditions, so don't have too many functions feeding back on themselves.  To create a slider bar, say to limit the values of a joint PDF, place the "sliderInput()" function within reactiveUI:

output$slideRange <- reactiveUI(function() {
  # compute the limits first (the values shown here are placeholders)
  sliderInput("range", "Limit values:", min = 0, max = 100, value = c(0, 100))
})

Then render the output in ui.R with this simple call within sidebarPanel():

uiOutput("slideRange")

The same approach is used to change radio buttons based on the header of an uploaded file, or to add completely new sections to sidebarPanel().  Creating dynamic chart types is easy: just include conditional statements in server.R based on the "input" variables.  So if we're working with a fully populated dataset, create nice histograms or a heat map; if it's sparsely populated, create a force-graph or social graph.  There must be a corresponding function in ui.R for each chart, however, so you are limited by how you set up the page to begin with; your three charts may change their look, but there will always be three.
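A hedged sketch of that server-side conditional, assuming a reactive called userData() that returns the uploaded data frame (a name invented for this example) and an arbitrary 80% threshold for "fully populated"; the sparse branch falls back to a simple histogram here to keep the sketch short, where the post's example would build a force-graph instead.

# server.R fragment: choose the chart type based on how sparse the data is
output$mainChart <- reactivePlot(function() {
  df     <- userData()            # assumed reactive returning the uploaded data frame
  filled <- mean(!is.na(df))      # share of non-missing cells
  nums   <- df[sapply(df, is.numeric)]
  if (filled > 0.8) {
    heatmap(as.matrix(nums))                       # dense data: heat map of numeric columns
  } else {
    hist(unlist(nums), main = "Populated values")  # sparse data: simple fallback
  }
})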

If many graphs depend on a single set of calculations, it may be prudent to use "reactiveValues()" coupled with the "observe()" function so you don't have to call the same function multiple times.  Shiny will keep track of what's been changed and what hasn't, so all you have to do is call the variable and Shiny will make sure it's up to date.  When you're operating on a large data set, this is essential for a real-time interface.  If we were displaying the psychological results of the Three Stooges, we might create a reactive variable like the following that our reactiveUI() functions would reference:

v <- reactiveValues(Names = c("Larry", "Moe", "Curly"), sane = NULL)

observe({
  # calculations on v$Names and v$sane go here
})

As a note for posterity, though we haven't listed many functions, a good data viz is simple and leads the user by your design.  Use the proverbial "eye chart" approach and you'll be lucky if the person dives in at all; offer too little and the user doesn't 'play' with the viz or explore their own data.  This can be likened to gamification, although I'm not advocating FarmVille for data science; I remember reading as a kid that the original Super Mario Bros. for Nintendo was designed to be difficult yet let me win often enough that I couldn't put it down.  Although I'm not sure who Bowser's equivalent is in your data, if you're looking to make an impression on your user, remember that I still know where all the secret warp levels are.

Data Visualization: Shiny Clusters

Clustering is about recognizing associations between data points, which we can easily visualize using different force-graph layout structures (Fruchterman-Reingold, circle, etc.).  Exploring data is about understanding how different data associations change the overall structure of the data corpus.  With hundreds of data fields and no specific rules on how data may or may not be related, it is up to the user to declare an association and verify their instincts through the resulting data viz.  As data scientists, many times we are expected to have The Answer, so when we present our work the audience may not be so willing to question it.  This is where the value of RStudio Shiny becomes clear.  Just as Salman Khan, of Khan Academy, recognized that his nephews would rather listen to his lectures on YouTube, where they are free to rewind and fast-forward without being rude, our audience may want to experiment with the data associations and the overall structure.  Shiny allows data scientists to create an interactive clustering process, an alternative to boring PowerPoint presentations, that lets the audience freely ask their questions.  People tend to remember better when they're part of the process, and our ultimate goal is to make an impression.

Data Science DC held an event a while back on clustering where Dr. Abhijit Dasgupta presented on unsupervised clustering.  The approaches outlined in the good doctor's presentation presume the data are governed by rules between the data points in the set.  However, we can also introduce the declaration or repudiation of associations, where the user declares that data fields are associative, should be used to filter associations between other fields, or should not be included at all.

This is important because when looking for patterns in the data, if we compare everything to everything else we may get the proverbial 'hairball' cluster, where everything is mutually connected.  This is useless if we're trying to find structure for a decision algorithm, where separation and distinction are key.

RStudio Shiny gives you the power to easily build interactive cluster-exploration visualizations, as web apps, in R.  Shiny uses reactive functions to pass inputs and outputs between the ui.R and server.R scripts in the application directory.  Programming a new app takes a little getting used to, since procedural programming in R is different from web programming in R; for instance, assigning a value to the output structure in server.R doesn't necessarily mean it's available to pass to a reactive function a few lines down.  To keep things simple you have to use the right type of 'reactive function' on the server.R side, or the right div function on the ui.R side, but the structure is simple and the rest of coding in R remains exactly the same.  Shiny Server gives you the power to host your web app in the cloud, but be warned that large applications on Amazon EC2 micro instances may run very slowly; that is Amazon's business model and understandable: they want you to upgrade now that you know the potential they offer.
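As a small sketch of the force-graph piece, a Fruchterman-Reingold layout can be rendered inside a Shiny plot using the igraph package (an assumption; the post doesn't name a graph library), with the edge list standing in for the user's declared associations:

library(igraph)

# server.R fragment: draw the declared associations with a force-directed layout
output$clusterPlot <- reactivePlot(function() {
  # In a real app these edges would come from the user's selections; they are invented here
  edges <- data.frame(from = c("age", "age", "income"),
                      to   = c("income", "zipcode", "zipcode"))
  g <- graph_from_data_frame(edges, directed = FALSE)
  plot(g, layout = layout_with_fr(g))
})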

Using the tools Shiny provides, you will end up with a control panel and a series of graphs; how many and where is up to you.  The difference between Shiny and, say, Tableau is the ability to process data on the back end, which is where you interpret the user's selections and dynamically update the visualization presented on the webpage.  There is some UX flexibility to better guide the user experience, but if you want truly interactive graphs you'll have to incorporate JavaScript... another post for another Friday.

Data Visualization: Shiny Democratization

In organizing Data Visualization DC we focus on three themes: The Message, The Process, The Psychology. In other words, ideas and examples of what can be communicated, the tools and know-how to get it done, and how best to communicate. We know intuitively and from experience that the best communication comes in the form of visualizations, and we know there are certain approaches that are more effective than others. What is it about certain visualizations that stimulate memory? Perhaps because we're naturally visual creatures, perhaps visuals allow multiple ideas to be associated with one object, perhaps visuals bring people together and create a common reference for further fluid discussion. Let's explore these ideas.

Visualization is the Natural Medium

No one really has the answer.  The best visualizers have traditionally been artists, and we know that any given artwork speaks to some and not to others.  Visualizations help you think in new ways and make you ask new questions, but each person will ask different questions, and there is no one-size-fits-all.  Visualizations help you have a conversation without even speaking, much the way Khan Academy allows study on your own time.  Trying to turn this into a science is a noble effort, and articles like "The Eyes Have It" do an excellent job of outlining the cutting edge, but when we have to use visualizations to conduct our work more efficiently, we know the first question in communicating is "who's the audience?"  There are certainly best practices (no eye charts, good coloring, associated references, etc.), but the same information will vary in its presentation for each audience.

Case in point: everyone has to eat, and everyone knows food from their own perspective, so if we want to communicate nutrition facts, why not use your audience's craving for delicious-looking food to draw them into exploring the visualization?  Fat or Fiction does an excellent job of this, and I can tell you I never would have known cottage cheese had such a low nutritional value next to cheddar if they weren't juxtaposed for easy comparison.

Ultimately there is a balance: "If you focus more attention on one part of the data, you are ultimately taking attention away from another part of the data," explains Amitabh Varshney, director of the Institute for Advanced Computer Studies at the University of Maryland.  You can attempt to optimize this by hacking your learning, but if you're as curious as I am, you need some way of exploring memory on a regular basis to learn for yourself; it shouldn't have to be a never-ending checklist of best practices.

[Figure: spread of the Harlem Shake meme, February 7-8]

Personally, I believe that social memes are an example of societal memory; they shape and define our culture, giving us objects to reference in conversation.  Looking at the relationships in the initial force-graph presentation of the meme, I can't help but think of neural patterns, the basis of our own memory.  We're all familiar with this challenge when we meet someone from another country, or another generation, and draw an analogy with a favorite movie, song, actor, etc.; if the person is familiar with the social meme, the reference immediately invokes thoughts and memories, which we use to continue the flow of the conversation and introduce new ideas.  The Harlem Shake: anatomy of a viral meme captures how memes emerge over time and lets you drill down all the way to what actions people took in different contexts.  My goal in studying this chart is to come away with how to introduce ideas to each audience, through visualizations or otherwise, for maximum information retention.

Data Visualization: Shiny Spiced Consulting

If you haven't already heard, RStudio has developed an incredibly easy way to deploy R on the web with its Shiny package. For those who have heard, this really isn't news, as bloggers have been writing about it for some months now, but I have primarily seen a focus on how to build Shiny apps, and I feel it's also important to focus on using Shiny apps for clients.

From Pigeon Hole to Social Web Deployment

I was originally taught C/C++, but I didn't really begin programming until I was introduced to Matlab.  It was a breath of fresh air: I no longer had to manage memory, and its mathematics and matrix design allowed me to think about the algorithms as I wrote rather than about the code, much as we write sentences in English without worrying too much about grammar.  Removing those human-interpretive layers and letting the mind focus on the real challenge at hand had an interesting secondary effect: I eventually began thinking and dreaming in Matlab, and at times it was easier to write a quick algorithm than a descriptive paper. What's more, Matlab had beautiful graphics, which greatly simplified the communicative process, as a good graphic is self-evident to a much larger crowd. Fast forward to today: we are an open-source and socially networked community where the web is our medium.  Social networks are not reserved for Facebook and Twitter; in a way, when you use a new package in R you're "friending" its developer and everyone else who uses it.  For working individually this is a great model, but to deploy the power of AI, machine learning, or even simple algorithms and make use of the web medium required the additional skill set of web programming.  It's not an overly complicated skill set to become proficient at, but like running, biking, or swimming, just because you once ran seven-minute miles doesn't mean you still can after a few years of inactivity.

Democratizing Data

Enter RStudio Shiny, an instant hit.  In the second half of 2012 I worked on a project using D3.js, Spring MVC, and Renjin, the idea being administrative in nature: UI developers could focus on UI and algorithm developers could focus on algorithms, perhaps eventually meeting in the middle. I was practically building a custom version of Shiny, and for 90% of the intended user stories, I wish Shiny had been available in early 2012.

Thousands of lines of code were cut by an order of magnitude when reimplementing in Shiny, and just as Matlab once let me think in terms of algorithms, Shiny is letting me think in terms of communicating with my audience.  If I can plot it in R, I can host it on a Shiny Server. R is already excellent for writing algorithms, and once a framework is written in Shiny, integrating new algorithms or new plots is as simple as replacing function calls. This lets you iterate quickly between meetings and create an interactive experience that is self-evident to everyone because it's closely related to the conversation at hand. What's more, because it's web-based, the experience goes beyond the meeting: everyone from the CEO to administrative assistants can explore the underlying data, creating a common thread for discussion, much like chatting about the Oscars around the proverbial water cooler. Shiny Democratizes Data.

Progress

The response to Shiny has been very positive, and its use is quickly becoming widespread. As Yoda said, "See past the decisions already made we can" (I may be paraphrasing): we can see the next steps for Shiny, including interactive plots, user-friendly reactive functions, easier web deployment with Shiny Server, and integration of third-party libraries such as googleVis and D3.js. With respect to Yoda, Shiny has allowed me to decide that dynamic interaction with data, for the wider data science community, is the clear next step.

Risks of Predictive Analytics

Based on an analysis of more than a half million public posts on message boards, blogs, social media sites and news sources, IBM predicts that ‘steampunk,’ a sub-genre inspired by the clothing, technology and social mores of Victorian society, will be a major trend to bubble up, and take hold, of the retail industry. --Jan. 14, 2013 IBM Press Release

Really? Is this a good idea? Not steampunk fashion -- that's clearly a bad idea. But publicizing this data-driven prediction -- is that a good idea? Could this press release actually cause an increase in rayguns and polished brass driving goggles?

I think this illustrates a couple of important, potentially negative consequences of making and communicating statistical predictions. The first risk is that making predictions may sway people to follow the predictions. The second risk is that making predictions may sway people to inaction and complacency. Both of these risks may need to be actively managed to prevent advanced predictive modeling from causing more harm than good.

Recently, none other than Nate Silver indicated that if he thought his predictive models of elections were swaying the results, he would stop publishing them. There are longstanding questions about bandwagon and "back the winner" effects in polling and voting. If your predictions are widely seen as accurate, as Silver's are, then your statements may increase votes for the perceived winner and decrease them for the perceived loser. It's well known that more people report, after the fact, that they voted for a winning candidate than actually did so.

There are other ways that prediction can drive outcomes in unpredicted or undesired ways, especially when predictions are tied to action. If your predictive model estimates increased automobile traffic between two locations, and you build a highway to speed that traffic, then the "induced demand" effect (added capacity causes increased use) will almost certainly prove your predictive model correct, even if the model was predicting only noise. The steampunk prediction may fall into this category, sadly.

The other problem is exemplified by sales forecasts. If your predictions are read by the people whose effort is needed to realize the forecast results, they may be less likely to come true. Your predictions are probably based on a number of assumptions, including that the sales team is putting in the same type of effort that they did last month or last year. But if forecast results are perceived as a "done deal," that assumption will be violated. A prediction is not a target, and should not be seen or communicated as such.

How can these problems be mitigated? In some cases, by better communications strategies. Instead of providing a point estimate of sales ("we're going to make $82,577.11 next week!"), you may be better off providing the numbers from an 80% or 90% confidence interval: "if we slack off, we could make as little as $60,000, but if we work hard, we could make as much as $100,000." Of course, if you have the sort of data where you can include sales effort as a predictor, you can do even better than that.

Another trick to keeping people motivated is to let them beat their targets most, but not all, of the time. How do you do this? Consider providing the 20th percentile of the forecast distribution as the target. If your model is well calibrated, those targets will be met 80% of the time. There is extensive psychological and business research on the best way to set goals, and my (limited) understanding of it is that people who think they are doing well, but with room for improvement, are best engaged.
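A small numeric sketch of both ideas, assuming purely for illustration that next week's sales forecast is roughly normal with the mean and standard deviation shown:

# Hypothetical forecast for next week's sales
mu    <- 82577.11   # point estimate
sigma <- 12000      # assumed forecast uncertainty

# An 80% interval to communicate instead of the point estimate
qnorm(c(0.10, 0.90), mean = mu, sd = sigma)   # roughly 67,200 to 98,000

# A target set at the 20th percentile, which a well-calibrated team beats about 80% of the time
qnorm(0.20, mean = mu, sd = sigma)            # roughly 72,500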

Returning to the upcoming steampunk sartorial catastrophe, perhaps IBM should have exercised some professional judgment, as Nate Silver seems to be doing, and just kept its big blue mouth shut on this one.