Methods

Announcing the Publication of Practical Data Science Cookbook

Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. 

Simulation and Predictive Analytics

This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary. A front-page article over the weekend in the Wall Street Journal indicated that the number one profession of interest to tech firms is the data scientist: someone whose analytic, computing, and domain skills enable them to detect signals in data and use them to advantage. Although the terms are squishy, the push today is for "big data" and "predictive analytics" skills, which allow firms to leverage the deluge of data that is now accessible.

I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.

The two traditional aspects of the scientific method, namely theory and experimentation, have been enhanced with computation being added as a third leg. Sitting at the center of computation is simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.

The questions that I will address here are how Monte Carlo simulation and discrete-event simulation differ and how they fit into the general framework of predictive analytics.

First, how do Monte Carlo and discrete-event simulation differ? Monte Carlo simulation is appropriate when the passage of time does not play a significant role. Probability calculations involving problems associated with playing cards, dice, and coins, for example, can be solved by Monte Carlo.
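
As a concrete illustration (not from the original post), here is a minimal Monte Carlo sketch in Python that estimates the probability that two fair dice sum to seven; the exact answer is 1/6, so the estimate is easy to check.

import random

def estimate_prob_sum_is_seven(trials=1_000_000, seed=42):
    """Monte Carlo estimate of P(two fair dice sum to 7); the exact value is 1/6."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if rng.randint(1, 6) + rng.randint(1, 6) == 7)
    return hits / trials

print(estimate_prob_sum_is_seven())  # prints roughly 0.1667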

Discrete-event simulation, on the other hand, has the passage of time as an integral part of the model. The classic application areas in which discrete-event simulation has been applied are queuing, inventory, and reliability. As an illustration, a mathematical model for a queue with a single server might consist of (a) a probability distribution for the time between arrivals to the queue, (b) a probability distribution for the service time at the queue, and (c) an algorithm for placing entities in the queue (first-come-first-served is the usual default). Discrete-event simulation can be coded in any algorithmic language, although the coding is tedious. Because of the complexities of coding a discrete-event simulation, dozens of languages have been developed to ease implementation of a model.
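
To make parts (a) through (c) concrete, here is a short, illustrative Python sketch of a first-come-first-served single-server queue; the exponential interarrival and service times are my own assumption (an M/M/1 queue), not something specified in the post.

import random

def simulate_queue(n_customers=100_000, arrival_rate=0.9, service_rate=1.0, seed=1):
    """Return the average time customers spend waiting in a FIFO single-server queue."""
    rng = random.Random(seed)
    arrival = 0.0          # (a) arrival times built from exponential interarrival gaps
    prev_departure = 0.0
    total_wait = 0.0
    for _ in range(n_customers):
        arrival += rng.expovariate(arrival_rate)
        service = rng.expovariate(service_rate)        # (b) service time distribution
        start = max(arrival, prev_departure)           # (c) first-come-first-served
        total_wait += start - arrival
        prev_departure = start + service
    return total_wait / n_customers

print(simulate_queue())  # theory for these rates: rho / (mu - lambda) = 9.0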

The field of predictive analytics leans heavily on the tools of data mining in order to identify patterns and trends in a data set. Once an appropriate question has been posed, these patterns and trends in explanatory variables (often called covariates) are used to predict future behavior of variables of interest. There is both an art and a science to predictive analytics. The science side includes the standard tools associated with mathematics, computation, probability, and statistics. The art side consists mainly of making appropriate assumptions about the mathematical model constructed for predicting future outcomes. Simulation is used primarily for verification and validation of the mathematical models associated with a predictive analytics model. It can be used to determine whether the probabilistic models are reasonable and appropriate for a particular problem.

Two sources for further training in simulation are a workshop in Catonsville, Maryland on September 12-13 by Barry Lawson (University of Richmond) and me or the Winter Simulation Conference (December 7-10, 2014) in Savannah.

Saving Money Using Data Science for Dummies

I'd like to tell you a story about how I made "data science" work for me without writing a single line of code, launching a single data analysis or visualization app, or even looking at a single digit of data.

"No, way!", you say? Way, my friends. Way.

Our company is developing VoteRaise.com, a new way for voters to fundraise for the candidates they'd like to have run for political office. We realized that, even here in DC, there is no active community of people exploring how innovations in technology & techniques impact political campaigning, so we decided to create it.

I started the process of creating RealPolitech, a new Meetup, last night. It's clear that Meetup makes use of data and, more importantly, thinks about how to use that data to help its users. For instance, when it came time to pick the topics that RealPolitech would cover, Meetup did a great job of making suggestions. All I had to do was type a few letters, and I was presented with a list of options to select.

I got the whole meetup configured quickly. I was impressed by the process. But, when it came time to make a payment, I went looking for a discount code. Couldn't find one in FoundersCard. Couldn't find it in Fosterly. Or anywhere else online. Seventy-two dollars for six months is a good deal, already. But, still, we're a bootstrapped startup, and every dollar saved is a dollar earned. So, I decided to try something.

I just... stopped. I figured, if Meetup is gathering and providing data during the configuration process, they must be doing the same during checkout. So, I figured I'll give them a day or two to see if I receive an unexpected "special offer". Sure enough, by the time I woke up, I had a FIFTY PERCENT discount coupon, along with a short list of people who may be interested in RealPolitech, as incentive to pay the dues and launch the Meetup.

RealPolitech is up and running. You can find it here, come join and share your ideas and expertise! We are putting together our kickoff event, and lining up speakers and sponsors.

Oh, and I saved 50% off the dues by leveraging my knowledge of data science to get the outcome I wanted by doing... absolutely nothing.

Facility Location Analysis Resources Incorporating Travel Time

This is a guest blog post by Alan Briggs. Alan is an operations researcher and data scientist at Elder Research. Alan and Harlan Harris (DC2 President and Data Science DC co-organizer) have co-presented a project on location analysis and Meetup location optimization at the Statistical Programming DC Meetup and an INFORMS-MD chapter meeting. Recently, Harlan presented a version of this work at the New York Statistical Programming Meetup. There was some great feedback on the Meetup page asking for some additional resources. This post by Alan is in response to that question.

If you’re looking for a good text resource to learn some of the basics about facility location, I highly recommend grabbing a chapter of Dr. Michael Kay’s e-book (PDF), available for free from his logistics engineering website. He gives an excellent overview of some of the basics of facility location, including single-facility location, multi-facility location, facility location-allocation, etc. At ~20 pages, it’s entirely approachable, but technical enough to pique the interest of the more technically minded analyst. For a deeper dive into some of the more advanced research in this space, I’d recommend using some of the subject headings in his book as seeds for a simple search on Google Scholar. It’s nothing super fancy, but there are plenty of articles in the public domain that relate to minisum/minimax optimization and all of their narrowly tailored implementations.

One of the really nice things about the data science community is that it is full of people with an inveterate dedication to critical thinking. There is nearly always some great [constructive] criticism about what didn’t quite work or what could have been a little bit better. One of the responses to the presentation recommended optimizing with respect to travel time instead of distance. Obviously, in this work, we’re using Euclidean distance as a proxy for time. Harlan cites laziness as the primary motivation for this shortcut, and I’ll certainly echo his response. However, a lot of modeling boils down to cutting your losses, getting a good-enough solution, or trying to find the balance among your own limited resources (e.g., time, data, technique). For the problem at hand, clearly the goal is to make Meetup attendance convenient for the greatest number of people; we want to get people to Meetups. But, for our purposes, a rough order of magnitude was sufficient. Harlan humorously points out that the true optimal location for a version of our original analysis in DC was an abandoned warehouse—if it really was an actual physical location at all. So, when you really just need a good solution, and the precision associated with true optimality is unappreciated, distance can be a pretty good proxy for time.

A lot of times good enough works, but there are some obvious limitations. In statistics, the law of large numbers holds that independent measurements of a random quantity tend toward the theoretical average of that quantity. For this reason, logistics problems can be more accurately estimated when they involve a large number of entities (e.g. travelers, shipments, etc.). For the problem at hand, if we were optimizing facility location for 1,000 or 10,000 respondents, again, using distance as a proxy for time, we would feel much better about the optimality of the location. I would add that similar to large quantities, introducing greater degrees of variability can also serve to improve performance. Thus, optimizing facility location across DC or New York may be a little too homogeneous. If instead, however, your analysis uses data across a larger, more heterogeneous area, like say an entire state where you have urban, suburban and rural areas, you will again get better performance on optimality.

Let’s say you’ve weighed the pros and cons and you really want to dive deeper into optimizing based on travel time; there are a couple of different options you can consider. First, applying a circuity factor to the Euclidean distance can account for non-direct travel between points. So, going from point A to point B may actually take 1.2 units as opposed to the straight-line distance of 1 unit. That may not be as helpful for a really complex urban space where feasible routing is not always intuitive and travel times can vary wildly. However, it can give some really good approximations, again, over larger, more heterogeneous spaces. An extension to a single circuity factor would be to introduce a gradient circuity factor that is proportional to population density. There is some really good zip code data available that can be used to estimate population density.

Increasing the circuity factor in higher-density locales and decreasing it in lower-density ones can help by providing a more realistic assessment of how far off the straight-line distance the average person would have to commute. For the really enthusiastic modeler who has some good data skills and is looking for even more integrity in their travel time analysis, there are a number of websites that provide road network information. They list roads across the United States by functional class (interstate, expressway & principal arterial, urban principal arterial, collector, etc.) and even provide shape files and the like for GIS. I've done basic speed limit estimation by functional class, but you could also do something like introduce a speed gradient respective to population density (as we mentioned above for the circuity factor). You could also derive some type of inverse speed distribution that slows traffic at certain times of day based on rush hour travel in or near urban centers.
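
As a rough sketch of these ideas (the circuity and speed gradients below are made-up illustrative functions, not calibrated values), travel time can be estimated from straight-line distance, a density-dependent circuity factor, and a density-dependent average speed:

import math

def euclidean_miles(p, q):
    """Approximate straight-line distance in miles between two (lat, lon) points."""
    lat_mi = 69.0 * (p[0] - q[0])
    lon_mi = 69.0 * math.cos(math.radians((p[0] + q[0]) / 2.0)) * (p[1] - q[1])
    return math.hypot(lat_mi, lon_mi)

def circuity_factor(pop_density):
    """Hypothetical gradient: denser areas get a larger detour penalty."""
    return 1.2 + 0.3 * min(pop_density / 10_000.0, 1.0)    # 1.2 rural -> 1.5 dense urban

def average_speed_mph(pop_density):
    """Hypothetical speed gradient: dense areas move more slowly than rural ones."""
    return 55.0 - 30.0 * min(pop_density / 10_000.0, 1.0)  # 55 mph rural -> 25 mph urban

def travel_time_minutes(origin, dest, pop_density):
    distance = circuity_factor(pop_density) * euclidean_miles(origin, dest)
    return 60.0 * distance / average_speed_mph(pop_density)

# Example: two points in DC, roughly two miles apart, in a dense area
print(travel_time_minutes((38.907, -77.037), (38.888, -77.009), pop_density=11_000))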

In the words of George E. P. Box, “all models are wrong, but some models are useful.” If you start with a basic location analysis and find it to be wrong but useful, you may have done enough. If not, however, perhaps you can make it useful by increasing complexity in one of the ways I have mentioned above.

Flask Mega Meta Tutorial for Data Scientists

Introduction

Data science isn't all statistical modeling, machine learning, and data frames. Eventually, your hard work pays off and you need to give back the data and the results of your analysis; those blinding insights that you and your team uncovered need to be operationalized in the final stage of the data science pipeline as a scalable data product. Fortunately for you, the web provides an equitable platform to do so and Python isn't all NumPy, Scipy, and Pandas. So, which Python web application framework should you jump into?


There are numerous web application frameworks for Python, with the 800-lb gorilla being Django, "the web framework for perfectionists with deadlines." Django has been around since 2005 and, as the community likes to say, is definitely "batteries included," meaning that Django comes with a large number of bundled components (i.e., decisions that have already been made for you). Django comes built in with its chosen object-relational mapper (ORM) to make database access simple, a beautiful admin interface, a template system, a cache system for improving site performance, internationalization support, and much more. As a result, Django has a lot of moving parts and can feel large, especially for beginners. This behind-the-scenes magic can obscure what actually happens and complicates matters when something goes wrong. From a learning perspective, this can make gaining a deeper understanding of web development more challenging.

Enter Flask, the micro web framework written in Python by Armin Ronacher. Purportedly, it came out as an April Fool's joke but proved popular enough not to go quietly into the night. The incredible thing about Flask is that you can have a web application running in about 10 lines of code and a few minutes of effort. Despite this, Flask can and does run real production sites.
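
For the curious, a complete Flask application really is about that small; something like the following minimal sketch is all it takes to serve a page from the development server:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from Flask!"

if __name__ == "__main__":
    app.run(debug=True)  # development server only; use a real WSGI server in production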

As you need additional functionality, you can add it through the numerous extensions available. Need an admin interface? Try Flask-Admin. Need user session management and the ability for visitors to log in and out? Check out Flask-Login.

While Django advocates may argue that by the time you add all of the "needed" extensions to Flask, you get Django, others would argue that a micro-framework is the ideal platform to learn web development. Not only does the small size make it easier for you to get your head around it at first, but the fact that you need to add each component yourself requires you to understand fully the moving parts involved. Finally, and very relevant to data scientists, Flask is quite popular for building RESTful web services and creating APIs.
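
To illustrate that last point, here is what a tiny JSON endpoint might look like; the predict function and the /predict route are placeholders of my own, standing in for whatever model or analysis you want to expose:

from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    """Stand-in for a trained model; replace with your own scoring logic."""
    return sum(features) / len(features) if features else 0.0

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json(force=True)   # expects e.g. {"features": [1.0, 2.0, 3.0]}
    return jsonify({"score": predict(payload.get("features", []))})

if __name__ == "__main__":
    app.run(port=5000)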

As Flask documentation isn't quite as extensive as the numerous Django articles on the web, I wanted to assemble the materials that I have been looking at as I explore the world of Flask development.

For Beginners - Getting Started

The Official Flask Docs

The official Flask home page has a great set of documentation, list of extensions to add functionality, and even a tutorial to build a micro blogging site. Check it out here.

Blog post on Learning Flask

This tutorial is really a hidden gem designed for those who are new to web programming and Flask. It contains lots of small insights into the process via examples that really help connect the dots.

Full Stack Python

While not exclusively focused on Flask, Fullstackpython.com is a fantastic page that offers a great amount of information and links about the full web stack, from servers, operating systems, and databases to configuration management, source control, and web analytics, all from the perspective of Python. A must-read for those new to web development.

Starting A Python Project the Right Way

Jeff Knupp again walks the beginner Python developer through how he starts a new Python project.


For Intermediate Developers

The Flask Mega Tutorial

The Flask Mega Tutorial is aptly named as it has eighteen separate parts walking you through the numerous aspects of building a website with Flask and various extensions. I think this tutorial is pretty good but have heard some comment that it can be a bit much if you don't have a strong background in web development.

Miguel Grinberg's tutorial has been so popular that it is even being used as the basis of an O'Reilly book, Flask Web Development, that is due out on May 25, 2014. An early release is currently available, so called "Raw & Unedited," here.

A snippet about the book is below:

If you have Python experience, and want to learn advanced web development techniques with the Flask micro-framework, this practical book takes you step-by-step through the development of a real-world project. It’s an ideal way to learn Flask from the ground up without the steep learning curve. The author starts with installation and brings you to more complicated topics such as database migrations, caching, and complex database relationships.

Each chapter in this book focuses on a specific aspect of the project—first by exploring background on the topic itself, and then by walking you through a hands-on implementation. Once you understand the basics of Flask development, you can refer back to individual chapters to reinforce your grasp of the framework.

Flask Web Development

Beating O'Reilly to the punch, Packt Publishing started offering the book, "Instant Flask Web Development" from Ron DuPlain in August 2013.

Flask - Blog: Building a Flask Blog: Part 1

This blog contains two different tutorials in separate posts. This one tackles the familiar task of building a blog using Flask-SQLAlchemy, WTForms, Flask-WTF, Flask-Migrate, WebHelpers, and PostgreSQL. The second one shows the creation of a music streaming app.

Flask with uWSGI + Nginx

This short tutorial and GitHub repository shows you how to set up a simple Flask app with uWSGI and the web server Nginx.

Python Web Applications with Flask

This extensive three-part blog post from Real Python works its way through the development of a mid-sized web analytics application. The tutorial was last updated November 17, 2013 and has the added bonus of demonstrating the use of virtualenv and GitHub as well.

Python and Flask are Ridiculously Powerful

Jeff Knupp, author of Idiomatic Python, is fed up with online credit card payment processors, so he builds his own with Flask in a few hours.

More Advanced Content

The Blog of Erik Taubeneck

This is the blog of Erik Taubeneck, "a mathematician/economist/statistician by schooling, a data scientist by trade, and a python hacker by night." He is a Hacker School alum in NYC and a contributor to a number of Flask extensions (and an all-around good guy).

How to Structure Flask Applications

Here is a great post, recommended by Erik, discussing how one seasoned developer structures his Flask applications.

Modern Web Applications with Flask and Backbone.js - Presentation

Here is a presentation about building a "modern web application" using Flask for the backend and Backbone.js in the browser. Slide 2 shows a great timeline of the last 20 years in the history of web applications, and slide 42 is a great overview of the distribution of Flask extensions in the backend and JavaScript in the frontend.

Modern Web Application Framework: Python, SQL Alchemy, Jinja2 & Flask - Presentation

This 68-slide presentation on Flask from Devert Alexandre is an excellent resource and tutorial covering Flask, SQLAlchemy (the very popular and practically default ORM for Flask), and the Jinja2 template language.

Diving Into Flask - Presentation

A 66-slide presentation on Flask by Andrii V. Mishkovskyi from EuroPython 2012 that also discusses Flask-Cache, Flask-Celery, and blueprints for structuring larger applications.

A Tutorial for Deploying a Django Application that Uses Numpy and Scipy to Google Compute Engine Using Apache2 and mod_wsgi

by Sean Patrick Murphy

Introduction

This longer-than-initially planned article walks one through the process of deploying a non-standard Django application on a virtual instance provisioned not from Amazon Web Services but from Google Compute Engine. This means we will be creating our own virtual machine in the cloud and installing all necessary software to have it serve content, run the Django application, and handle the database all in one. Clearly, I do not expect an overwhelming amount of traffic to this site. Also, note that Google Compute Engine is very different from Google App Engine.

What makes this app "non-standard" is its use of both the Numpy and Scipy packages to perform fast computations. Numpy and Scipy are based on C and Fortran respectively and both have complicated compilation dependencies. Binaries may be available in some cases but are not always available for your preferred deployment environment. Most importantly, these two libraries prevented me from deploying my app to either Google App Engine (GAE) or to Heroku. I'm not saying that it is impossible to deploy Numpy- or Scipy-dependent apps on either service. However, neither service supports apps dependent on both Scipy and Numpy out-of-the-box although a limited amount of Googling suggests it should be possible.

In fact, GAE could have been an ideal solution if I had re-architected the app, separating the Django application from the computational code. I could have run the Django application on GAE and allowed it to spin up a GCE instance as needed to perform the computations. One concern with this idea is the latency involved in spinning up the virtual instance for computation. Google Compute Engine instances spring to life quickly but not instantaneously. Maybe I'll go down this path for version 2.0 if there is a need.

Just in case you are wondering, the Django app in question is here https://github.com/murphsp1/ppi-css.com and the live site is here www.ppi-css.com.

If you have any questions or comments or suggestions, please leave them in the comments section below.

Google Compute Engine (GCE)

I am a giant fan of Google Compute Engine and love the fact that Amazon's EC2 finally has a strong competitor. With that said, GCE definitely does not have the same number of tutorials or help content available online.

I will assume that you can provision your own instance in GCE either using gcutil at the command line or through the cloud services web interface provided by Google.

Once you have your project up and running, you will need to configure the firewall settings for your project. You can do this at the command line of your local machine using the command line below:

gcutil addfirewall http2 --description="Incoming http allowed." --allowed="tcp:http" --project="XXXXXXXXXXXXXX"

Update the Instance and Install Tools

Next, boot the instance and ssh into it from your local machine. The command line parameters required to ssh in can be daunting but fortunately Google gives you a simple way to copy and paste the command from the web-based cloud console. The command line could look something like this:

gcutil --service_version="v1beta16" --project="XXXXXXXXXX" ssh  --zone="GEOGRAPHIC_ZONE" "INSTANCE_NAME"

Next, we need to update the default packages installed on the GCE instance:

sudo apt-get update
sudo apt-get upgrade

and install some needed development tools:

sudo apt-get --yes install make
sudo apt-get --yes install wget
sudo apt-get --yes install git

and install some basic Python-related tools:

sudo apt-get --yes install python-setuptools
sudo easy_install pip
sudo pip install virtualenv

Note that in many of my sudo apt-get commands I include --yes. This flag just prevents me from having to type "Y" to agree to the file download.

Install Numpy and Scipy (SciPy requires Fortran compiler)

To install SciPy, Python's general purpose scientific computing library from which my app needs a single function, we need the Fortran compiler:

sudo apt-get --yes install gfortran

and then we need Numpy and Scipy and everything else:

sudo apt-get --yes install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

Finally, we need to add ProDy, a protein dynamics and sequence analysis package for Python.

sudo pip install prody

Install and Configure the Database (MySQL)

The Django application needs a database, and there are many to choose from, most likely either Postgres or MySQL. Here, I went with MySQL for the simple reason that it took fewer steps to get the MySQL server up and running on the GCE instance than the Postgres server did. I actually run Postgres on my development machine.

sudo apt-get --yes install mysql-server

The installation process should prompt you to create a root password. Please do so for security purposes.

Next, we are going to execute a script to secure the MySQL installation:

mysql_secure_installation

You already have a root password from the installation process but otherwise answer "Y" to every question.

With the DB installed, we now need to create our database for Django (mine is creatively called django_test). Please note that there must *not* be a space between "--password=" and your password on the command line.

mysql --user=root --password=YOUR_PASSWORD
mysql> create database django_test;
mysql> quit;

Finally, for this step, we need the MySQL database connector for Python, which will be used by our Django app:

sudo apt-get install python-mysqldb

Install the Web Server (Apache2)

You have two main choices for your web server: either the tried-and-true Apache (now up to version 2+) or nginx. Nginx is supposed to be the new sexy when it comes to web servers, but this newness comes at the price of less documentation and fewer tutorials online. Thus, let's play it safe and go with Apache2.

First Attempt

First things first, we need to install apache2 and mod_wsgi. mod_wsgi is an Apache HTTP server module that provides a WSGI-compliant interface for web applications developed in Python.

sudo apt-get --yes install apache2 libapache2-mod-wsgi

This seems to be causing a good number of problems. In my Django error logs I see:

[Mon Nov 25 13:25:27 2013] [error] [client 108.21.2.20] Premature end of script headers: wsgi.py

and in:

cat /var/log/apache2/error.log

I see things like:

[Sun Nov 24 16:15:02 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Sun Nov 24 16:15:02 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.

with the occasional segfault:

[Mon Nov 25 00:02:55 2013] [notice] child pid 12532 exit signal Segmentation fault (11)
[Mon Nov 25 00:02:55 2013] [notice] seg fault or similar nasty error detected in the parent process

which is a strong indicator that something isn't quite working correctly.

Second Attempt

A little bit of Googling suggests that this could be the result of a number of issues with a prebuilt mod_wsgi. The solution seems to be to grab the source code and compile it on my GCE instance. To do that, I:

sudo apt-get install apache2-prefork-dev

Now, we need to grab mod_wsgi while ssh'ed into the GCE instance:

wget https://modwsgi.googlecode.com/files/mod_wsgi-3.4.tar.gz
tar -zxvf mod_wsgi-3.4.tar.gz
cd mod_wsgi-3.4
./configure
make
sudo make install

Once mod_wsgi is installed, the Apache server needs to be told about it. On Apache 2, this is done by adding the load declaration and any configuration directives to the /etc/apache2/mods-available/ directory.

The load declaration for the module needs to go on a file named wsgi.load (in the /etc/apache2/mods-available/ directory), which contains only this:

LoadModule wsgi_module /usr/lib/apache2/modules/mod_wsgi.so

Then you have to activate the wsgi module with:

a2enmod wsgi

Note: a2enmod stands for "apache2 enable mod"; this executable creates the symlinks for you. Running a2enmod wsgi is equivalent to:

cd /etc/apache2/mods-enabled
ln -s ../mods-available/wsgi.load
ln -s ../mods-available/wsgi.conf # if it exists

Now we need to update the virtual hosts settings on the server. For Debian, this is here:

/etc/apache2/sites-enabled/000-default 

Restart the service:

sudo service apache2 restart 

and also change the owner of the directory on the GCE instance that will contain the files to be served by apache:

sudo chown -R www-data:www-data /var/www

Now that we have gone through all of that, it is nice to see things working. By default, the following page is served by the install:

/usr/share/apache2/default-site/index.html

If you go to the URL of the server (obtainable from the Google Cloud console), you should see a very simple example html page.

Setup the Overall Django Directory Structure on the Remote Server

I have seen many conflicting recommendations in online tutorials about how to best lay out the directory structure of a Django application in development. It would appear that after you have built your first dozen or so Django projects, you start formulating your own opinions and create a standard project structure for yourself.

Obviously, this experiential knowledge is not available to someone building and deploying one of their first sites. And your directory structure directly impacts your app's routing and the daunting-at-first settings.py file. If you move around a few directories, things tend to stop working, and the resulting error messages aren't necessarily the most helpful.

The picture gets even murkier when you go from development to production, and I have found much less discussion on best practices here. Luckily, I could ping my friend Ben Bengfort and tap into his devops knowledge. The directory structure on the remote server, as recommended by Mr. Bengfort, looks like this:

/var/www/ppi-css.com
/var/www/ppi-css.com/htdocs/static
/var/www/ppi-css.com/htdocs/media
/var/www/ppi-css.com/django
/var/www/ppi-css.com/logs

Apache will see the htdocs directory as the main directory from which to serve files.

/static will contain the collected set of static files (images, css, javascript, and more) and media will contain uploaded documents.

/logs will contain relevant apache log files.

/django will contain the cloned copy of the Django project from Git Hub.

The following shell commands get things setup correctly:

sudo mkdir /var/www/ppi-css.com
sudo mkdir /var/www/ppi-css.com/htdocs
sudo mkdir /var/www/ppi-css.com/htdocs/static
sudo mkdir /var/www/ppi-css.com/htdocs/media
sudo mkdir /var/www/ppi-css.com/django
sudo mkdir /var/www/ppi-css.com/logs
cd /var/www/ppi-css.com/django
sudo git clone https://github.com/murphsp1/ppi-css.com.git

Configuring Apache for Our Django Project

With the directory structure of our Django application sorted, let's continue configuring apache.

First, let's disable the default virtual host for apache:

sudo a2dissite default

There will be aliases in the virtual host configuration file that let the apache server know about this structure. Fortunately, I have included the ppi-css.conf file in the repository and it must be moved into position:

sudo cp /var/www/ppi-css.com/django/ppi-css.com/ppi-css.conf /etc/apache2/sites-available/ppi-css.com

Next, we must enable the site using the following command:

sudo a2ensite ppi-css.com

and we must reload the apache2 service (remember this command, as you will probably be using it a lot):

sudo service apache2 reload

Now, when I restarted or reloaded the apache2 service, I got the following error message:

ERROR MESSAGES:
[....] Restarting web server: apache2apache2: apr_sockaddr_info_get() failed for (none)
apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
 ... waiting apache2: apr_sockaddr_info_get() failed for (none)
apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
. ok 
Setting up ssl-cert (1.0.32) ...
hostname: Name or service not known

To remove this, I simply added the following line:

ServerName localhost

to the /etc/apache2/apache2.conf file using vi. A quick

sudo service apache2 reload

shows that the error message has been banished.

Install a Few More Python Packages

The Django application contains a few more dependencies that were captured in the requirements file included in the repository. Note that since the installation of Numpy and Scipy has already been taken care of, those lines in the requirements.txt file can be removed.

sudo pip install -r /var/www/ppi-css.com/django/ppi-css.com/requirements.txt

Database Migrations

Before we can perform the needed database migrations, we need to update the database section of settings.py. It should look like this:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'django_test',  # the database we created above
        'USER': 'root',
        'PASSWORD': 'INSERT_YOUR_PASSWORD',
        'HOST': '',         # Set to empty string for localhost.
        'PORT': '',         # Set to empty string for default.
    }
}

From the GCE instance, issue the following commands:

python /var/www/ppi-css.com/django/ppi-css.com/manage.py syncdb
python /var/www/ppi-css.com/django/ppi-css.com/manage.py migrate

Deploying Your Static Files

Static files, your css, javascript, images, and other unchanging files, can be problematic for new Django developers. When developing, Django is more than happy to serve your static files for you with its local development server. However, this does not work for production settings.

The key to this is your settings.py file. In this file, we see:

# Absolute path to the directory static files should be collected to.
# Don't put anything in this directory yourself; store your static files
# in apps' "static/" subdirectories and in STATICFILES_DIRS.
# Example: "/home/media/media.lawrence.com/static/"
STATIC_ROOT = ''      #os.path.join(PROJECT_ROOT, 'static')

# URL prefix for static files.
# Example: "http://media.lawrence.com/static/"
STATIC_URL = '/static/'

# Additional locations of static files
STATICFILES_DIRS = (
    #os.path.join(PROJECT_ROOT,'static'),
    # Put strings here, like "/home/html/static" or "C:/www/django/static".
    # Always use forward slashes, even on Windows.
    # Don't forget to use absolute paths, not relative paths.
)      

For production, STATIC_ROOT must contain the directory where Apache2 will serve static content from. In this case, it should look like this:

STATIC_ROOT = '/var/www/ppi-css.com/htdocs/static'

For development, STATIC_ROOT looked like:

STATIC_ROOT = ''

Next, Django comes with a handy mechanism to round up all of your static files (in the case that they are spread out in separate app directories if you have a number of apps in a single project) and push them to a single parent directory when you go into production.

./manage.py collectstatic

Be very careful when going into production. If any of the directories listed in the STATICFILES_DIRS variable do not exist on your production server, collectstatic will fail and will not do so gracefully. The official Django documentation has a pretty good description of the entire process.

More Settings.py Updates

We aren't quite done with the settings.py file and need to update the MEDIA_ROOT variable with the appropriate directory on the server:

MEDIA_ROOT = "/var/www/ppi-css.com/htdocs/media/" #os.path.join(PROJECT_ROOT, 'media')

Next, the ALLOWED_HOSTS variable must be set as shown below when the Django application is run in production mode and not in debug mode:

ALLOWED_HOSTS = [
        '.ppi-css.com', 
        'ppi-css.com',
]

And finally, check to make sure that the paths listed in the wsgi.py reflect the actual paths on the GCE instance.

A Very Nasty Bug

After having gone through all of that work, I found a strange bug where the website would work fine but then become unresponsive. After extensive Googling, I found the error, best explained below:

Some third party packages for Python which use C extension modules, and this includes scipy and numpy, will only work in the Python main interpreter and cannot be used in sub-interpreters, which mod_wsgi uses by default. The result can be thread deadlock, incorrect behaviour, or process crashes. This is detailed here.

The workaround is to force the WSGI application to run in the main interpreter of the process using:

WSGIApplicationGroup %{GLOBAL}

If you are running multiple WSGI applications on the same server, you will want to start investigating daemon mode, because some frameworks don't allow multiple instances to run in the same interpreter. This is the case with Django. Thus, use daemon mode so each application runs in its own process, and force each to run in the main interpreter of its respective daemon mode process group.

The ppi-css.conf file with the required changes is now part of the repository.

Some Debugging Hints

Inevitably, things won't work on your remote server. Obviously leaving your application in Debug mode is ok for only the briefest time while you are trying to deploy but there are other things to check as well.

Is the web server running?

sudo service apache2 status

If it isn't or you need to restart the server:

sudo service apache2 restart

What do the apache error logs say?

cat /var/log/apache2/error.log
cat /var/www/ppi-css.com/logs/error.log 

Also, it is never a bad idea to log into MySQL and take a look at the django_test database.

Virtual Environment - Where Did It Go?

If you noticed, I did have a requirements.txt file in my project. When I started doing local development on my trusty MacBook Air, I used virtualenv, an amazing tool. However, I had some difficulties getting Numpy and Scipy properly compiled and included in the virtualenv on my server, whereas it was pretty simple to get them up and running in the system's default Python installation. When I talked this over with some of my more Django-experienced friends, they reassured me that while this wasn't a best practice, it wasn't a mortal sin either.

Getting to Know Git and Git Hub

Git or another code versioning tool is a fact of life for any developer. While the learning curve for the novice may be steep (or vertical), it is essential to climb this mountain as quickly as possible.

As powerful as Git can be, I found myself using only a few commands on this small project.

First, I used git add with several different flags to stage files before committing. To stage all new and modified files (but not deleted files), use:

git add .

To stage all modified and deleted files (but not new files), use:

git add -u

Or, if you want to be lazy and want to stage everything everytime (new, modified, and deleted files), use:

git add -A

Next, the staged files must be committed and then pushed to GitHub.

git commit -m "insert great string for documentation here"
git push -u origin master

Commands in Local Development Environment

While Django isn't the most lightweight web framework in Python (hello, Flask and others), "launching" the site in the local development environment is pretty simple. Compare the command line commands needed below to the rest of the blog. (Note that I am running OS X 10.9 Mavericks on a MacBook Air with 8 GB of 1600 MHz DDR3.)

First, start the local postgres server:

postgres -D /usr/local/var/postgres 

Next, start the local development web server using django-command-extensions, which enables debugging of the site in the browser:

python manage.py runserver_plus   

Once a model has changed, we need to make a migration using South and then apply it with the two commands below:

./manage.py schemamigration DJANGO_APP_NAME --auto
./manage.py migrate DJANGO_APP_NAME   

References

There are a ton of different tutorials out there to help you with all aspects of deployment. Of course, piecing together the relevant parts may take some time, and this tutorial was assembled from many different sources.

November Data Science DC Event Review: Identifying Smugglers: Local Outlier Detection in Big Geospatial Data

This is a guest post from Data Science DC Member and quantitative political scientist David J. Elkind.

[Image: Geospatial outliers in the Strait of Hormuz]

At the November Data Science DC Meetup, Nathan Danneman, an Emory University PhD and analytics engineer at Data Tactics, presented an approach to detecting unusual units within a geospatial data set. For me, the most enjoyable feature of Dr. Danneman’s talk was his engaging presentation. I suspect that other data consultants have also spent quite some time reading statistical articles and lost quite a few hours attempting to trace back the authors’ incoherent prose. Nathan approached his talk in a way that placed a minimal quantitative demand on the audience, instead focusing on the three essential components of his analysis: his analytical task, the outline of his approach, and the presentation of his findings. I’ll address each of these in turn.

Analytical Task

Nathan was presented with the problem of locating maritime vessels in the Strait of Hormuz engaged in smuggling activities: sanctions against Iran have made it very difficult for Iran to engage in international commerce, so improving detection of smugglers crossing the Strait from Iran to Qatar and the United Arab Emirates would improve the effectiveness of the sanctions regime and increase pressure on the regime. (I’ve written about issues related to Iranian sanctions for CSIS’s Project on Nuclear Issues Blog.)

Having collected publicly accessible satellite positioning data of maritime vessels, Nathan had four fields for each craft at several time intervals within some period: speed, heading, latitude and longitude.

But what do smugglers look like? Unfortunately, Nathan’s data set did not itself include any examples of watercraft which had been unambiguously identified by, e.g., the US Navy as smugglers, so he could not rely on historical examples of smuggling as a starting point for his analysis. Instead, he had to puzzle out how to leverage information about a craft’s spatial location.

I’ve encountered a few applied quantitative researchers who, when faced with a lack of historical examples, would be entirely stymied in their progress, declaring the problem too hard. Instead of throwing up his hands, Dr. Danneman dug into the topic of maritime smuggling and found that many smuggling scenarios involve ship-to-ship transfers of contraband which take place outside of ordinary shipping lanes. This qualitatively-informed understanding transforms the project from mere speculation about what smugglers might look like into the problem of discovering maritime vessels which deviate too far from ordinary traffic patterns.

Importantly, framing the research in this way entirely premises the validity of inferences on the notion that unusual ships are smugglers and smugglers are unusual ships. But in reality, there are many reasons that ships might not conform to ordinary traffic patterns – for example, pleasure craft and fishing boats might have irregular movement patterns that don’t coincide with shipping lanes, and so look similar to the hypothesized smugglers.

Outline of Approach

The basic approach can be split into three sections: partitioning the strait into many grids, generating fake boats to compare against the real boats, and then training a logistic regression to use the four data fields (speed, heading, latitude, and longitude) to differentiate the real boats from the fake ones.

Partitioning the strait into grids helps emphasize the local character of ships’ movements in that region. For example, a grid square partially containing a shipping channel will have many ships located in the channel and on headings that take them along that channel. Fake boats, generated from a bivariate uniform distribution within the grid square, will tend not to fall in the path of ordinary traffic, just like the hypothesized behavior of smugglers. The same goes for the uniformly distributed timestamps and otherwise randomly assigned boat attributes of the comparison set: these will all tend to stand apart from ordinary traffic. Therefore, training a model to differentiate between these two classes of behavior will advance the goal of differentiating between smugglers and ordinary traffic.

Dr. Danneman described this procedure as unsupervised-as-supervised learning – a novel term for me, so forgive me if I’m loose with the particulars – but in this case it refers to the notion that there are two classes of data points, one i.i.d. from some unknown density and another simulated via Monte Carlo methods from some known density. Pooling both samples gives one a mixture of the two densities; the problem then becomes one of comparing the relative densities of the two classes of data points – that is, this problem is actually a restatement of the problem of logistic regression! Additional details can be found in Elements of Statistical Learning (2nd edition, section 14.2.4, p. 495).
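
To make the unsupervised-as-supervised idea concrete, here is an illustrative Python sketch using scikit-learn; the ship attributes and grid-square ranges below are synthetic stand-ins, not Dr. Danneman's actual data or code:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "real" observations for one grid square: (lat, lon, speed, heading)
real = np.column_stack([
    rng.normal(26.5, 0.05, 500),   # latitude clustered along a shipping lane
    rng.normal(56.2, 0.05, 500),   # longitude
    rng.normal(12.0, 2.0, 500),    # speed (knots)
    rng.normal(90.0, 10.0, 500),   # heading (degrees)
])

# Reference sample: fake ships drawn uniformly over the same ranges
fake = np.column_stack([
    rng.uniform(real[:, j].min(), real[:, j].max(), 500) for j in range(real.shape[1])
])

X = np.vstack([real, fake])
y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])  # 1 = real, 0 = fake

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Real ships the model scores as barely more "real" than uniform noise are the outliers
scores = clf.predict_proba(real)[:, 1]
print(np.argsort(scores)[:10])   # indices of the ten most outlying real ships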

Presentation of Findings

After fitting the model, we can examine which of the real boats the model rated as having low odds of being real – that is, boats which looked so similar to the randomly generated boats that the model had difficulty differentiating the two. These are the boats that we might call “outliers,” and, given the premise that ship-to-ship smuggling likely takes place aboard boats with unusual behavior, they are more likely to be engaged in smuggling.

I will repeat here a slight criticism that I noted elsewhere and point out that the model output cannot be interpreted as a true probability, contrary to the results displayed in slide 39. In this research design, Dr. Danneman did not randomly sample from the population of all shipping traffic in the Strait of Hormuz to assemble a collection of smuggling craft and ordinary traffic in proportion roughly equal to their occurrence in nature. Rather, he generated one fake boat for each real boat. This is a case-control research design, so the intercept term of the logistic regression model is fixed to reflect the ratio of positive cases to negative cases in the data set. All of the terms in the model, including the intercept, are still MLE estimators, and all of the non-intercept terms are perfectly valid for comparing the odds of an observation being in one class or another. But to establish probabilities, one would have to replace the intercept term with knowledge of the overall ratio of positives to negatives in the population.

In the question-and-answer session, some in the audience pushed back against the limited data set, noting that one could improve the results by incorporating other information specific to each of the ships (its flag, its shipping line, the type of craft, or other pieces of information). First, I believe that all applications would leverage this information – were it available – and model it appropriately; however, as befits a pedagogical talk on geospatial outlier detection, this talk focused on leveraging geospatial data for outlier detection.

Second, it should be intuitive that including more information in a model might improve the results: the more we know about the boats, the more we can differentiate between them. Collecting more data is, perhaps, the lowest-hanging fruit of model improvement. I think it’s more worthwhile to note that Nathan’s highly parsimonious model achieved very clean separation between fake and real boats despite the limited amount of information collected for each boat-time unit.

The presentation and code may be found on Nathan Danneman's web site. The audio for the presentation is also available.

Summer research project overview: “Supporting Social Data Research with NoSQL Databases: Comparison of HBase and Riak”

This post is a guest post by an aspiring young member of the Data Community DC who also happens to be looking for an internship. Please see his contact info at the bottom of the post if interested, and also check out his blog, http://hex00sells.com/. This summer, I was accepted into an NSF-funded (National Science Foundation), two-month research project at Indiana University in Bloomington. Sponsored by the School of Informatics & Computing (SOIC), the program is designed to expose undergraduate students majoring in STEM fields to research taking place at the graduate level, in the hopes that they will elect to continue their education past the BS level. These programs are also often called REUs (Research Experiences for Undergraduates) and are offered nationwide at various times throughout the year.

Seeing as how I’ve been interested in Big Data/Data Science for some time now, and have wanted to learn more about the field, being accepted to this program was a dream come true! I could only hope that it would deliver on everything it promised, which it definitely did.  The program was designed to immerse students as much as possible in normal IU life—for two months, it was as if we were enrolled there.  We lived in IU on-campus dorms, had IU student IDs, access to all facilities, and meal cards that we could use at the dining halls.  The program coordinators also planned a great series of events designed to expose us to campus facilities, resources, and local attractions.  We also were able to tour the IU Data Center, which houses Big Red II (one of the two fastest university-owned supercomputers in the US).

[Photos: Big Red II at the IU Data Center]

But on to the project! Entitled “Supporting Social Data Research with NoSQL Databases: Comparison of HBase and Riak,” its aim was to compare the performance of two open-source NoSQL databases: Apache’s Hadoop/HBase and Basho’s Riak platforms. We wished to compare them using a subset of the dataset of the Truthy Project—a project which gathered approximately 10% of all Twitter traffic (using the Twitter Streaming API) over a two-year period in order to create a repository (i.e., a gigantic data set) with which social research scientists could conduct research. Several interesting research papers have come out of the Truthy Project, such as papers investigating the flow and dissemination of information during Occupy Wall Street and the Arab Spring uprisings. In addition, the project researchers have also made very interesting images that portray information diffusion through users’ Twitter networks:

[Image: Truthy Project visualization of information diffusion through Twitter networks]

Our project’s goal (link to project page) was to first set up and then configure both Hadoop/HBase and Riak, in keeping with the data set we were working with and the types of queries we’d be performing.  I came in on the second half of the project—the first half, setting up and configuring Hadoop/HBase, and then loading data onto the nodes and running queries on it, had already been completed, so my portion dealt nearly exclusively with the Riak side of things. We had reserved eight nodes on IU’s FutureGrid cluster, a shared academic cloud computing resource that researchers can use for their projects.  Our nodes were running CentOS (an enterprise Linux distro).  Some of the configuration details included creating schemas for our data and managing the data-loading onto our nodes.

Next, we went on to perform some of the same search queries on our two databases that the Truthy Project researchers performed. These queries might be for a particular Twitter meme over various periods of time (a few hours, a day, five days, 10 days, etc.), the number of posts a given user made over a certain period of time, or the number of user mentions a user received, just to name a few example queries. Due to the size of the data set (one month’s worth of tweets—350 GB), these queries might easily return results on the order of tens of thousands to hundreds of thousands of hits! As these are open-source platforms, there are a plethora of configuration options and possibilities, so part of our job was to optimize each platform to deliver the fastest results for the types of queries we were executing. For Riak, there are multiple ways to execute searches, some of which involved using MapReduce, so that added another layer of complexity—we had to learn how Riak executed MapReduce jobs and then optimize for them accordingly.
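
As a toy illustration of the kind of query described above (the tweet records and field layout here are made up and do not reflect the Truthy schema), a meme count over a time window can be expressed in map/reduce style in plain Python:

from collections import Counter
from functools import reduce

# Made-up tweet records (timestamp, text) standing in for the Truthy data set
tweets = [
    ("2012-05-01T09:15:00", "Happy Friday! #ff @someone"),
    ("2012-05-02T11:00:00", "No memes here"),
    ("2012-05-06T18:30:00", "#ff my favorite data folks"),
]

def map_phase(tweet, meme="#ff", start="2012-05-01", end="2012-05-10"):
    """Emit a count of 1 if the tweet contains the meme and falls inside the window."""
    timestamp, text = tweet
    in_window = start <= timestamp[:10] <= end
    return Counter({meme: 1}) if (meme in text and in_window) else Counter()

def reduce_phase(left, right):
    return left + right

print(reduce(reduce_phase, (map_phase(t) for t in tweets), Counter()))  # Counter({'#ff': 2})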

Our results were very interesting and seemed to highlight some of the differences between the two platforms (at least at the time of this project). We found Riak to be faster for queries returning < 100k results over shorter time periods, but Hadoop/HBase was faster for queries returning > 100k results over longer lengths of time. For Riak, some of this difference seemed to be attributable to how it handled returning results—they seemed to be gathered from all the nodes unsorted/ungrouped and streamed through just one reducer on one node exclusively, thus possibly creating the “bottlenecking” we observed once result sets grew beyond a certain size. Hadoop/HBase seemed to provide a more robust implementation of MapReduce, allowing for much more individual MapReduce node configuration and a design that seems to scale very well for large data sets.

Below is a graph which illustrates the results of one meme search conducted across different time ranges on both Riak and HBase:

[Histogram: search times for the “#ff” meme on Riak vs. HBase across 5-, 10-, 15-, and 20-day ranges]

This histogram details the results for all tweets containing the meme “#ff” over periods of five, 10, 15, and 20 days. The number of results returned is next to each search period (for example, the five-day search for “#ff” on Riak returned, in 126 seconds, 353,468 tweets that contained the search term). It is clear that, in this case, Riak is faster over a shorter period of time, but as both the search periods and result sizes increase, HBase scales to be far faster, exhibiting a much smaller, linear rate of growth than Riak. If anyone would like to view my research poster for this project, here is a link.

In conclusion, this was a wonderful experience and only served to whet my appetite for future learning.  Just some of the languages, technologies, and skills I learned or were exposed to were: Python, Javascript, shell scripting, MapReduce, NoSQL databases, and cloud computing.  I leave you with a shot of the beautiful campus:

[Photo: the Indiana University Bloomington campus]

Evan Roth is pursuing a Computer Science degree full-time at the University of the District of Columbia in Washington, DC.  He also works part-time for 3Pillar Global in Fairfax, VA as an iOS Developer. His LinkedIn profile is here.


What is a Hackathon?

[Image: National Day of Civic Hacking]

Most people have probably heard of the term hackathon in today's technology landscape. But just because you have heard of it doesn't necessarily mean you know what it is. Wikipedia defines a hackathon as an event in which computer programmers and others involved in software development collaborate intensively on software projects. They can last for just a few hours or go as long as a week, and they come in all shapes and sizes. There are national hackathons, state-level hackathons, internal corporate hackathons, open data-focused hackathons, and hackathons meant to flesh out APIs.

Hackathons can also operate on local levels. A couple of examples of local-level hackathons are the Baltimore Neighborhood Indicators Alliance-Jacob France Institute (BNIA-JFI) hackathon, which just completed its event in mid July, and the Jersey Shore Comeback-a-thon hackathon this past April hosted by Marathon Data Systems.

The BNIA-JFI hackathon was paired with a Big Data Day event and lasted for 8 hours one Friday. The goal of this hackathon was to come up with solutions that help out two local Baltimore neighborhoods: Old Goucher and Pigtown. The two each had a very clear goal established at the beginning of the hackathon but the overall goal was simple: help improve the local community. The data was provided to participants from the hackathon sponsors.

The Jersey Shore Comeback-a-thon hackathon lasted 24 hours straight, taking some participants back to their college days of pulling all-nighters. This hackathon also differed from the Baltimore hackathon in that it did not seem to provide data but instead focused on the pure technology of an application and how it can be used to achieve the goal of alerting locals and tourists that the Jersey shore is open for business.

Both of these events are great examples of local hackathons that look to give back to the community. However, if you want to learn more about federal level hackathons, open government data, or state level hackathons, please join Data Science MD at their next event titled Open Data Breakfast this Friday, August 9th, 8AM at Unallocated, a local Maryland hackerspace.

Why You Should Not Build a Recommendation Engine

[Image: One does not simply build an MVP with a recommendation engine]

Recommendation engines are arguably one of the trendiest uses of data science in startups today. How many new apps have you heard of that claim to "learn your tastes"? However, recommendation engines are widely misunderstood, both in terms of what is involved in building one and in terms of what problems they actually solve. A true recommender system involves some fairly hefty data science -- it's not something you can build by simply installing a plugin without writing code. With the exception of very rare cases, it is not the killer feature of your minimum viable product (MVP) that will make users flock to you -- especially since there are so many fake and poorly performing recommender systems out there.

A recommendation engine is a feature (not a product) that filters items by predicting how a user might rate them. It solves the problem of connecting your existing users with the right items in your massive inventory (i.e., tens of thousands to millions) of products or content, which means that if you don't have existing users and a massive inventory, a recommendation engine does not truly solve a problem for you. If I can view the entire inventory of your e-commerce store in just a few pages, I really don't need a recommendation system to help me discover products! And if your e-commerce store has no customers, who are you building a recommendation system for? It works for Netflix and Amazon because they have untold millions of titles and products and a large existing user base who are already there to stream movies or buy products. Presenting users with recommended movies and products increases usage and sales, but doesn't create either to begin with.

There are two basic approaches to building a recommendation system: the collaborative filtering method and the content-based approach. Collaborative filtering algorithms take user ratings or other user behavior and make recommendations based on what users with similar behavior liked or purchased. For example, a widely used technique in the Netflix Prize was to use machine learning to build a model that predicts how a user would rate a film based solely on the giant sparse matrix of how 480,000 users rated 18,000 films (100 million data points in all). This approach has the advantage of not requiring an understanding of the content itself, but does require a significant amount of data, ideally millions of data points or more, on user behavior. The more data the better. With little or no data, you won't be able to make recommendations at all -- a pitfall of this approach known as the cold-start problem. This is why you cannot use this approach in a brand new MVP.
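
For a sense of what the collaborative filtering approach involves, here is a deliberately tiny sketch of user-based collaborative filtering in Python; the rating matrix is invented, and production systems use far more sophisticated models than this nearest-neighbor average:

import numpy as np

# Toy user-item rating matrix (rows = users, columns = films); 0 means "not rated"
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Similarity between two users computed over the items both have rated."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    return float(a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])))

def predict_rating(R, user, item):
    """Weighted average of other users' ratings for `item`, weighted by similarity."""
    sims = np.array([cosine_sim(R[user], R[v]) if v != user and R[v, item] > 0 else 0.0
                     for v in range(R.shape[0])])
    if sims.sum() == 0:
        return 0.0   # the cold-start problem: nobody similar has rated this item
    return float(sims @ R[:, item] / sims.sum())

print(predict_rating(R, user=0, item=2))   # user 0's predicted rating for film 2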

The content-based approach requires deep knowledge of your massive inventory of products. Each item must be profiled based on its characteristics. For a very large inventory (the only type of inventory you need a recommender system for), this process must be automatic, which can prove difficult depending on the nature of the items. A user's tastes are then deduced based on their ratings, their behavior, or directly entered information about their preferences. The pitfalls of this approach are that an automated classification system could require significant algorithmic development and is likely not available as a commodity technical solution. Second, as with the collaborative filtering approach, the user needs to input information on their personal tastes, though not on the same scale. One advantage of the content-based approach is that it doesn't suffer from the cold-start problem -- even the first user can gain useful recommendations if the content is classified well. But the benefit that recommendations offer to the user must justify the effort required to offer input on personal tastes. That is, the recommendations must be excellent, and the effort required to enter personal preferences must be minimal and ideally baked into general usage. (Note that if your offering is an e-commerce store, this data entry amounts to adding a step to your funnel and could hurt sales more than it helps.) One product that has been successful with this approach is Pandora. Based on naming a single song or artist, Pandora can recommend songs that you will likely enjoy. This is because a single song title offers hundreds of points of data via the Music Genome Project. The effort required to classify every song in the Music Genome Project cannot be overstated -- it took 5 years to develop the algorithm and classify the inventory of music offered in the first launch of Pandora. Once again, this is not something you can do with a brand new MVP.
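
As a counterpart to the earlier sketch, here is an equally simplified illustration of the content-based approach; the attribute vectors are made up and merely stand in for something like the Music Genome Project's hand-curated song profiles:

import numpy as np

# Made-up item profiles: rows = songs, columns = attributes (tempo, acousticness, vocals)
items = np.array([
    [0.9, 0.1, 0.8],   # song A
    [0.8, 0.2, 0.7],   # song B
    [0.1, 0.9, 0.2],   # song C
])
song_names = ["song A", "song B", "song C"]

def recommend(seed_index, items, k=1):
    """Recommend the k items most similar (by cosine similarity) to a single seed item."""
    seed = items[seed_index]
    sims = items @ seed / (np.linalg.norm(items, axis=1) * np.linalg.norm(seed))
    sims[seed_index] = -np.inf   # never recommend the seed itself
    return [song_names[i] for i in np.argsort(sims)[::-1][:k]]

# A single named song is enough to bootstrap recommendations -- no other users required
print(recommend(seed_index=0, items=items))   # -> ['song B']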

Pandora may be the only example of a successful business where the recommendation engine itself is the core product, not a feature layered onto a different core product. Unless you have the domain expertise, algorithm development skill, massive inventory, and frictionless user data entry design to build your vision of the Pandora for socks / cat toys / nail polish / etc., your recommendation system will not be the milkshake that brings all the boys to the yard. Instead, you should focus on building your core product, optimizing your e-commerce funnel, growing your user base, developing user loyalty, and growing your inventory. Then, maybe one day, when you are the next Netflix or Amazon, it will be worth it to add a recommendation system to increase your existing usage and sales. In the meantime, you can drive serendipitous discovery simply by offering users a selection of most popular content or editor's picks.