text processing

DC NLP February Meetup Announcement: Sentiment Analysis

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP February Meetup!

This month, we're featuring a single speaker with a presentation related to sentiment analysis:


Nathan Danneman received his PhD in political science from Emory University, with focuses in international conflict and applied statistics. He currently works as a Data Scientist at Data Tactics, where he works on a range of topics including geospatial analysis, outlier detection, and text analysis. He will present Item Response Theory models as a means of unsupervised sentiment analysis.

DC NLP February Meetup

Wednesday, February 12, 2014
6:30 PM to 8:30 PM
Stetsons Famous Bar & Grill
1610 U Street Northwest, Washington, DC


The DC NLP meetup group is for anyone in the Washington, D.C. area working in (or interested in) Natural Language Processing. Our meetings will be an opportunity for folks to network, give presentations about their work or research projects, learn about the latest advancements in our field, and exchange ideas or brainstorm. Topics may include computational linguistics, machine learning, text analytics, data mining, information extraction, speech processing, sentiment analysis, and much more.

For more information and to RSVP, please visit: http://www.meetup.com/DC-NLP/events/154934032/


Getting Ready to Teach the Elephant to Read: A Strata + Hadoop World 2013 Tutorial

We (Ben Bengfort and Sean Murphy) are very excited to be holding the Teaching the Elephant to Read tutorial at the sold-out Strata + Hadoop World 2013 on Monday, the 28th of October. We will be discussing and using numerous software packages that can be time-consuming to install on various operating systems and laptops. If you are taking our tutorial, we strongly encourage you to set aside an hour or two this weekend to follow the instructions below to install and configure the virtual machine needed for the class. The instructions have been tested and debugged, so you shouldn't have too many problems (famous last words ;).

Important Notes

Please note that

  1. you will need a 64-bit machine and operating system for this tutorial. The virtual machine/image that we will be building and using has been tested on Mac OS X (up through Mavericks) and 64-bit Windows.
  2. this process could take an hour or longer depending on the bandwidth of your connection as you will need to download approximately 1 GB of software.

1) Install and Configure your Virtual Machine

First, you will need to install Virtual Box, free software from Oracle. Go here to download the 64-bit version appropriate for your machine.

Download Virtual Box

Once Virtual Box is installed, you will need to grab an Ubuntu x64 Server 12.04 LTS image, which you can do directly from Ubuntu here.

Ubuntu Image

There are numerous tutorials online for creating a virtual machine from this image with Virtual Box. We recommend that you configure your virtual machine with at least 1GB of RAM and a 12 GB hard drive.

2) Setup Linux

First, let's create a user account with admin privileges with username "hadoop" and the very creative password "password."

username: hadoop
password: password

Honestly, you don't have to do this. If you have a user account that can already sudo, you are good to go and can skip ahead to installing some software. But if not, use the following commands.

~$ sudo adduser hadoop
~$ sudo usermod -a -G sudo hadoop
~$ sudo adduser hadoop sudo

Log out and log back in as "hadoop."

Now you need to install some software.

~$ sudo apt-get update && sudo apt-get upgrade
~$ sudo apt-get install build-essential ssh avahi-daemon
~$ sudo apt-get install vim lzop git
~$ sudo apt-get install python-dev python-setuptools libyaml-dev
~$ sudo easy_install pip

The above installs may take some time.

At this point you should probably generate some ssh keys (for hadoop and so you can ssh in and get out of the VM terminal.)

~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
[… snip …]

Make sure that you leave the passphrase blank; hadoop will need the keys if you're setting up a cluster for more than one user. Also note that it is good practice to keep the administrator separate from the hadoop user, but since this is a development cluster, we're just taking a shortcut and leaving them the same.

One final step: copy the public key so that it is authorized for ssh.

~$ cp .ssh/id_rsa.pub .ssh/authorized_keys

You can download this key and use it to ssh into your virtual environment if needed.

3) Install and Configure Hadoop

Hadoop requires Java - and since we're using Ubuntu, we're going to use OpenJDK rather than Sun because Ubuntu doesn't provide a .deb package for Oracle Java. Hadoop supports OpenJDK with a few minor caveats: java versions on hadoop. If you'd like to install a different version, see installing java on hadoop.

~$ sudo apt-get install openjdk-7-jdk

Do a quick check to make sure you have the right version of Java installed:

~$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

Now we need to disable IPv6 on Ubuntu: there is a known issue where Hadoop, when it binds to a network address, also binds to the IPv6 address. This isn't too hard: simply edit the /etc/sysctl.conf file (with the editor of your choice; I prefer vim) using the following command

sudo vim /etc/sysctl.conf

and add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Unfortunately, you'll have to reboot your machine for this change to take effect. You can then check the status with the following command (0 is enabled, 1 is disabled):

~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
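If you'd rather script this check, here is a minimal Python 3 sketch. The helper names `ipv6_disabled` and `read_ipv6_flag` are made up for illustration; the code only interprets the flag value described above (0 is enabled, 1 is disabled).

```python
from pathlib import Path

# Path taken from the `cat` command in this tutorial.
FLAG_PATH = Path("/proc/sys/net/ipv6/conf/all/disable_ipv6")


def ipv6_disabled(flag_text):
    """Interpret the kernel flag: "1" means IPv6 is disabled, "0" enabled."""
    value = flag_text.strip()
    if value not in ("0", "1"):
        raise ValueError("unexpected flag value: %r" % value)
    return value == "1"


def read_ipv6_flag(path=FLAG_PATH):
    """Return True/False from the live system, or None if the file is absent."""
    try:
        return ipv6_disabled(path.read_text())
    except OSError:
        return None  # e.g. not running on Linux


print(ipv6_disabled("1\n"))  # True
```

Calling `read_ipv6_flag()` inside the VM after the reboot should return True if the sysctl change took hold.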

And now we're ready to download Hadoop from the Apache Download Mirrors. Hadoop versions are a bit goofy (see An Update on Apache Hadoop 1.0): as of October 15, 2013, release 2.2.0 is available, but the stable version is still listed as version 1.2.1.

Go ahead and unpack in a location of your choice. We've debated internally what directory to place Hadoop and other distributed services like Cassandra or Titan in- but we've landed on /srv thanks to this post. Unpack the file, change the permissions to the hadoop user and then create a symlink from the version to a local hadoop link. This will allow you to set any version to the latest hadoop without worrying about losing versioning.

/srv$ sudo tar -xzf hadoop-1.2.1.tar.gz
/srv$ sudo chown -R hadoop:hadoop hadoop-1.2.1
/srv$ sudo ln -s $(pwd)/hadoop-1.2.1 $(pwd)/hadoop

Now we have to configure some environment variables so that everything executes correctly. While we're at it, we'll create a couple of aliases in our bash profile to make our lives a bit easier. Edit the ~/.profile file in your home directory and add the following to the end:

# Set the Hadoop Related Environment variables
export HADOOP_PREFIX=/srv/hadoop

# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin

# Some helpful aliases

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
alias ..="cd .."
alias ...="cd ../.."

# Preview the first 1000 lines of an LZO-compressed file stored in HDFS
lzohead() {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}

We'll continue configuring the Hadoop environment. Edit the following files in /srv/hadoop/conf/:


# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
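
The other files in this directory, such as core-site.xml, are XML documents made of name/value properties. Here is a minimal Python sketch of generating one. It assumes `hadoop.tmp.dir` is the property that points Hadoop at the /app/hadoop/tmp directory used in this tutorial. The helper `render_site_xml` is hypothetical, and a working single-node setup needs further properties that are not shown here.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a Hadoop-style <configuration> document from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumption: hadoop.tmp.dir is the setting that holds the storage directory.
core_site = render_site_xml({"hadoop.tmp.dir": "/app/hadoop/tmp"})
print(core_site)
```

Treat the output as a starting point to compare against the Apache single node setup guide, not as a complete configuration.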








That's it, configuration over! But before we get going, we have to format the distributed filesystem in order to use it. We'll store our file system in the /app/hadoop/tmp directory as per Michael Noll and as we set in the core-site.xml configuration. We'll have to set up this directory and then format the name node.

/srv$ sudo mkdir -p /app/hadoop/tmp
/srv$ sudo chown -R hadoop:hadoop /app/hadoop
/srv$ sudo chmod -R 750 /app/hadoop
/srv$ hadoop namenode -format
[… snip …]

You should now be able to run Hadoop's start-all.sh command to start all the relevant daemons:

/srv$ hadoop-1.2.1/bin/start-all.sh
starting namenode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-namenode-ubuntu.out
localhost: starting datanode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out
starting jobtracker, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-tasktracker-ubuntu.out

And you can use the jps command to see what's running:

/srv$ jps
1321 NameNode
1443 DataNode
1898 Jps
1660 JobTracker
1784 TaskTracker
1573 SecondaryNameNode
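
If you want to confirm that nothing is missing without eyeballing the list, a small Python sketch like the one below can parse the jps output. The expected daemon names are taken from the listing above; the function name `missing_daemons` is just an illustration.

```python
# Daemons that appear in the jps listing for this single-node setup.
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "JobTracker",
    "TaskTracker",
}


def missing_daemons(jps_output):
    """Return the expected daemons that do not appear in the jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # "<pid> <process name>"
            running.add(parts[1])
    return EXPECTED_DAEMONS - running


sample = """1321 NameNode
1443 DataNode
1898 Jps
1660 JobTracker
1784 TaskTracker
1573 SecondaryNameNode"""

print(sorted(missing_daemons(sample)))  # []
```

On the VM you could feed it live output with `subprocess.check_output(["jps"], text=True)`.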

Furthermore, you can access the various hadoop web interfaces as follows:
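
As a rough sketch, assuming the stock Hadoop 1.x web UI ports (50070 for the NameNode, 50030 for the JobTracker, 50060 for the TaskTracker), the following Python snippet builds the URLs and can test whether each port answers. A customized install may use different ports, and the helper names are hypothetical.

```python
import socket

# Assumption: stock Hadoop 1.x default web UI ports; your install may differ.
DEFAULT_UI_PORTS = {"NameNode": 50070, "JobTracker": 50030, "TaskTracker": 50060}


def web_ui_urls(host="localhost"):
    """Build the web interface URL for each daemon on the given host."""
    return {
        name: "http://%s:%d/" % (host, port)
        for name, port in DEFAULT_UI_PORTS.items()
    }


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, url in sorted(web_ui_urls().items()):
    print(name, url)
```

Calling `port_open("localhost", 50070)` after start-all.sh is a quick way to see whether the NameNode interface is up.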

To stop Hadoop simply run the stop-all.sh command.

4) Install Python Packages and the Code for the Class

To run the code in this section, you'll need to install some Python packages as dependencies, in particular the NLTK library. The simplest way to install these packages is with the requirements.txt file that comes with the code library in our repository. We'll clone it into a directory called tutorial.

~$ git clone https://github.com/bbengfort/strata-teaching-the-elephant-to-read.git tutorial
~$ cd tutorial/code
~/tutorial/code$ sudo pip install -U -r requirements.txt
[… snip …]

However, if you simply want to install the dependencies yourself, here are the contents of the requirements.txt file:

# requirements.txt

You'll also have to download the NLTK data packages, which will install to /usr/share/nltk_data unless you set an environment variable called NLTK_DATA. The best way to install all this data is as follows:

~$ sudo python -m nltk.downloader -d /usr/share/nltk_data all

At this point the steps that are left are loading data into Hadoop.


  1. Hadoop/Java Versions
  2. Installing Java on Ubuntu
  3. An Update on Apache Hadoop 1.0
  4. Running Hadoop on Ubuntu Linux Single Node Cluster
  5. Apache: Single Node Setup

Uncovering Hidden Social Information Generates Quite a Buzz

We are pleased to have community member Greg Toth present this event review. Greg is a consultant and entrepreneur in the Washington DC area. As a consultant, he helps clients design and build large-scale information systems, process and analyze data, and solve business and technical problems. As an entrepreneur, he connects the dots between what’s possible and what’s needed, and brings people together to pursue new business opportunities. Greg is the president of Tricarta Corporation and the CTO of EIC Data Systems, Inc.

The March 2013 meetup of Data Science DC generated quite a buzz!  Well over a hundred data scientists and practitioners gathered in Chevy Chase to hear Prof. Jennifer Golbeck from the Univ. of Maryland give a very interesting – and at times somewhat startling – talk about how hidden information can be uncovered from people’s online social media activities.


Prof. Golbeck develops methods for discovering things about people online.  She opened her talk with a brief example of how bees reveal specific information to their hive’s social network through the characteristics of their “waggle dance.”  The figure eight patterns of the waggle dance convey distance and direction to pollen sources and water to the rest of the hive – which is a large social network.

Facebook Information Sharing

From there the discussion turned to how Facebook’s information sharing defaults have evolved from 2005 through 2010.  In 2005, Facebook’s default settings shared a relatively narrow set of your personal data with friends and other Facebook users.  At this point none of your information was – by default – shared with the entire Internet.

In subsequent years the default settings changed each year, sharing more and more information with a wider and wider audience.  By 2009, several pieces of your information were being shared openly with anyone on the Internet unless you had changed the default settings.  By 2010 the default settings were sharing significant amounts of information with a large swath of other people, including people you don’t even know.

The Facebook sharing information Prof. Golbeck described came from Matt McKeon’s work, which can be found here:  http://mattmckeon.com/facebook-privacy/

This ever-increasing amount of shared information has opened up new avenues for people to find out things about you, and many people may be shocked at what's possible.  Prof. Golbeck gave a live demonstration of a web site called Take This Lollipop, using her own Facebook account.  I won’t spoil things by telling you what it does, but suffice it to say it was quite startling.  If this piques your interest, check out www.takethislollipop.com

Predicting Personality Traits

From there the discussion shifted to a research project intended to determine whether it's possible to predict people's personality traits by analyzing what they put on social media.  First, a group of research participants were asked to identify their core personality traits by going through a standardized psychological evaluation.  The Big Five factors that they measured are openness, conscientiousness, extraversion, agreeableness, and neuroticism.

Next the research team gathered information from these people’s Facebook and Twitter accounts, including language features (e.g. words they use in posts), personal information, activities and preferences, internal Facebook stats, and other factors.  Tweets were processed in an application called LIWC, which stands for Linguistic Inquiry and Word Count.  LIWC is a text analysis program that examines a piece of text and the individual words it contains, and computes numeric values for positive and negative emotions as well as several other factors.
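
To give a flavor of dictionary-based word counting, here is a toy Python sketch. It is not LIWC: the two word lists are hypothetical stand-ins for LIWC's much larger dictionaries, and the scoring (share of words per category) is an assumption made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; LIWC's real dictionaries are not reproduced here.
LEXICON = {
    "positive_emotion": {"happy", "love", "great", "good"},
    "negative_emotion": {"sad", "hate", "awful", "bad"},
}


def category_rates(text):
    """Return, per category, the share of words in `text` that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for category, vocabulary in LEXICON.items():
            if word in vocabulary:
                counts[category] += 1
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {category: counts[category] / total for category in LEXICON}


print(category_rates("I love this great talk but hate the awful traffic"))
```

Numeric features like these, one per category, are the kind of input a downstream prediction model can consume.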

The data gathered from Twitter and Facebook was fed into a personality prediction algorithm developed by the research team and implemented using the Weka machine learning toolkit.  Predicted personality trait values from the algorithm were compared to the original Big Five assessment results to evaluate how well the prediction model performed.  Overall, the difference between predicted and measured personality traits was roughly 10 to 12% for Facebook (considered very good) and roughly 12 to 18% for Twitter (not quite as good).  The overall conclusion was that yes, it is possible to predict personality traits by analyzing what people put on social media.
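
The talk summary does not say exactly how the difference between predicted and measured traits was computed. One plausible reading, shown here purely as an assumption, is a mean absolute error on trait scores normalized to the 0 to 1 range. The scores below are made-up illustration data, not results from the study.

```python
def mean_absolute_error(predicted, measured):
    """Average absolute gap between predicted and measured scores."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    gaps = [abs(p - m) for p, m in zip(predicted, measured)]
    return sum(gaps) / len(gaps)


# Hypothetical scores for one trait, normalized to the 0-1 range.
measured = [0.60, 0.35, 0.80, 0.50]
predicted = [0.70, 0.25, 0.65, 0.55]

error = mean_absolute_error(predicted, measured)
print("%.1f%%" % (100 * error))  # 10.0%
```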

Predicting Political Preferences

The second research project was about computing political preference in Twitter audiences.  Originally this project started with the intention of looking at the Twitter feeds of news media outlets and trying to predict media bias.  However, the topic of media bias in general was deemed too problematic and controversial and they decided instead to focus on predicting the political preferences of the media audiences.

The objective was to come up with a method for computing the political orientation of people who followed popular news media outlets on Twitter.  To do this, the team computed the political preference of about 1 million Twitter users by finding which Congresspeople they followed on Twitter, and looking at the liberal to conservative ratings of those Congresspeople.  A key assumption was that people's political preferences will, on average, reflect those of the Congresspeople they follow.

From there, the team looked at 20 different Twitter news outlets and identified who followed each one.  The political preferences of each media outlet's followers were composited together to compute an overall audience political preference factor ranging from heavily conservative to heavily liberal at the two extremes, with moderate ranges in the middle.  The results showed that Fox News had the most conservative audience, NPR Morning Edition had the most liberal audience, and Good Morning America was in the middle with a balanced mix of both conservative and liberal followers.  Further details on the results can be found in the paper here.
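
The two-step computation described above can be sketched in a few lines of Python. Everything here is illustrative: the ratings, users, and outlets are made up, the -1 (liberal) to +1 (conservative) scale is an assumption, and a simple mean stands in for whatever compositing the research team actually used.

```python
from statistics import mean

# Hypothetical ratings on a -1 (liberal) to +1 (conservative) scale.
CONGRESS_SCORES = {"rep_a": -0.8, "rep_b": -0.2, "rep_c": 0.6, "rep_d": 0.9}

# Hypothetical data: which Congresspeople each user follows.
FOLLOWS = {
    "user1": ["rep_a", "rep_b"],
    "user2": ["rep_c", "rep_d"],
    "user3": ["rep_b", "rep_c"],
    "user4": [],  # follows no Congresspeople, so no score can be computed
}

# Hypothetical data: which users follow each news outlet.
OUTLET_FOLLOWERS = {
    "outlet_x": ["user1", "user3", "user4"],
    "outlet_y": ["user2", "user3"],
}


def user_preference(user):
    """Average the ratings of the Congresspeople a user follows."""
    scores = [CONGRESS_SCORES[m] for m in FOLLOWS.get(user, []) if m in CONGRESS_SCORES]
    return mean(scores) if scores else None


def audience_preference(outlet):
    """Average the preferences of an outlet's followers that have a score."""
    prefs = [user_preference(u) for u in OUTLET_FOLLOWERS[outlet]]
    prefs = [p for p in prefs if p is not None]
    return mean(prefs) if prefs else None


for outlet in sorted(OUTLET_FOLLOWERS):
    print(outlet, round(audience_preference(outlet), 3))
```

Users who follow no Congresspeople are skipped here. That is a design choice of this sketch, not something reported in the talk.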

Summary & Wrap-up

An awful lot of things about you can be figured out by looking at public information in your social media streams.  Personality traits and political preferences are but two examples.  Sometimes this information can be used for beneficial purposes, such as showing you useful recommendations.  Likewise, a future employer could use this kind of information to form opinions during the hiring process.  People don't always think about this (or necessarily even realize what's possible) when they post things to social media.

Overall Prof. Golbeck’s presentation was well received and generated a number of questions and conversations after the talk.  The key takeaway was that “We know who you are and what you are thinking” and that information can be used for a variety of purposes – in most cases without you even being aware.  The situation was summed up pretty well in one of Prof. Golbeck’s opening slides:

I develop methods for discovering things about people online.

I never want anyone to use those methods on me.

-- Jennifer Golbeck

For those who want to delve deeper, several resources are available:


Overall I found this presentation to be very worthwhile and thought-provoking.  Prof. Golbeck was an engaging speaker who was both informative and entertaining.  She provided a number of useful references, links and papers for delving deeper into the topics covered.  The venue and logistics were great and there were plenty of opportunities for networking and talking with colleagues both before and after the presentation.

The topic of predicting people's traits and behaviors is very relevant, particularly in the realm of politics.  At least one other Data Science DC meetup held within the last few months focused on how data science was used in the last presidential election and the tremendous impact it had.  That trend is sure to continue, fueled by research like this coupled with the availability of data, more sophisticated tools, and the right kinds of data scientists to connect the dots and put it all together.

If you have the time, I would recommend listening to the audio recording and following along with the slide deck.  There were many more interesting details in the talk than I could cover here.

My personal opinion is that too few people realize the data footprint they leave when using social media.  That footprint has a long memory and can be used for many purposes, including purposes that haven't even been invented yet.  Many people seem to think that either the data they put on social media is trivial and doesn't reveal anything, or think that no-one cares and it's just "personal stuff."  But as we've seen in this talk, people can discover a lot more than you may think.


Event Review: Data Science MD - Teaching Machines to Read: Processing Text with NLP

Data Science MD held its second meeting, Teaching Machines to Read: Processing Text with NLP, at the fantastic Ad.com venue, part of the Under Armour campus at Hull Point. It was the group's first Baltimore-based event. Our group was fortunate to have two excellent speakers: Craig Pfeifer, a Principal Artificial Intelligence Engineer at the MITRE Corp, and Dr. Jesse English, a PhD computer scientist from UMBC who specialized in NLP, machine learning, and machine reading.

Craig discussed his experiences on a project looking at author attribution in 422 Supreme Court decisions from 2006-2008. Using the name of the author as the sole document label (all authors were in the training set), Craig extracted a large number of features (character n-grams, sentence metrics, etc.) and classified the documents with Weka's SMO, an implementation of the sequential minimal optimization algorithm for training support vector machines, with good results. Very notable were Craig's comments on the time-consuming nature of extracting text from PDFs (don't try this at home). His slides are here in PDF form.
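
For readers unfamiliar with these feature types, here is a small Python sketch of character n-gram counting and a couple of sentence metrics. It is not Craig's pipeline: the function names, the naive sentence splitter, and the sample text are assumptions for illustration, and the classifier step is left out.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def sentence_metrics(text):
    """Compute simple sentence-level statistics with a naive splitter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return {"sentence_count": len(sentences), "mean_words_per_sentence": mean_length}


sample = "The judgment is affirmed. It is so ordered."
print(char_ngrams(sample, n=3).most_common(3))
print(sentence_metrics(sample))
```

Counts like these, computed per document, would then be turned into feature vectors for a classifier such as a support vector machine.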

Dr. Jesse English introduced the audience to a brand-new open-source Python toolkit for deep semantic analysis of big data: WIMs (Weakly Inferred Meanings). WIMs, humorously capable of "answering questions like a boss," allow users to ask who/what/when/where/why and even how questions of the text. In Dr. English's own words:

A WIM is a structured meaning representation, not unlike a TMR (text meaning representation), with a limited scope in expected coverage. The scope has been limited intentionally for performance reasons – one would use a WIM rather than a TMR when the scope of coverage is sufficient and the cost (in time or development) of a full TMR is too great.

Typically, the production of a full TMR would require a domain-comprehensive syntactic-semantic lexicon and accompanying ontology (as well as a wealth of other related knowledge bases). A compilation of microtheories of meaning analysis would be required to process the text using the knowledge – both of these resources are extremely expensive to produce, and accurate processing of the text rapidly becomes unscalable without introducing domain-dependent algorithmic shortcuts.

By relying on WIMs, rather than a full TMR, the most typically relevant semantic data can still be produced in linear time with off-the-shelf knowledge resources (e.g., WordNet).

To view the presentation, click Deep Semantic Analysis of Big Data (WIMs) for the pdf.

Data Scientists Clash with Publishers – Local Expert Comment on the Debate

There is a fascinating debate raging in the world of web-scale text processing with  "[s]cientists and publishers clash[ing] over licences that would let machines read research papers."  To get started, head over to Nature here and get the story's background. Once done, come back here where our very own local subject matter expert, Ben Bengfort, has started the discussion below:

The Copyright Approach

Copyright law is consistently unclear about machines creating copies of artistic material for potential analysis. Indeed, even for simple usage, the precedent in MAI Systems Corp. v. Peak Computer, Inc. says that creating a copy of protected material in RAM from the hard disk is infringement, not to mention the copies created during network traffic. A computer system that performs analysis should be allowed to "own" a copy of the material in the same way that any other user would be, limited to the rights and responsibilities of any copy holder. Consider not just the analysis of academic material, but also music, video, and books, especially for ratings or reviews. Whether or not fair use applies, if a computer system purchases a license to copyrighted material, that computer system does not infringe on copyright law and should be allowed to "read" or analyze the contents of that material [editor: what is "reading" but the understanding and analysis of text by humans]. While the market for copyrighted material in this sense will force text-miners to pay fair market value for copyrighted material instead of creating ubiquitous crawlers, it will also allow them governed access to the text that they seek to mine.

The Science Approach

The Internet has been a vast resource for companies, especially Google, and academic institutions to mine text-based data. The free spirit of the web and the massive amount of text content have allowed data scientists to hone their skills at deep mining and clustering applications. But as our machine learning techniques and models grow ever more precise, it has become more apparent that domain-specific knowledge engineering produces the best, often surprising results rather than the shotgun approach of many different temperaments of data on the Internet. Categorized, deep-domain academic papers will clearly provide the best, most novel results for our text mining applications, and will clearly change the way that we conduct research and innovate in the future. That a roadblock as simple as copyright (and particularly copyright owners like publishers that are clinging to an old and crumbling business model) is preventing the exponential growth of human knowledge is a crime against humanity. That novel results will come from mining academic papers is not a question: Google's PageRank algorithm itself comes from an analysis of the linking of citations in academic papers.
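
To make the citation analogy concrete, here is a minimal PageRank sketch in Python, run on a made-up three-paper citation graph. It is a textbook-style power iteration with a damping factor of 0.85, not Google's production algorithm, and the paper names are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of cited nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the papers it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node (cites nothing): spread its rank over everyone.
                share = damping * rank[node] / len(nodes)
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: paper -> papers it cites.
citations = {"paper_a": ["paper_c"], "paper_b": ["paper_c"], "paper_c": []}
ranks = pagerank(citations)
for paper in sorted(ranks, key=ranks.get, reverse=True):
    print(paper, round(ranks[paper], 3))
```

The paper cited by both of the others ends up with the highest score, which is the intuition carried over from citation analysis to web links.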


Benjamin Bengfort is a full-stack data scientist with a passion for teaching machines by crunching data--the more the better. A founding partner and CTO at Unbound Concepts, he led the development of Meridian, the company's advanced text analysis engine designed to parse and predict the reading level of content for K-6 readers. With a professional background in military and intelligence and an academic background in economics and computer science, he brings a unique set of skills and insights to his work, and is currently pursuing a PhD in computer science at UMBC with a focus on machine learning techniques for natural language processing.

(Note, DataCommunityDC is an Amazon Affiliate. Thus, if you click the image in the post and buy the book, we will make approximately $0.43 and retire to a small island).