Text Analytics

Natural Language Analysis with NLTK on October 25th

Natural Language Analysis with NLTK on October 25th

Data Community DC and District Data Labs are excited to be hosting a Natural Language Analysis with NLTK workshop on October 25th  For more info and to sign up, go to http://bit.ly/1pK0pFN.  There’s even an early bird discount if you register before October 3rd!

Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python and with the Natural Language Toolkit (NLTK).

DC NLP December Meetup Announcement: Open Mic Night!

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP December Meetup! Let's experiment with something different this month for the meetup format. At past events, we've had either one 60-minute talk or three 20-minute talks. For December, we're going to try out an "open mic night" format. Here's how it'll work:

  • Everyone who wants to (time permitting) will get 5 to 10 minutes to talk about anything they like. You can give a quick talk about something you've been working on, ask a question about a problem you've been struggling with, run a demo for a interesting tool you've found, etc.
  • The only rule in terms of content is that it must be related to natural language processing.
  • You can use slides or a computer, but it's certainly not required. You can just get up in front of the group and ramble on for 5-10 minutes if you like. We won't judge you.
  • Essentially, it'll be an hour or so of impromptu lightning talks, hopefully with a high degree of audience interaction.

DC NLP December Meetup Wednesday, December 11, 2013 6:30 PM to 8:30 PM Stetsons Famous Bar & Grill 1610 U Street Northwest, Washington, DC

The DC NLP meetup group is for anyone in the Washington, D.C. area working in (or interested in) Natural Language Processing. Our meetings will be an opportunity for folks to network, give presentations about their work or research projects, learn about the latest advancements in our field, and exchange ideas or brainstorm. Topics may include computational linguistics, machine learning, text analytics, data mining, information extraction, speech processing, sentiment analysis, and much more. For more information and to RSVP, please visit: http://www.meetup.com/DC-NLP/events/142924742/

Follow us on Twitter: @DCNLP

Getting Ready to Teach the Elephant to Read: A Strata + Hadoop World 2013 Tutorial

We (Ben Bengfort and Sean Murphy) are very excited to be holding the Teaching the Elephant to Read tutorial at the sold out Strata + Hadoop World 2013 on Monday, the 28th of October. We will be discussing and using numerous software packages that can be time consuming to install on various operating systems and laptops. If you are taking our tutorial, we strongly encourage you to set aside an hour or two this weekend to follow the instructions below to install and configure the virtual machine needed for the class. The instructions have been tested and debugged so you shouldn't have too many problems (famous last words ;).

Important Notes

Please note that

  1. you will need a 64-bit machine and operating system for this tutorial. The virtual machine/image that we will be building and using has been tested on Mac OS X (up through Mavericks) and 64-bit Windows.
  2. this process could take an hour or longer depending on the bandwidth of your connection as you will need to download approximately 1 GB of software.

1) Install and Configure your Virtual Machine

First, you will need to install Virtual Box, free software from Oracle. Go here to download the 64-bit version appropriate for your machine.

Download Virtual Box

Once Virtual Box is installed, you will need to grab a Ubuntu x64 Server 12.04 LTS image and you can do that directly from Ubuntu here.

Ubuntu Image

There are numerous tutorials online for creating a virtual machine from this image with Virtual Box. We recommend that you configure your virtual machine with at least 1GB of RAM and a 12 GB hard drive.

2) Setup Linux

First, let's create a user account with admin privileges with username "hadoop" and the very creative password "password."

username: hadoop
password: password

Honestly, you don't have to do this. If you have a user account that can already sudo, you are good to go and can skip to the install some software. But if not, use the following commands.

~$ sudo adduser hadoop
~$ sudo usermod -a -G sudo hadoop
~$ sudo adduser hadoop sudo

Log out and log back in as "hadoop."

Now you need to install some software.

~$ sudo apt-get update && sudo apt-get upgrade
~$ sudo apt-get install build-essential ssh avahi-daemon
~$ sudo apt-get install vim lzop git
~$ sudo apt-get install python-dev python-setuptools libyaml-dev
~$ sudo easy_install pip

The above installs may take some time.

At this point you should probably generate some ssh keys (for hadoop and so you can ssh in and get out of the VM terminal.)

~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
[… snip …]

Make sure that you leave the password as blank, hadoop will need the keys if you're setting up a cluster for more than one user. Also note that it is good practice to keep the administrator seperate from the hadoop user- but since this is a development cluster, we're just taking a shortcut and leaving them the same.

One final step, copy allow that key to be authorized for ssh.

~$ cp .ssh/id_rsa.pub .ssh/authorized_keys

You can download this key and use it to ssh into your virtual environment if needed.

3) Install and Configure Hadoop

Hadoop requires Java - and since we're using Ubuntu, we're going to use OpenJDK rather than Sun because Ubuntu doesn't provide a .deb package for Oracle Java. Hadoop supports OpenJDK with a few minor caveats: java versions on hadoop. If you'd like to install a different version, see installing java on hadoop.

~$ sudo apt-get install openjdk-7-jdk

Do a quick check to make sure you have the right version of Java installed:

~$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

Now we need to disable IPv6 on Ubuntu- there is one issue when hadoop binds on 0.0.0.0 that it also binds to the IPv6 address. This isn't too hard: simply edit (with the editor of your choice, I prefer vim) the /etc/sysctl.conf file using the following command

sudo vim /etc/sysctl.conf

and add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Unfortunately you'll have to reboot your machine for this change to take affect. You can then check the status with the following command (0 is enabled, 1 is disabled):

~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

And now we're ready to download Hadoop from the Apache Download Mirrors. Hadoop versions are a bit goofy: an update on Apache Hadoop 1.0 however, as of October 15, 2013 release 2.2.0 is available. However, the stable version is still listed as version 1.2.1.

Go ahead and unpack in a location of your choice. We've debated internally what directory to place Hadoop and other distributed services like Cassandra or Titan in- but we've landed on /srv thanks to this post. Unpack the file, change the permissions to the hadoop user and then create a symlink from the version to a local hadoop link. This will allow you to set any version to the latest hadoop without worrying about losing versioning.

/srv$ sudo tar -xzf hadoop-1.2.1.tar.gz
/srv$ sudo chown -R hadoop:hadoop hadoop-1.2.1
/srv$ sudo ln -s $(pwd)/hadoop-1.2.1 $(pwd)/hadoop

Now we have to configure some environment variables so that everything executes correctly, while we're at it will create a couple aliases in our bash profile to make our lives a bit easier. Edit the ~/.profile file in your home directory and add the following to the end:

# Set the Hadoop Related Environment variables
export HADOOP_PREFIX=/srv/hadoop

# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin

# Some helpful aliases

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
alias ..="cd .."
alias ...="cd ../.."

lzohead() {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

We'll continue configuring the Hadoop environment. Edit the following files in /srv/hadoop/conf/:

hadoop-env.sh

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

core-site.xml



        fs.default.name
        hdfs://localhost:9000

        hadoop.tmp.dir
        /app/hadoop/tmp

hdfs-site.xml



        dfs.replication
        1

mapred-site.xml



        mapred.job.tracker
        localhost:9001

That's it configuration over! But before we get going we have to format the distributed filesystem in order to use it. We'll store our file system in the /app/hadoop/tmp directory as per Michael Noll and as we set in the core-site.xml configuration. We'll have to set up this directory and then format the name node.

/srv$ sudo mkdir -p /app/hadoop/tmp
/srv$ sudo chown -R hadoop:hadoop /app/hadoop
/srv$ sudo chmod -R 750 /app/hadoop
/srv$ hadoop namenode -format
[… snip …]

You should now be able to run Hadoop's start-all.sh command to start all the relevant daemons:

/srv$ hadoop-1.2.1/bin/start-all.sh
starting namenode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-namenode-ubuntu.out
localhost: starting datanode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out
starting jobtracker, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-tasktracker-ubuntu.out

And you can use the jps command to see what's running:

/srv$ jps
1321 NameNode
1443 DataNode
1898 Jps
1660 JobTracker
1784 TaskTracker
1573 SecondaryNameNode

Furthermore, you can access the various hadoop web interfaces as follows:

To stop Hadoop simply run the stop-all.sh command.

4) Install Python Packages and the Code for the Class

To run the code in this section, you'll need to install some Python packages as dependencies, and in particular the NLTK library. The simplest way to install these packages is with the requirements.txt file that comes with the code library in our repository. We'll clone it into a repository called tutorial.

~$ git clone https://github.com/bbengfort/strata-teaching-the-elephant-to-read.git tutorial
~$ cd tutorial/code
~$ sudo pip install -U -r requirements.txt
[… snip …]

However, if you simply want to install the dependencies yourself, here are the contents of the requirements.txt file:

# requirements.txt
PyYAML==3.10
dumbo==0.21.36
language-selector==0.1
nltk==2.0.4
numpy==1.7.1
typedbytes==0.3.8
ufw==0.31.1-1

You'll also have to download the NLTK data packages which will install to /usr/share/nltk_data unless you set an environment variable called NLTK_DATA. The best way to install all this data is as follows:

~$ sudo python -m nltk.downloader -d /usr/share/nltk_data all

At this point the steps that are left are loading data into Hadoop.

References

  1. Hadoop/Java Versions
  2. Installing Java on Ubuntu
  3. An Update on Apache Hadoop 1.0
  4. Running Hadoop on Ubuntu Linux Single Node Cluster
  5. Apache: Single Node Setup

11 Most Interesting #DCTech Tweets According to Synapsify

In July, Peter Corbett of iStrategyLabs shared a data set of 65,000 tweets and facebook statuses from the DC tech community which we visualized with a word cloud. Now we've used technology from local startup, Synapsify to find the most meaningful and interesting tweets in the dataset. Synapsify uses algorithms based on patterns in music to analyze sounds of sentiment and metaphor in written language. For more about Synapsify, come to tonight's DC NLP Meetup, where I'll be speaking about the Synapsify algorithm in more depth. The following tweets are the least cheerleader-ey and highlight the most important themes for the DC tech community.

 

DC Natural Language Processing is Back

The following is a guest post from Charlie Greenbacker, organizer of the DC NLP meetup. Curious about techniques and methods for applying data science to unstructured text?

The DC NLP meetup group is for anyone in the Washington, D.C. area working in, or interested in learning about, Natural Language Processing (NLP). Our meetings will be an opportunity for folks to network, give presentations about their work or research projects, learn about the latest advancements in our field, and exchange ideas or brainstorm. Topics may include computational linguistics, machine learning, text analytics, data mining, information extraction, speech processing, sentiment analysis, and much more.

Building on the tremendous success of our inaugural meetup in April (cross-listed with our friends at Data Science DC), we're ready to strike out on our own and start holding regular monthly meetups. We've recently made arrangements with the great folks at Stetsons Famous Bar & Grill in Adams Morgan to host our meetups in their upstairs bar on the second Wednesday of each month. We're truly excited to see you all again and will kick things off with a great set of presentations.

At our next meetup on Wednesday, Sept 11, we'll host three 20-minute talks on various NLP topics:

• Many of our new members are just getting started in natural language processing and have indicated an interest in learning some of the basics.Charlie Greenbacker, Principal Data Scientist at Berico Technologies, will provide a "crash course" in NLP techniques and applications.

• Ben Johnson is a co-founder of OpenWIMs.org – a set of standards and implementations of knowledge-based and stochastic semantic text analyzers. Ben will briefly describe the OpenWIMs system and give a demo of the latest OpenWIMs build.

• Valerie Coffman is the founder of Feastie.com, a search engine for recipes from food blogs, and currently a developer and data scientist working with Synapsify. She will present the unique text analytics methods used by Synapsify to measure quality and themes in writing.

For more information and to RSVP, please visit the DC NLP meetup page. Follow us on Twitter: @DCNLP.