We (Ben Bengfort and Sean Murphy) are very excited to be holding the Teaching the Elephant to Read tutorial at the sold-out Strata + Hadoop World 2013 on Monday, October 28. We will be discussing and using numerous software packages that can be time-consuming to install on various operating systems and laptops. If you are taking our tutorial, we strongly encourage you to set aside an hour or two this weekend to follow the instructions below to install and configure the virtual machine needed for the class. The instructions have been tested and debugged, so you shouldn't have too many problems (famous last words ;).

## Important Notes

1. You will need a 64-bit machine and operating system for this tutorial. The virtual machine/image that we will be building and using has been tested on Mac OS X (up through Mavericks) and 64-bit Windows.
2. This process could take an hour or longer depending on the bandwidth of your connection, as you will need to download approximately 1 GB of software.

## 1) Install and Configure your Virtual Machine

First, you will need to install VirtualBox, free virtualization software from Oracle. Go here to download the 64-bit version appropriate for your machine.

Once VirtualBox is installed, you will need to grab an Ubuntu x64 Server 12.04 LTS image, and you can do that directly from Ubuntu here.


There are numerous tutorials online for creating a virtual machine from this image with VirtualBox. We recommend that you configure your virtual machine with at least 1 GB of RAM and a 12 GB hard drive.

## 2) Setup Linux

username: hadoop

Honestly, you don't have to do this. If you already have a user account that can sudo, you are good to go and can skip ahead to installing the software below. If not, use the following commands.

~$ sudo adduser hadoop
~$ sudo usermod -a -G sudo hadoop
~$ sudo adduser hadoop sudo

Log out and log back in as "hadoop." Now you need to install some software:

~$ sudo apt-get update && sudo apt-get upgrade
~$ sudo apt-get install build-essential ssh avahi-daemon
~$ sudo apt-get install vim lzop git
~$ sudo apt-get install python-dev python-setuptools libyaml-dev
~$ sudo easy_install pip


The above installs may take some time.

At this point you should probably generate some ssh keys (for Hadoop, and so that you can ssh into the VM rather than working in its console).

~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
[… snip …]

Make sure that you leave the passphrase blank; Hadoop will need the keys if you're setting up a cluster for more than one user. Also note that it is good practice to keep the administrator separate from the hadoop user, but since this is a development cluster, we're just taking a shortcut and leaving them the same. One final step: allow that key to be used for ssh authorization.

~$ cp .ssh/id_rsa.pub .ssh/authorized_keys


## 3) Install and Configure Hadoop

Hadoop requires Java, and since we're using Ubuntu, we're going to use OpenJDK rather than the Oracle JDK because Ubuntu doesn't provide a .deb package for Oracle Java. Hadoop supports OpenJDK with a few minor caveats: see the notes on Java versions on Hadoop. If you'd like to install a different version, see the notes on installing Java on Hadoop.

~$ sudo apt-get install openjdk-7-jdk

Do a quick check to make sure you have the right version of Java installed:

~$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)


Now we need to disable IPv6 on Ubuntu; there is an issue where Hadoop, when binding to 0.0.0.0, also binds to the IPv6 address. This isn't too hard to fix: simply edit the /etc/sysctl.conf file (with the editor of your choice; I prefer vim) using the following command

sudo vim /etc/sysctl.conf


and add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1


Unfortunately, you'll have to reboot your machine for this change to take effect. You can then check the status with the following command (0 means enabled, 1 means disabled):

~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

And now we're ready to download Hadoop from the Apache Download Mirrors. Hadoop's version numbering is a bit goofy; as of October 15, 2013, release 2.2.0 is available, but the stable version is still listed as 1.2.1, which is what we'll use. Go ahead and unpack it in a location of your choice. We've debated internally which directory to place Hadoop and other distributed services like Cassandra or Titan in, but we've landed on /srv thanks to this post. Unpack the file, change the permissions to the hadoop user, and then create a symlink from the versioned directory to a local hadoop link. This will allow you to point to the latest Hadoop without worrying about losing the versioned install.

/srv$ sudo tar -xzf hadoop-1.2.1.tar.gz
/srv$ sudo chown -R hadoop:hadoop hadoop-1.2.1
/srv$ sudo ln -s $(pwd)/hadoop-1.2.1 $(pwd)/hadoop
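The versioning trick above can be sketched in Python for clarity. This is an illustration only, not part of the setup; a hypothetical scratch directory stands in for /srv:

```python
import os
import tempfile

# hypothetical scratch directory standing in for /srv
srv = tempfile.mkdtemp()
os.mkdir(os.path.join(srv, "hadoop-1.2.1"))
os.mkdir(os.path.join(srv, "hadoop-2.2.0"))

# "hadoop" always points at whichever version is current
link = os.path.join(srv, "hadoop")
os.symlink(os.path.join(srv, "hadoop-1.2.1"), link)

# upgrading is just re-pointing the link; the versioned
# installs themselves are never touched
os.remove(link)
os.symlink(os.path.join(srv, "hadoop-2.2.0"), link)
current = os.readlink(link)
print(current)
```

Anything configured against the /srv/hadoop path keeps working across upgrades, since only the link moves.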


Now we have to configure some environment variables so that everything executes correctly, and while we're at it we'll create a couple of aliases in our bash profile to make our lives a bit easier. Edit the ~/.profile file in your home directory and add the following to the end:

# Set the Hadoop related environment variables
export HADOOP_PREFIX=/srv/hadoop

# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Add the Hadoop binaries to the PATH
export PATH=$PATH:$HADOOP_PREFIX/bin

# Helpful aliases
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
alias ..="cd .."
alias ...="cd ../.."

lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

We'll continue configuring the Hadoop environment. Edit the following files in /srv/hadoop/conf/:

hadoop-env.sh:

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

core-site.xml:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
    </property>
</configuration>

hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

mapred-site.xml:

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>

That's it, configuration over! But before we get going, we have to format the distributed filesystem in order to use it. We'll store our file system in the /app/hadoop/tmp directory, as per Michael Noll and as we set in the core-site.xml configuration. We'll have to set up this directory and then format the name node.

/srv$ sudo mkdir -p /app/hadoop/tmp
/srv$ sudo chown -R hadoop:hadoop /app/hadoop
/srv$ sudo chmod -R 750 /app/hadoop
/srv$ hadoop namenode -format
[… snip …]

You should now be able to run Hadoop's start-all.sh command to start all the relevant daemons:

/srv$ hadoop-1.2.1/bin/start-all.sh
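As a quick aside, since Hadoop's configuration files are plain XML, you can sanity-check them with a few lines of Python. This is only an illustrative sketch (the core-site.xml contents are inlined as a string here), not a step in the tutorial:

```python
import xml.etree.ElementTree as ET

# the core-site.xml contents from above, inlined for illustration
core_site = """
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
    </property>
</configuration>
"""

root = ET.fromstring(core_site)
# collect each <property> element as a name -> value pair
props = {
    prop.findtext("name"): prop.findtext("value")
    for prop in root.findall("property")
}
print(props["fs.default.name"])
```

A typo in a property name here is much easier to spot this way than by puzzling over a failed namenode format.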


And you can use the jps command to see what's running:

/srv$ jps
1321 NameNode
1443 DataNode
1898 Jps
1660 JobTracker
1784 TaskTracker
1573 SecondaryNameNode

Furthermore, you can access the various Hadoop web interfaces: the NameNode status page at http://localhost:50070, the JobTracker at http://localhost:50030, and the TaskTracker at http://localhost:50060. To stop Hadoop, simply run the stop-all.sh command.

## 4) Install Python Packages and the Code for the Class

To run the code in this section, you'll need to install some Python packages as dependencies, in particular the NLTK library. The simplest way to install these packages is with the requirements.txt file that comes with the code in our repository. We'll clone it into a directory called tutorial.

~$ git clone https://github.com/bbengfort/strata-teaching-the-elephant-to-read.git tutorial
~$ cd tutorial/code
~$ sudo pip install -U -r requirements.txt
[… snip …]


However, if you simply want to install the dependencies yourself, here are the contents of the requirements.txt file:

# requirements.txt
PyYAML==3.10
dumbo==0.21.36
language-selector==0.1
nltk==2.0.4
numpy==1.7.1
typedbytes==0.3.8
ufw==0.31.1-1


You'll also have to download the NLTK data packages, which will install to /usr/share/nltk_data unless you set an environment variable called NLTK_DATA. The best way to install all of this data is as follows:

~$ sudo python -m nltk.downloader -d /usr/share/nltk_data all

At this point, the only remaining step is loading data into Hadoop.

# Data Science MD July Recap: Python and R Meetup

For July's meetup, Data Science MD was honored to have Jonathan Street of NIH and Brian Godsey of RedOwl Analytics come discuss using Python and R for data analysis.

Jonathan started off by describing the growing ecosystem of Python data analysis tools, including NumPy, Matplotlib, and Pandas. He next walked through a brief example demonstrating NumPy, Pandas, and Matplotlib that he made available with the IPython notebook viewer.

The second half of Jonathan's talk focused on the problem of using clustering to identify scientific articles of interest. He needed to a) convert PDF to text, b) extract sections of the document, c) cluster, and d) retrieve new material. Jonathan used the PyPDF library for PDF conversion and then used the NLTK library for text processing. For a thorough discussion of NLTK, please see Data Community DC's multi-part series written by Ben Bengfort. Clustering was done using scikit-learn, which identified seven groups of articles. From these, Jonathan was then able to retrieve other similar articles to read. Overall, by combining several Python packages to handle text conversion, text processing, and clustering, Jonathan was able to create an automated, personalized scientific recommendation system. Please see the Data Community DC posts on Python data analysis tutorials and Python data tools for more information.

Next to speak was Brian Godsey of RedOwl Analytics, who presented on their social network analysis. He first described the problem of identifying misbehavior in a financial firm. Their goals are to detect patterns in employee behavior, measure differences between various types of people, and ultimately find anomalies in behavior.
In order to find these anomalies, they model behavior based on patterns in communications and estimate model parameters from a data set and a set of effects. Brian then revealed that while implementing their solution they developed an R package called rRevelation that allows a user to import data sets, create covariates, specify a model's behavioral parameters, and estimate the parameter values. To conclude his presentation, Brian demonstrated using the package against the well-known Enron data set and discussed how larger data sets require other technologies such as MapReduce.

http://www.youtube.com/watch?v=1rFlqD3prKE&list=PLgqwinaq-u-Piu9tTz58e7-5IwYX_e6gh

Slides can be found here for Jonathan and here for Brian.

# Beyond Preprocessing - Weakly Inferred Meanings - Part 5

Congrats! This is the final post in our six-part series! Just in case you have missed any parts, click through to the introduction, part 1, part 2, part 3, and part 4.

After you have treebanks, then what? The answer is that syntactic guessing is not the final frontier of NLP; we must go beyond to something more semantic. The idea is to determine the meaning of text in a machine-tractable way by creating a TMR, a text-meaning representation (or thematic meaning representation). This, however, is not a trivial task, and now you're at the frontier of the science.

Text-meaning representations are language-independent representations of a language unit, and can be thought of as a series of connected frames that represent knowledge. TMRs allow us to do extremely deep querying of natural language, including the creation of knowledge bases, question and answer systems, and even conversational agents. Unfortunately, they require extremely deep ontologies and knowledge to be constructed, and can be extremely process intensive, particularly in resolving ambiguity.
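To make the frame idea concrete, here is a toy, entirely invented illustration of what connected frames look like as a data structure. A real TMR is backed by a deep ontology with far richer slots; this only sketches the shape of the representation:

```python
# Invented toy frames for illustration only -- not a real TMR.
# Each frame is a bundle of slots; some slots point at other frames,
# which is what makes the representation "connected" and queryable.
frames = {
    "EVENT-1": {"type": "TRAVEL", "agent": "HUMAN-1", "destination": "CITY-1"},
    "HUMAN-1": {"type": "HUMAN", "name": "Dr. Stoddard"},
    "CITY-1": {"type": "CITY", "name": "Washington"},
}

def answer(frames, event, role):
    """Follow a role slot from an event frame to the frame it references."""
    return frames[frames[event][role]]

# deep querying: "who is traveling?" / "traveling to where?"
print(answer(frames, "EVENT-1", "agent")["name"])
print(answer(frames, "EVENT-1", "destination")["name"])
```

Resolving which frames a sentence should produce in the first place is the hard, ambiguity-laden part the post describes.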
We have created a system called WIMs -- Weakly Inferred Meanings -- that attempts to stand in the middle ground between no semantic computation at all and the extremely labor-intensive TMR. WIMs reduce the search space by using a limited but extremely important set of relations. These relations can be created using available knowledge -- WordNet has proved to be a valuable ontology for creating WIMs -- and they are extremely lightweight. Even better, they're open source!

Both TMRs and WIMs are graph representations of content, and therefore any semantic computation involving these techniques will involve graph traversal. Although there are graph databases built on top of HDFS (particularly Titan on HBase), graph computation is not the strong point of MapReduce. Hadoop, unfortunately, can only get us so far.

# Hadoop for Preprocessing Language - Part 4

We are glad that you have stuck around for this long and, just in case you have missed any parts, click through to the introduction, part 1, part 2, and part 3.

You might ask me, doesn't Hadoop do text processing extremely well? After all, the first Hadoop jobs we learn are word count and inverted index! The answer is that NLP preprocessing techniques are more complicated than splitting on whitespace and punctuation, and different tasks require different kinds of tokenization (also called segmentation or chunking). Consider the following sentence:

"You're not going to the U.S.A. in that super-zeppelin, Dr. Stoddard?"

How do you split this as a standalone sentence? If you simply used punctuation, this would segment (sentence tokenization) into five "sentences" ("You're not going to the U.", "S.", "A.", "in that super-zeppelin, Dr.", "Stoddard?"). Also, is "You're" two tokens or a single token? What about punctuation? Is "Dr. Stoddard" one token or more? How about "super-zeppelin"? N-gram analysis and other syntactic tokenization will also probably require different token lengths that go beyond whitespace.
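The segmentation failure above is easy to reproduce. Here is a minimal sketch of the naive punctuation rule in pure Python, standing in for a real sentence tokenizer:

```python
import re

text = ("You're not going to the U.S.A. in that "
        "super-zeppelin, Dr. Stoddard?")

# naive rule: a "sentence" ends at every '.', '!' or '?'
fragments = [f.strip() for f in re.findall(r"[^.!?]*[.!?]", text)]
print(fragments)
```

A trained tokenizer (such as NLTK's Punkt) would return the whole line as one sentence; the naive rule shatters it into five fragments.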
So we require some more formal NLP mechanisms even for simple tokenization. However, I propose that Hadoop might be perfect for language preprocessing. A Hadoop job creates output in the file system, so each job can be considered an NLP preprocessing task. Moreover, in many other Big Data analytics, Hadoop is used this way: last-mile computations usually occur within 100 GB of memory, and MapReduce jobs are used to transform data into something that is computable in that memory space. We will do the same thing with NLP, transforming our raw text as follows:

Raw Text → Tokenized Text → Tagged Text → Parsed Text → Treebanks

Namely, after we have tokenized our text depending on our requirements, splitting it into sentences, chunks, and tokens as required, we then want to understand the syntactic class of the tokens and tag them as such. Tagged text can then be structured into parses: structured representations of the sentence. The final output, used for training our stochastic mechanisms and for going beyond to more semantic analyses, is a treebank. Each of these tasks can be one or more MapReduce jobs. NLTK comes with a few notable built-ins making your preprocessing with Hadoop easier (you'll note all these methods are stochastic):

• The Punkt word and sentence tokenizer uses an unsupervised training set to capture the beginnings of sentences and other non-sentence-terminating marks. It doesn't require a single sentence to perform tokenization.
• The Brill tagger - a transformational rule-based tagger that does a first-pass tagging and then applies rules that were trained from a tagged training data set.
• The Viterbi parser - a dynamic programming algorithm that uses a weighted grammar to fill in a most-likely-constituent table and very quickly come up with the most likely parse.

The end result after a series of MapReduce jobs (we had six) was a treebank -- a machine-tractable syntactic representation of language; it's very important.
# Python's Natural Language Toolkit (NLTK) and Hadoop - Part 3

Welcome back to part 3 of Ben's talk about Big Data and Natural Language Processing. (Click through to see the intro, part 1, and part 2.)

We chose NLTK (Natural Language Toolkit) particularly because it's not Stanford. Stanford is kind of a magic black box, and it costs money to get a commercial license. NLTK is open source and it's Python. But more importantly, NLTK is completely built around stochastic analysis techniques and comes with data sets and training mechanisms built in. Particularly because the magic and foo of Big Data with NLP requires using your own domain knowledge and data set, NLTK is extremely valuable from a leadership perspective! And anyway, it does come with out-of-the-box NLP: use the Viterbi parser with a trained PCFG (Probabilistic Context-Free Grammar, also called a weighted grammar) from the Penn Treebank, and you'll get excellent parses immediately.

Choosing Hadoop might seem obvious given that we're talking about Data Science and particularly Big Data. But I do want to point out that the NLP tasks that we're going to talk about right off the bat are embarrassingly parallel, meaning that they are extremely well suited for the MapReduce paradigm. If you consider the unit of natural language to be the sentence, then each sentence (at least to begin with) can be analyzed on its own with little required knowledge about the surrounding processing of sentences. Combine that with the many flavors of Hadoop and the fact that you can get a cluster going in your closet for cheap, and it's the right price for a startup!

The glue to making NLTK (Python) and Hadoop (Java) play nice is Hadoop Streaming. Hadoop Streaming will allow you to create a mapper and a reducer with any executable, and expects that the executable will receive key-value pairs via stdin and output them via stdout. Just keep in mind that all other Hadoopy-ness exists, e.g.
the FileInputFormat, HDFS, and job scheduling; all you get to replace is the mapper and reducer, but this is enough to include NLTK, so you're off and running!

Here's an example of a mapper and reducer to get you started doing token counts with NLTK. (Note that these aren't word counts -- to computational linguists, words are language elements that have senses and therefore convey meaning. Instead, you'll be counting tokens, the syntactic base for words in this example, and you might be surprised to find out what tokens are -- trust me, it isn't as simple as splitting on whitespace and punctuation!)

### mapper.py

#!/usr/bin/env python

import sys
from nltk.tokenize import wordpunct_tokenize

def read_input(file):
    for line in file:
        # split the line into tokens
        yield wordpunct_tokenize(line)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for tokens in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial token count is 1
        for token in tokens:
            print '%s%s%d' % (token, separator, 1)

if __name__ == "__main__":
    main()

### reducer.py

#!/usr/bin/env python

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple token-count pairs by token,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a token (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()

### Running the Job

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
-file /home/hduser/reducer.py   -reducer /home/hduser/reducer.py \
-input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
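If you don't have a cluster handy, the Streaming contract is easy to simulate in-process: the mapper emits (key, 1) pairs, a sort stands in for the shuffle, and the reducer sums runs of identical keys. This sketch swaps NLTK's wordpunct_tokenize for a small regex so it stays dependency-free, so the tokens are only approximate:

```python
import re
from itertools import groupby
from operator import itemgetter

def map_tokens(lines):
    # stand-in for wordpunct_tokenize: runs of word characters,
    # or runs of punctuation, become tokens
    for line in lines:
        for token in re.findall(r"\w+|[^\w\s]+", line.lower()):
            yield (token, 1)

def reduce_counts(pairs):
    # Streaming hands the reducer key-sorted pairs; sorted() plays
    # the role of the shuffle here, and groupby sums each key's run
    for token, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (token, sum(count for _token, count in group))

counts = dict(reduce_counts(map_tokens(["The cat sat.", "The cat ate."])))
print(counts)
```

The reducer only works because its input arrives sorted by key, which is exactly the guarantee the Hadoop shuffle phase provides to a Streaming reducer.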

# The "Foo" of Big Data - Part 2

Welcome to Part 2 of this epic Big Data and Natural Language Processing perspective series. Here is the intro and part one if you missed any of them.

Domain knowledge is incredibly important, particularly in the context of stochastic methodologies, and particularly in NLP. Not all language, text, or domains have the same requirements, and there is no way to make a universal model for them all. Consider how the language of doctors and lawyers may be outside our experience in the language of computer science or data science. Even more generally, regions tend to have specialized dialects or phrases even within the same language. As an anthropological rule, groups tend to specialize particular language features to communicate more effectively, and attempting to capture all of these specializations leads to ambiguity.

This leads to an interesting hypothesis: the foo of big data is to combine specialized domain knowledge with an interesting data set.

Further, given that domain knowledge and an interesting data set or sets:

1. Form hypothesis (a possible data product)
2. Mix in NLP techniques and machine learning tools
3. Perform computation and test hypothesis
4. Add to data set and domain knowledge
5. Iterate

If this sounds like the scientific method, you’re right! This is why it’s cool to hire PhDs again; in the context of Big Data, science is needed to create products. We’re not building bridges, we’re testing hypotheses, and this is the future of business.

But this alone is not why Big Data is a buzzword. The magic of Big Data is that we can iterate through the foo extremely rapidly: multiple hypotheses can be tested via distributed computing techniques in weeks instead of years, or even shorter time periods. There is currently a surplus of data and domain knowledge, so everyone is interested in getting their own piece of data real estate, that is, their own domain knowledge and data set. The demand is rising to meet the surplus, and as a result we're making a lot of money. Money means buzzwords. This is the magic of Big Data!

# Big Data and Natural Language Processing - Part 1

We hope you enjoyed the introduction to this series; part 1 is below.

“The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited.” -Ferdinand de Saussure

Language is dynamic - trying to create rules to capture the full scope of language (e.g. grammars) fails because of how rapidly language changes. Instead, it is much easier to learn from as many examples as possible and guess the likelihood of the meaning of language; this, after all, is what humans do. Therefore Natural Language Processing and Computational Linguistics are stochastic methodologies, and a subset of artificial intelligence that benefits from Machine Learning techniques.
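A minimal sketch of what "learn from examples and guess the likelihood" means in practice: count bigrams in a tiny, invented corpus and predict the most likely next word by relative frequency, with no grammar rules anywhere. A real model would of course train on millions of words:

```python
from collections import Counter

# a tiny invented corpus; stochastic NLP trains on far more examples
corpus = "the cat sat on the mat the cat ate the fish".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def most_likely_next(word):
    # guess the continuation we have seen most often after `word`
    candidates = {b: c for b, c in bigrams.items() if b[0] == word}
    return max(candidates, key=candidates.get)[1]

print(most_likely_next("the"))
```

Every prediction comes straight from observed frequencies; update the corpus and the guesses change with it, which is exactly why these methods track a dynamic language better than hand-written rules.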

Machine Learning has many flavors, and most of them attempt to get at the long tail -- i.e. the low-frequency events where the most relevant analysis occurs. To capture these events without resorting to some sort of comprehensive smoothing, more data is required; indeed, the more data the better. I have yet to observe a machine learning discipline that complained of having too much data. (Generally speaking, they complain of having too much modeling -- overfitting.) Therefore the stochastic approach of NLP needs Big Data.

The flip side of the coin is not as straightforward. We know there are many massive natural language data sets on the web and elsewhere. Consider tweets, reviews, job listings, emails, etc. These data sets fulfill the three V's of Big Data: velocity, variety, and volume. But do these data sets require comprehensive natural language processing to produce interesting data products?

The answer is: not yet. Hadoop and other tools already have built-in text processing support. There are many approaches being applied to these data sets, particularly inverted indices, co-location scores, even N-gram modeling. However, these approaches are not true NLP -- they are simply search. They leverage string matching and lightweight syntactic analysis to perform frequency analyses.
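To see why these techniques are "simply search," consider how little machinery an inverted index actually needs. This sketch (with an invented mini-corpus) is pure string manipulation, with no linguistic knowledge at all:

```python
from collections import defaultdict

# invented mini-corpus
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog barks",
}

# inverted index: token -> set of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

print(sorted(index["quick"]))
```

Lookups are just set operations over token occurrences; nothing here knows what a sentence, a sense, or a meaning is.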

We have not yet exhausted all opportunities to perform these frequency analyses -- many interesting results, particularly in clustering, classification, and authorship forensics, have been explored. However, these approaches will soon start to fail to produce the more interesting results that users are coming to expect. Products like machine translation, sentence generation, text summarization, and more meaningful text recommendation will require strong semantic methodologies, and eventually Big Data will come to require NLP; it's just not there yet.

# Natural Language Processing and Big Data: Using NLTK and Hadoop - Talk Overview

My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children's literature. Our approach started as a natural language processing problem -- designed to pull out language features to train our algorithms -- and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk given at the April Data Science DC meetup.

Think of this post as the CliffsNotes of the talk and the upcoming series of posts, so you don't have to read every word ... but trust me, it's worth it.

Related to the interaction between Big Data and NLP:

• Natural Language Processing needs Big Data
• Big Data doesn’t need NLP... yet.

Related to using Hadoop and NLTK:

• The combination of NLTK and Hadoop is perfect for preprocessing raw text
• More semantic analyses tend to be graph problems that MapReduce isn't great at computing.