# Beyond Preprocessing - Weakly Inferred Meanings - Part 5

Congrats! This is the final post in our 6 part series! Just in case you have missed any parts, click through to the introductionpart 1part 2, part 3, and part 4.

After you have treebanks, then what? The answer is that syntactic guessing is not the final frontier of NLP, we must go beyond to something more semantic. The idea is to determine the meaning of text in a machine tractable way by creating a TMR, a text-meaning representation (or thematic meaning representation). This, however, is not a trivial task, and now you’re at the frontier of the science.

Text Meaning Representations are language-independent representations of a language unit, and can be thought of as a series of connected frames that represent knowledge. TMRs allow us to do extremely deep querying of natural language, including the creation of knowledge bases, question and answer systems, and even allow for conversational agents. Unfortunately they require extremely deep ontologies and knowledge to be constructed and can be extremely process intensive, particularly in resolving ambiguity.

We have created a system called WIMs -- Weakly Inferred Meanings that attempts to stand in the middle ground of no semantic computation at all, and the extremely labor intensive TMR. WIMs reduces the search space by using a limited, but extremely important set of relations. These relations can be created using available knowledge -- Wordnet has proved to be a valuable ontology for creating WIMs, and they are extremely lightweight.

Even better, they’re open source!

Both TMRs and WIMs are graph representations of content, and therefore any semantic computation involving these techniques will involve graph traversal. Although there are graph databases created on top of HDFS (particularly Titan on HBase), graph computation is not the strong point of MapReduce. Hadoop, unfortunately, can only get us so far.

# Hadoop for Preprocessing Language - Part 4

We are glad that you have stuck around for this long and, just in case you have missed any parts, click through to the introductionpart 1part 2, and part 3.

You might ask me, doesn’t Hadoop do text processing extremely well? After all, the first Hadoop jobs we learn are word count and inverted index!

The answer is that NLP preprocessing techniques are more complicated than splitting on whitespace and punctuation, and different tasks require different kinds of tokenization (also called segmentation or chunking). Consider the following sentence:

“You're not going to the U.S.A. in that super-zeppelin, Dr. Stoddard?”

How do you split this as a stand alone sentence? If you simply used punctuation, this would segment (sentence tokenization) to six sentences (“You’re not going to the U.”, “S.”, “A.”, “in that super-zeppelin, Dr.”, “Stoddard?”). Also, is the “You’re” two tokens or a single token? What about Punctuation? Is “Dr. Stoddard” one token or more? How about “super-zeppelin”. N-Gram analysis and other syntactic tokenization will also probably require different token lengths that go beyond white space.

So we require some more formal NLP mechanisms even for simple tokenization. However, I propose that Hadoop might be perfect for language preprocessing. A Hadoop job creates output in the file system, so each job can be considered an NLP preprocessing task. Moreover, in many other Big Data analytics, Hadoop is used this way; last mile computations usually occur within 100GB of memory, Map Reduce jobs are used to perform calculations designed to transform data into something that is computable in that memory space. We will do the same thing with NLP, and transform our raw text as follows:

Raw Text → Tokenized Text → Tagged Text → Parsed Text → Treebanks

Namely, after we have tokenized our text depending on our requirements, splitting it into sentences, chunks and tokens as required, we then want to understand the syntactic class of the tokens, and tag it as such. Tagged text can then be structured into parses - a structured representation of the sentence. The final output, used for training our stochastic mechanisms and going beyond to more semantic analyses are treebanks. Each of these tasks can be one or more MapReduce jobs.

NLTK comes with a few notable built-ins making your preprocessing with Hadoop integration easier (you’ll note all these methods are stochastic):

• Punct Word and Sentence tokenizer uses an unsupervised training set to capture the beginning of sentences and other non-sentence termination marks. It doesn’t require a single sentence to perform tokenization.

• Brill Tagger - a transformational rule based tagger that does a first pass tagging then applies rules that were trained from a tagged training data set.

• Viterbi Parser- a dynamic programming algorithm that uses a weighted grammar to fill in a most-likely-constituent table and very quickly come up with the most likely parse.

The end result after a series of MapReduce jobs (we had six) was a Treebank -- a machine tractable syntactic representation of language; it’s very important.

# Python's Natural Language Took Kit (NLTK) and Hadoop - Part 3

Welcome back to part 3 of Ben's talk about Big Data and Natural Language Processing. (Click through to see the intro, part 1, and part 2).

We chose NLTK (Natural Language Toolkit) particularly because it’s not Stanford. Stanford is kind of a magic black box, and it costs money to get a commercial license. NLTK is open source and it’s Python. But more importantly, NLTK is completely built around stochastic analysis techniques and comes with data sets and training mechanisms built in. Particularly because the magic and foo of Big Data with NLP requires using your own domain knowledge and data set, NLTK is extremely valuable from a leadership perspective! And anyway, it does come with out of the box NLP - use the Viterbi parser with a trained PCFG (Probabilistic Context Free Grammer, also called a Weighted Grammar)  from the Penn Treebank, and you’ll get excellent parses immediately.

Choosing Hadoop might seem obvious given that we’re talking about Data Science and particularly Big Data. But I do want to point out that the NLP tasks that we’re going to talk about right off the bat are embarrassingly parallel - meaning that they are extremely well suited for the Map Reduce paradigm. If you consider the unit of natural language the sentence, then each sentence (at least to begin with) can be analyzed on its own with little required knowledge about the surrounding processing of sentences.

Combine that with the many flavors of Hadoop and the fact that you can get a cluster going in your closet for cheap-- it’s the right price for a startup!

The glue to making NLTK (Python) and Hadoop (Java) play nice is Hadoop Streaming. Hadoop Streaming will allow you to create a mapper and a reducer with any executable, and expects that the executable will receive key-value pairs via stdin and output them via stdout. Just keep in mind that all other Hadoopy-ness exists, e.g. the FileInputFormat, HDFS, and Job Scheduling, all you get to replace is the mapper and reducer, but this is enough to include NLTK, so you’re off and running!

Here’s an example of a Mapper and Reducer to get you started doing token counts with NLTK (note that these aren’t word counts -- to computational linguists, words are language elements that have senses and therefore convey meaning. Instead, you’ll be counting tokens, the syntactic base for words in this example, and you might be surprised to find out what tokens are-- trust me, it isn’t as simple as splitting on whitespace and punctuation!).

### mapper.py


#!/usr/bin/env python

import sys
from nltk.tokenize import wordpunct_tokenize

for line in file:
# split the line into tokens
yield wordpunct_tokenize(line)

def main(separator='\t'):
# input comes from STDIN (standard input)
for tokens in data:
# write the results to STDOUT (standard output);
# what we output here will be the input for the
# Reduce step, i.e. the input for reducer.py
#
# tab-delimited; the trivial token count is 1
for token in tokens:
print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
main()

### reducer.py


#!/usr/bin/env python

from itertools import groupby
from operator import itemgetter
import sys

for line in file:
yield line.rstrip().split(separator, 1)

def main(separator='\t'):
# input comes from STDIN (standard input)
# groupby groups multiple word-count pairs by word,
# and creates an iterator that returns consecutive keys and their group:
#   current_word - string containing a word (the key)
#   group - iterator yielding all ["<current_word>", "<count>"] items
for current_word, group in groupby(data, itemgetter(0)):
try:
total_count = sum(int(count) for current_word, count in group)
print "%s%s%d" % (current_word, separator, total_count)
except ValueError:
# count was not a number, so silently discard this item
pass

if __name__ == "__main__":
main()

### Running the Job


-file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
-file /home/hduser/reducer.py   -reducer /home/hduser/reducer.py \
-input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

# The "Foo" of Big Data - Part 2

Welcome to Part 2 of this epic Big Data and Natural Language Processing perspective series. Here is the intro and part one if you missed any of them.

Domain knowledge is incredibly important, particularly in the context of stochastic methodologies, and particularly in NLP. Not all language, text, or domains have the same requirements, and there is no way to make a universal model for it. Consider how the language of doctors and lawyers may be outside our experience in the language of computer science or data science. Even more generally, regions tends to have specialized dialects or phrases even within the same language. As an anthropological rule, groups tend to specialize particular language features to communicate more effectively, and attempting to capture all of these specializations leads to ambiguity.

This leads to an interesting hypothesis: the foo of big data is to combine specialized domain knowledge with an interesting data set.

Further, given that domain knowledge and an interesting data set or sets:

1. Form hypothesis (a possible data product)
2. Mix in NLP techniques and machine learning tools
3. Perform computation and test hypothesis
4. Add to data set and domain knowledge
5. Iterate

If this sounds like the scientific method, you’re right! This is why it’s cool to hire PhDs again; in the context of Big Data, science is needed to create products. We’re not building bridges, we’re testing hypotheses, and this is the future of business.

But this alone is not why Big Data is a buzzword. The magic of big data is that we can iterate through the foo extremely rapidly and  multiple hypotheses can be tested via distributed computing techniques in weeks instead of years or ever shorter time periods. There is currently a surplus of data and domain knowledge, so everyone is interested in getting their own piece of data real estate, that is, getting their domain knowledge and their data set. The demand is rising to meet the surplus, and as a result we’re making a lot of money. Money means buzzwords. This is the magic of Big Data!

# Big Data and Natural Language Processing - Part 1

We hope you enjoyed the introduction to this series, part 1 is below.

“The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited.” -Ferdinand de Saussure

Language is dynamic - trying to create rules to capture the full scope of language (e.g. grammars) fails because of how rapidly language changes. Instead, it is much easier to learn from as many examples as possible and guess the likelihood of the meaning of language; this, after all, is what humans do. Therefore Natural Language Processing and Computational Linguistics are stochastic methodologies, and a subset of artificial intelligence that benefits from Machine Learning techniques.

Machine Learning has many flavors, and most of them attempt to get at the long tail -- e.g. the low frequency events where the most relevant analysis occurs. To capture these events without resorting to some sort of comprehensive smoothing, more data is required, indeed the more data the better. I have yet to observe a machine learning discipline that complained of having too much data. (Generally speaking they complain of having too much modeling -- overfit). Therefore the stochastic approach of NLP needs Big Data.

The flip side of the coin is not as straightforward. We know there are many massive natural language data sets on the web and elsewhere. Consider tweets, reviews, job listings, emails, etc. These data sets fulfil the three V’s of Big Data: velocity, variety, and volume. But do these data sets require comprehensive natural language processing to produce interesting data products?

The answer is not yet. Hadoop and other tools already have build in text processing support. There are many approaches being applied to these data sets, particularly inverted indices, co-location scores, even N-Gram modeling. However, these approaches are not true NLP -- they are simply search. They leverage string and lightweight syntactic analysis to perform frequency analyses.

We have not yet exhausted all opportunities to perform these frequency analyses -- many interesting results, particularly in clustering, classification and authorship forensics, have been explored. However, these approaches will soon start to fail to produce the more interesting results that users are coming to expect. Products like machine translation, sentence generation, text summarization, and more meaningful text recommendation will require strong semantic methodologies, and eventually Big Data will come to require NLP, it’s just not there yet.

# Natural Language Processing and Big Data: Using NLTK and Hadoop - Talk Overview

My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children’s literature. Our approach started as a natural language processing problem -- designed to pull out language features to train our algorithms, and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk done at the April Data Science DC meetup.

Think of this post as the Cliff Notes of the talk and the upcoming series of posts so you don’t have to read every word ... but trust me, it's worth it.

Related to the interaction between Big Data and NLP:

• Natural Language Processing needs Big Data
• Big Data doesn’t need NLP... yet.

Related to using Hadoop and NLTK:

• The combination of NLTK and Hadoop is perfect for prepossessing raw text
• More semantic analysis tend to be graph problems that Map Reduce isn’t great at computing.