# Python's Natural Language Took Kit (NLTK) and Hadoop - Part 3

Welcome back to part 3 of Ben's talk about Big Data and Natural Language Processing. (Click through to see the intro, part 1, and part 2).

We chose NLTK (Natural Language Toolkit) particularly because it’s not Stanford. Stanford is kind of a magic black box, and it costs money to get a commercial license. NLTK is open source and it’s Python. But more importantly, NLTK is completely built around stochastic analysis techniques and comes with data sets and training mechanisms built in. Particularly because the magic and foo of Big Data with NLP requires using your own domain knowledge and data set, NLTK is extremely valuable from a leadership perspective! And anyway, it does come with out of the box NLP - use the Viterbi parser with a trained PCFG (Probabilistic Context Free Grammer, also called a Weighted Grammar)  from the Penn Treebank, and you’ll get excellent parses immediately.

Choosing Hadoop might seem obvious given that we’re talking about Data Science and particularly Big Data. But I do want to point out that the NLP tasks that we’re going to talk about right off the bat are embarrassingly parallel - meaning that they are extremely well suited for the Map Reduce paradigm. If you consider the unit of natural language the sentence, then each sentence (at least to begin with) can be analyzed on its own with little required knowledge about the surrounding processing of sentences.

Combine that with the many flavors of Hadoop and the fact that you can get a cluster going in your closet for cheap-- it’s the right price for a startup!

The glue to making NLTK (Python) and Hadoop (Java) play nice is Hadoop Streaming. Hadoop Streaming will allow you to create a mapper and a reducer with any executable, and expects that the executable will receive key-value pairs via stdin and output them via stdout. Just keep in mind that all other Hadoopy-ness exists, e.g. the FileInputFormat, HDFS, and Job Scheduling, all you get to replace is the mapper and reducer, but this is enough to include NLTK, so you’re off and running!

Here’s an example of a Mapper and Reducer to get you started doing token counts with NLTK (note that these aren’t word counts -- to computational linguists, words are language elements that have senses and therefore convey meaning. Instead, you’ll be counting tokens, the syntactic base for words in this example, and you might be surprised to find out what tokens are-- trust me, it isn’t as simple as splitting on whitespace and punctuation!).

### mapper.py


#!/usr/bin/env python

import sys
from nltk.tokenize import wordpunct_tokenize

for line in file:
# split the line into tokens
yield wordpunct_tokenize(line)

def main(separator='\t'):
# input comes from STDIN (standard input)
for tokens in data:
# write the results to STDOUT (standard output);
# what we output here will be the input for the
# Reduce step, i.e. the input for reducer.py
#
# tab-delimited; the trivial token count is 1
for token in tokens:
print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
main()

### reducer.py


#!/usr/bin/env python

from itertools import groupby
from operator import itemgetter
import sys

for line in file:
yield line.rstrip().split(separator, 1)

def main(separator='\t'):
# input comes from STDIN (standard input)
# groupby groups multiple word-count pairs by word,
# and creates an iterator that returns consecutive keys and their group:
#   current_word - string containing a word (the key)
#   group - iterator yielding all ["<current_word>", "<count>"] items
for current_word, group in groupby(data, itemgetter(0)):
try:
total_count = sum(int(count) for current_word, count in group)
print "%s%s%d" % (current_word, separator, total_count)
except ValueError:
# count was not a number, so silently discard this item
pass

if __name__ == "__main__":
main()

### Running the Job


-input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output