Python's Natural Language Took Kit (NLTK) and Hadoop - Part 3

Welcome back to part 3 of Ben's talk about Big Data and Natural Language Processing. (Click through to see the intro, part 1, and part 2).

NLP of Big Data using NLTK and Hadoop12

We chose NLTK (Natural Language Toolkit) particularly because it’s not Stanford. Stanford is kind of a magic black box, and it costs money to get a commercial license. NLTK is open source and it’s Python. But more importantly, NLTK is completely built around stochastic analysis techniques and comes with data sets and training mechanisms built in. Particularly because the magic and foo of Big Data with NLP requires using your own domain knowledge and data set, NLTK is extremely valuable from a leadership perspective! And anyway, it does come with out of the box NLP - use the Viterbi parser with a trained PCFG (Probabilistic Context Free Grammer, also called a Weighted Grammar)  from the Penn Treebank, and you’ll get excellent parses immediately.

NLP of Big Data using NLTK and Hadoop13

Choosing Hadoop might seem obvious given that we’re talking about Data Science and particularly Big Data. But I do want to point out that the NLP tasks that we’re going to talk about right off the bat are embarrassingly parallel - meaning that they are extremely well suited for the Map Reduce paradigm. If you consider the unit of natural language the sentence, then each sentence (at least to begin with) can be analyzed on its own with little required knowledge about the surrounding processing of sentences.

NLP of Big Data using NLTK and Hadoop14

Combine that with the many flavors of Hadoop and the fact that you can get a cluster going in your closet for cheap-- it’s the right price for a startup!

NLP of Big Data using NLTK and Hadoop15

The glue to making NLTK (Python) and Hadoop (Java) play nice is Hadoop Streaming. Hadoop Streaming will allow you to create a mapper and a reducer with any executable, and expects that the executable will receive key-value pairs via stdin and output them via stdout. Just keep in mind that all other Hadoopy-ness exists, e.g. the FileInputFormat, HDFS, and Job Scheduling, all you get to replace is the mapper and reducer, but this is enough to include NLTK, so you’re off and running!

Here’s an example of a Mapper and Reducer to get you started doing token counts with NLTK (note that these aren’t word counts -- to computational linguists, words are language elements that have senses and therefore convey meaning. Instead, you’ll be counting tokens, the syntactic base for words in this example, and you might be surprised to find out what tokens are-- trust me, it isn’t as simple as splitting on whitespace and punctuation!).

    #!/usr/bin/env python

    import sys
    from nltk.tokenize import wordpunct_tokenize

    def read_input(file):
        for line in file:
            # split the line into tokens
            yield wordpunct_tokenize(line)

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_input(sys.stdin)
        for tokens in data:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for
            # tab-delimited; the trivial token count is 1
            for token in tokens:
                print '%s%s%d' % (word, separator, 1)

    if __name__ == "__main__":

    #!/usr/bin/env python

    from itertools import groupby
    from operator import itemgetter
    import sys

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_mapper_output(sys.stdin, separator=separator)
        # groupby groups multiple word-count pairs by word,
        # and creates an iterator that returns consecutive keys and their group:
        #   current_word - string containing a word (the key)
        #   group - iterator yielding all ["<current_word>", "<count>"] items
        for current_word, group in groupby(data, itemgetter(0)):
                total_count = sum(int(count) for current_word, count in group)
                print "%s%s%d" % (current_word, separator, total_count)
            except ValueError:
                # count was not a number, so silently discard this item

    if __name__ == "__main__":

Running the Job

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -file /home/hduser/    -mapper /home/hduser/ \
        -file /home/hduser/   -reducer /home/hduser/ \
        -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

Some interesting notes about using Hadoop Streaming relate to memory usage. NLP can sometimes be a memory intensive task as you have to load up training data to compute various aspects of your processing- loading these things up can take minutes at the beginning of your processing. However, with Hadoop Streaming, only one interpreter per job is loaded, thus saving you repeating that loading process. Similarly, you can use generators and other python iteration techniques to carve through mountains of data very easily. There are some Python libraries, including dumbo, mrjob, and hadoopy that can make all of this a bit easier.