We chose NLTK (Natural Language Toolkit) particularly because it’s not Stanford. Stanford is kind of a magic black box, and it costs money to get a commercial license. NLTK is open source and it’s Python. But more importantly, NLTK is completely built around stochastic analysis techniques and comes with data sets and training mechanisms built in. Particularly because the magic and foo of Big Data with NLP requires using your own domain knowledge and data set, NLTK is extremely valuable from a leadership perspective! And anyway, it does come with out of the box NLP - use the Viterbi parser with a trained PCFG (Probabilistic Context Free Grammer, also called a Weighted Grammar) from the Penn Treebank, and you’ll get excellent parses immediately.
Choosing Hadoop might seem obvious given that we’re talking about Data Science and particularly Big Data. But I do want to point out that the NLP tasks that we’re going to talk about right off the bat are embarrassingly parallel - meaning that they are extremely well suited for the Map Reduce paradigm. If you consider the unit of natural language the sentence, then each sentence (at least to begin with) can be analyzed on its own with little required knowledge about the surrounding processing of sentences.
Combine that with the many flavors of Hadoop and the fact that you can get a cluster going in your closet for cheap-- it’s the right price for a startup!
The glue to making NLTK (Python) and Hadoop (Java) play nice is Hadoop Streaming. Hadoop Streaming will allow you to create a mapper and a reducer with any executable, and expects that the executable will receive key-value pairs via
stdin and output them via
stdout. Just keep in mind that all other Hadoopy-ness exists, e.g. the
FileInputFormat, HDFS, and Job Scheduling, all you get to replace is the mapper and reducer, but this is enough to include NLTK, so you’re off and running!
Here’s an example of a Mapper and Reducer to get you started doing token counts with NLTK (note that these aren’t word counts -- to computational linguists, words are language elements that have senses and therefore convey meaning. Instead, you’ll be counting tokens, the syntactic base for words in this example, and you might be surprised to find out what tokens are-- trust me, it isn’t as simple as splitting on whitespace and punctuation!).
#!/usr/bin/env python import sys from nltk.tokenize import wordpunct_tokenize def read_input(file): for line in file: # split the line into tokens yield wordpunct_tokenize(line) def main(separator='\t'): # input comes from STDIN (standard input) data = read_input(sys.stdin) for tokens in data: # write the results to STDOUT (standard output); # what we output here will be the input for the # Reduce step, i.e. the input for reducer.py # # tab-delimited; the trivial token count is 1 for token in tokens: print '%s%s%d' % (word, separator, 1) if __name__ == "__main__": main()
#!/usr/bin/env python from itertools import groupby from operator import itemgetter import sys def read_mapper_output(file, separator='\t'): for line in file: yield line.rstrip().split(separator, 1) def main(separator='\t'): # input comes from STDIN (standard input) data = read_mapper_output(sys.stdin, separator=separator) # groupby groups multiple word-count pairs by word, # and creates an iterator that returns consecutive keys and their group: # current_word - string containing a word (the key) # group - iterator yielding all ["<current_word>", "<count>"] items for current_word, group in groupby(data, itemgetter(0)): try: total_count = sum(int(count) for current_word, count in group) print "%s%s%d" % (current_word, separator, total_count) except ValueError: # count was not a number, so silently discard this item pass if __name__ == "__main__": main()
Running the Job
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \ -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \ -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \ -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
Some interesting notes about using Hadoop Streaming relate to memory usage. NLP can sometimes be a memory intensive task as you have to load up training data to compute various aspects of your processing- loading these things up can take minutes at the beginning of your processing. However, with Hadoop Streaming, only one interpreter per job is loaded, thus saving you repeating that loading process. Similarly, you can use generators and other python iteration techniques to carve through mountains of data very easily. There are some Python libraries, including dumbo, mrjob, and hadoopy that can make all of this a bit easier.