Big Data and Natural Language Processing - Part 1

We hope you enjoyed the introduction to this series. Part 1 is below.

“The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited.” -Ferdinand de Saussure

Language is dynamic: trying to write rules that capture its full scope (e.g. grammars) fails because language changes so rapidly. Instead, it is much easier to learn from as many examples as possible and estimate the likelihood of a given meaning; this, after all, is what humans do. Natural Language Processing and Computational Linguistics are therefore stochastic methodologies, and a subset of artificial intelligence that benefits from Machine Learning techniques.

Machine Learning has many flavors, and most of them attempt to get at the long tail -- i.e. the low-frequency events where the most relevant analysis occurs. To capture these events without resorting to some sort of comprehensive smoothing, more data is required; indeed, the more data the better. I have yet to observe a machine learning discipline that complained of having too much data. (Generally speaking, they complain of having too much modeling, that is, overfitting.) Therefore the stochastic approach of NLP needs Big Data.
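To make the smoothing trade-off concrete, here is a minimal sketch assuming a toy corpus and add-one (Laplace) smoothing. Both the corpus and the choice of Laplace smoothing are illustrative assumptions, not something prescribed by this series. A word that never appears in the data gets probability zero under a raw frequency estimate, and smoothing only papers over that gap by borrowing probability mass from the words that were seen.

```python
from collections import Counter

def mle_prob(word, counts):
    """Raw relative frequency: unseen words get probability 0."""
    total = sum(counts.values())
    return counts[word] / total

def laplace_prob(word, counts, vocab_size):
    """Add-one smoothing: every word in the vocabulary gets one extra count."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat the dog sat on the log".split()
counts = Counter(corpus)

# Hypothetical vocabulary: the observed words plus one long-tail word
# that never occurs in the corpus.
vocab = set(corpus) | {"aardvark"}

print(mle_prob("aardvark", counts))                  # unseen word, raw estimate
print(laplace_prob("aardvark", counts, len(vocab)))  # unseen word, smoothed
print(mle_prob("the", counts))                       # frequent word, raw estimate
print(laplace_prob("the", counts, len(vocab)))       # frequent word, smoothed
```

The smoothed estimate for the unseen word is a guess imposed by the model, not something learned from data, and it comes at the expense of the frequent words. Observing the rare event in a larger corpus is the alternative the paragraph above argues for.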

The flip side of the coin is not as straightforward. We know there are many massive natural language data sets on the web and elsewhere. Consider tweets, reviews, job listings, emails, etc. These data sets fulfil the three V’s of Big Data: velocity, variety, and volume. But do these data sets require comprehensive natural language processing to produce interesting data products?

The answer is: not yet. Hadoop and other tools already have built-in text processing support. Many approaches are being applied to these data sets, particularly inverted indices, collocation scores, and even N-gram modeling. However, these approaches are not true NLP -- they are simply search. They leverage string and lightweight syntactic analysis to perform frequency analyses.
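As a rough illustration of what these frequency analyses look like, the sketch below builds an inverted index and counts bigrams using only the Python standard library. The documents, the whitespace tokenizer, and the function names are all assumptions made for the example; a real pipeline on Hadoop would distribute the same counting across many machines.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def bigram_counts(docs):
    """Count adjacent token pairs across all documents."""
    counts = Counter()
    for text in docs.values():
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Hypothetical two-document collection.
docs = {
    1: "big data needs hadoop",
    2: "natural language processing needs big data",
}

index = build_inverted_index(docs)
bigrams = bigram_counts(docs)

print(sorted(index["big"]))        # ids of documents containing "big"
print(bigrams[("big", "data")])    # how often "big data" occurs as a bigram
```

Nothing here looks at meaning: the code only matches strings and counts them, which is the sense in which these techniques are search rather than NLP.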

We have not yet exhausted all opportunities to perform these frequency analyses -- many interesting results, particularly in clustering, classification and authorship forensics, have been explored. However, these approaches will soon fail to produce the more interesting results that users are coming to expect. Products like machine translation, sentence generation, text summarization, and more meaningful text recommendation will require strong semantic methodologies, and eventually Big Data will come to require NLP. It’s just not there yet.