NLP of Big Data using NLTK and Hadoop

My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children’s literature. Our approach started as a natural language processing problem, designed to extract language features to train our algorithms, and it quickly became a big data problem when we realized how much literature we had to go through to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk given at the April Data Science DC meetup.

Think of this post as the CliffsNotes of the talk and the upcoming series of posts, so you don’t have to read every word ... but trust me, it's worth it.

Related to the interaction between Big Data and NLP:

  • Natural Language Processing needs Big Data
  • Big Data doesn’t need NLP... yet.

Related to using Hadoop and NLTK:

  • The combination of NLTK and Hadoop is perfect for preprocessing raw text
  • More semantic analyses tend to be graph problems that MapReduce isn’t great at computing.
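To make the preprocessing point concrete, here is a minimal sketch of a token-count job in the style of Hadoop Streaming. It is an illustration, not the architecture we actually built: the function names `mapper`, `reducer`, and `run_local` are hypothetical, and the choice of NLTK's `wordpunct_tokenize` (a regex-based tokenizer that needs no downloaded corpora) is an assumption made to keep the example self-contained.

```python
import re
from itertools import groupby

try:
    # Regex-based NLTK tokenizer; requires no downloaded NLTK data.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback (assumption): the same pattern WordPunctTokenizer uses,
    # so the sketch still runs where NLTK is not installed.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def mapper(lines):
    """Map step: emit (token, 1) for every alphabetic token in raw text."""
    for line in lines:
        for token in wordpunct_tokenize(line.lower()):
            if token.isalpha():  # drop punctuation and numbers
                yield token, 1


def reducer(pairs):
    """Reduce step: sum counts per token.

    Assumes pairs arrive sorted by key, as Hadoop guarantees
    between the map and reduce phases.
    """
    for token, group in groupby(pairs, key=lambda kv: kv[0]):
        yield token, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce in one process (hypothetical helper)."""
    return dict(reducer(sorted(mapper(lines))))
```

Under Hadoop Streaming, the map and reduce steps would live in separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output; `run_local` only stands in for the shuffle and sort so the logic can be checked on a laptop first.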

About data products in general:

  • The foo of Big Data is the ability to take domain knowledge and a data set (or sets) and iterate quickly through hypotheses using available tools (NLP)
  • The magic of big data is that there is currently a surplus of both data and knowledge and our tools are working, so it’s easy to come up with a data product (until demand meets supply).

I'll go over each of these points in detail as I did in my presentation, so stay tuned for the longer version. [Editor: so long that it has been broken up into multiple posts.]