Welcome to Part 2 of this epic Big Data and Natural Language Processing perspective series. Here are the intro and Part 1 if you missed either of them.

Domain knowledge is incredibly important, particularly in the context of stochastic methodologies, and particularly in NLP. Not all languages, texts, or domains have the same requirements, and there is no way to build one universal model that covers them all. Consider how the language of doctors or lawyers may fall outside the experience of those of us who speak the language of computer science or data science. Even more generally, regions tend to have specialized dialects or phrases even within the same language. As an anthropological rule, groups tend to specialize particular language features to communicate more effectively, and attempting to capture all of these specializations in a single model leads to ambiguity.

This leads to an interesting hypothesis: the foo of big data is to combine specialized domain knowledge with an interesting data set.

Further, given that domain knowledge and one or more interesting data sets, the process looks like this:

  1. Form hypothesis (a possible data product)
  2. Mix in NLP techniques and machine learning tools
  3. Perform computation and test hypothesis
  4. Add to data set and domain knowledge
  5. Iterate
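The loop above can be sketched in a few lines of Python. This is only a toy illustration, not anything from the series itself: the `tokenize`, `classify`, `accuracy`, and `iterate` functions, the seed term, and the four labeled sentences are all hypothetical, and the "NLP technique" is a bare keyword match standing in for real tooling such as NLTK. It also scores the hypothesis on the same data it learns from, which a real experiment would avoid by using held-out data.

```python
from collections import Counter

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def classify(text, domain_terms):
    """Hypothesis: a text is in-domain if it uses any known domain term."""
    return any(tok in domain_terms for tok in tokenize(text))

def accuracy(labeled, domain_terms):
    """Test the hypothesis: fraction of texts whose prediction matches the label."""
    hits = sum(classify(text, domain_terms) == label for text, label in labeled)
    return hits / len(labeled)

def iterate(labeled, seed_terms, rounds=3):
    """Form hypothesis, test it, add to domain knowledge, repeat."""
    terms = set(seed_terms)
    history = []
    for _ in range(rounds):
        history.append(accuracy(labeled, terms))
        # Words seen in out-of-domain texts are poor candidates for domain terms.
        out_vocab = {t for text, label in labeled if not label
                     for t in tokenize(text)}
        # Count candidate words in the in-domain texts we failed to recognize.
        missed = Counter(
            t for text, label in labeled
            if label and not classify(text, terms)
            for t in tokenize(text) if t not in out_vocab
        )
        if not missed:
            break  # nothing left to learn from this data set
        terms.add(missed.most_common(1)[0][0])
    return terms, history

# Hypothetical toy data: True marks a "medical" sentence.
labeled = [
    ("The patient presented with acute dyspnea", True),
    ("Administer the dosage after diagnosis", True),
    ("The plaintiff filed a motion", False),
    ("Compile the code and run the tests", False),
]
terms, history = iterate(labeled, {"patient"})
print(history)
```

Each pass records how well the current domain vocabulary explains the data, then folds one newly discovered term back into that vocabulary before trying again.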

If this sounds like the scientific method, you’re right! This is why it’s cool to hire PhDs again; in the context of Big Data, science is needed to create products. We’re not building bridges, we’re testing hypotheses, and this is the future of business.

But this alone is not why Big Data is a buzzword. The magic of big data is that we can iterate through the foo extremely rapidly: multiple hypotheses can be tested via distributed computing techniques in weeks instead of years, or in even shorter time periods. There is currently a surplus of data and domain knowledge, so everyone is interested in getting their own piece of data real estate, that is, getting their domain knowledge and their data set. The demand is rising to meet the surplus, and as a result we’re making a lot of money. Money means buzzwords. This is the magic of Big Data!