After you have treebanks, then what? The answer is that syntactic guessing is not the final frontier of NLP, we must go beyond to something more semantic. The idea is to determine the meaning of text in a machine tractable way by creating a TMR, a text-meaning representation (or thematic meaning representation). This, however, is not a trivial task, and now you’re at the frontier of the science.
Text Meaning Representations are language-independent representations of a language unit, and can be thought of as a series of connected frames that represent knowledge. TMRs allow us to do extremely deep querying of natural language, including the creation of knowledge bases, question and answer systems, and even allow for conversational agents. Unfortunately they require extremely deep ontologies and knowledge to be constructed and can be extremely process intensive, particularly in resolving ambiguity.
We have created a system called WIMs -- Weakly Inferred Meanings that attempts to stand in the middle ground of no semantic computation at all, and the extremely labor intensive TMR. WIMs reduces the search space by using a limited, but extremely important set of relations. These relations can be created using available knowledge -- Wordnet has proved to be a valuable ontology for creating WIMs, and they are extremely lightweight.
Even better, they’re open source!
Both TMRs and WIMs are graph representations of content, and therefore any semantic computation involving these techniques will involve graph traversal. Although there are graph databases created on top of HDFS (particularly Titan on HBase), graph computation is not the strong point of MapReduce. Hadoop, unfortunately, can only get us so far.