# Natural Language Processing with Python Workshop on April 9th

Data Community DC and District Data Labs are hosting a Natural Language Processing with Python workshop on Saturday April 9th from 9am - 5pm. Register before March 26th for an early bird discount!

# Data Community DC Video Series Kicks Off: Dr. Jesse English Talks NLP and Text Processing

We are excited to announce the first in a new series of posts and a brand new initiative: Data Community DC Videos! We are going to film and publish online videos (and separate audio, resources permitting) as many talks from Data Community DC meetups as possible. Yes, we want you to experience the events in person, but realize that not everyone who wants to be a part of our community can attend every single event. To kick this off, we have a fantastic video of Dr. Jesse English passionately discussing a brand new, open source framework, WIMs (Weakly Inferred Meanings), a novel approach to creating structured meaning representations for semantic analyses. Whereas a TMR (text meaning representation) requires a large, domain-specific knowledge base and significant computation times, WIMs cover a limited scope of possible relationships. The limitation is intentional, and allows for better performance-- but still carries enough relationships for most applications. Additionally, the creation of a bespoke knowledge base and microtheory is not required, the novel pattern matching technique means that available ontologies like WordNet provide enough coverage. WIMs are Open Source and available now, and are truly a break through in semantic processing.

Dr. Jesse English holds a PhD in computer science from UMBC and has specialized in natural language processing, machine learning and machine reading. As the Chief Science Officer at Unbound Concepts, Jesse's focused on automatic extraction of semantically rich meaning from literature, and application of that knowledge to the company's big-data driven machine learning algorithm. Before his work at Unbound Concepts, Jesse worked as a research associate at UMBC, focusing on automatically bridging the knowledge acquisition bottleneck through machine reading, as well as developing agent-based conversation systems.

[vimeo 61955058]

# Beyond Preprocessing - Weakly Inferred Meanings - Part 5

Congrats! This is the final post in our 6 part series! Just in case you have missed any parts, click through to the introductionpart 1part 2, part 3, and part 4.

After you have treebanks, then what? The answer is that syntactic guessing is not the final frontier of NLP, we must go beyond to something more semantic. The idea is to determine the meaning of text in a machine tractable way by creating a TMR, a text-meaning representation (or thematic meaning representation). This, however, is not a trivial task, and now you’re at the frontier of the science.

Text Meaning Representations are language-independent representations of a language unit, and can be thought of as a series of connected frames that represent knowledge. TMRs allow us to do extremely deep querying of natural language, including the creation of knowledge bases, question and answer systems, and even allow for conversational agents. Unfortunately they require extremely deep ontologies and knowledge to be constructed and can be extremely process intensive, particularly in resolving ambiguity.

We have created a system called WIMs -- Weakly Inferred Meanings that attempts to stand in the middle ground of no semantic computation at all, and the extremely labor intensive TMR. WIMs reduces the search space by using a limited, but extremely important set of relations. These relations can be created using available knowledge -- Wordnet has proved to be a valuable ontology for creating WIMs, and they are extremely lightweight.

Even better, they’re open source!

Both TMRs and WIMs are graph representations of content, and therefore any semantic computation involving these techniques will involve graph traversal. Although there are graph databases created on top of HDFS (particularly Titan on HBase), graph computation is not the strong point of MapReduce. Hadoop, unfortunately, can only get us so far.

# Hadoop for Preprocessing Language - Part 4

We are glad that you have stuck around for this long and, just in case you have missed any parts, click through to the introductionpart 1part 2, and part 3.

You might ask me, doesn’t Hadoop do text processing extremely well? After all, the first Hadoop jobs we learn are word count and inverted index!

The answer is that NLP preprocessing techniques are more complicated than splitting on whitespace and punctuation, and different tasks require different kinds of tokenization (also called segmentation or chunking). Consider the following sentence:

“You're not going to the U.S.A. in that super-zeppelin, Dr. Stoddard?”

How do you split this as a stand alone sentence? If you simply used punctuation, this would segment (sentence tokenization) to six sentences (“You’re not going to the U.”, “S.”, “A.”, “in that super-zeppelin, Dr.”, “Stoddard?”). Also, is the “You’re” two tokens or a single token? What about Punctuation? Is “Dr. Stoddard” one token or more? How about “super-zeppelin”. N-Gram analysis and other syntactic tokenization will also probably require different token lengths that go beyond white space.

So we require some more formal NLP mechanisms even for simple tokenization. However, I propose that Hadoop might be perfect for language preprocessing. A Hadoop job creates output in the file system, so each job can be considered an NLP preprocessing task. Moreover, in many other Big Data analytics, Hadoop is used this way; last mile computations usually occur within 100GB of memory, Map Reduce jobs are used to perform calculations designed to transform data into something that is computable in that memory space. We will do the same thing with NLP, and transform our raw text as follows:

Raw Text → Tokenized Text → Tagged Text → Parsed Text → Treebanks

Namely, after we have tokenized our text depending on our requirements, splitting it into sentences, chunks and tokens as required, we then want to understand the syntactic class of the tokens, and tag it as such. Tagged text can then be structured into parses - a structured representation of the sentence. The final output, used for training our stochastic mechanisms and going beyond to more semantic analyses are treebanks. Each of these tasks can be one or more MapReduce jobs.

NLTK comes with a few notable built-ins making your preprocessing with Hadoop integration easier (you’ll note all these methods are stochastic):

• Punct Word and Sentence tokenizer uses an unsupervised training set to capture the beginning of sentences and other non-sentence termination marks. It doesn’t require a single sentence to perform tokenization.

• Brill Tagger - a transformational rule based tagger that does a first pass tagging then applies rules that were trained from a tagged training data set.

• Viterbi Parser- a dynamic programming algorithm that uses a weighted grammar to fill in a most-likely-constituent table and very quickly come up with the most likely parse.

The end result after a series of MapReduce jobs (we had six) was a Treebank -- a machine tractable syntactic representation of language; it’s very important.

# Python's Natural Language Took Kit (NLTK) and Hadoop - Part 3

Welcome back to part 3 of Ben's talk about Big Data and Natural Language Processing. (Click through to see the intro, part 1, and part 2).

We chose NLTK (Natural Language Toolkit) particularly because it’s not Stanford. Stanford is kind of a magic black box, and it costs money to get a commercial license. NLTK is open source and it’s Python. But more importantly, NLTK is completely built around stochastic analysis techniques and comes with data sets and training mechanisms built in. Particularly because the magic and foo of Big Data with NLP requires using your own domain knowledge and data set, NLTK is extremely valuable from a leadership perspective! And anyway, it does come with out of the box NLP - use the Viterbi parser with a trained PCFG (Probabilistic Context Free Grammer, also called a Weighted Grammar)  from the Penn Treebank, and you’ll get excellent parses immediately.

Choosing Hadoop might seem obvious given that we’re talking about Data Science and particularly Big Data. But I do want to point out that the NLP tasks that we’re going to talk about right off the bat are embarrassingly parallel - meaning that they are extremely well suited for the Map Reduce paradigm. If you consider the unit of natural language the sentence, then each sentence (at least to begin with) can be analyzed on its own with little required knowledge about the surrounding processing of sentences.

Combine that with the many flavors of Hadoop and the fact that you can get a cluster going in your closet for cheap-- it’s the right price for a startup!

The glue to making NLTK (Python) and Hadoop (Java) play nice is Hadoop Streaming. Hadoop Streaming will allow you to create a mapper and a reducer with any executable, and expects that the executable will receive key-value pairs via stdin and output them via stdout. Just keep in mind that all other Hadoopy-ness exists, e.g. the FileInputFormat, HDFS, and Job Scheduling, all you get to replace is the mapper and reducer, but this is enough to include NLTK, so you’re off and running!

Here’s an example of a Mapper and Reducer to get you started doing token counts with NLTK (note that these aren’t word counts -- to computational linguists, words are language elements that have senses and therefore convey meaning. Instead, you’ll be counting tokens, the syntactic base for words in this example, and you might be surprised to find out what tokens are-- trust me, it isn’t as simple as splitting on whitespace and punctuation!).

### mapper.py


#!/usr/bin/env python

import sys
from nltk.tokenize import wordpunct_tokenize

for line in file:
# split the line into tokens
yield wordpunct_tokenize(line)

def main(separator='\t'):
# input comes from STDIN (standard input)
for tokens in data:
# write the results to STDOUT (standard output);
# what we output here will be the input for the
# Reduce step, i.e. the input for reducer.py
#
# tab-delimited; the trivial token count is 1
for token in tokens:
print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
main()

### reducer.py


#!/usr/bin/env python

from itertools import groupby
from operator import itemgetter
import sys

for line in file:
yield line.rstrip().split(separator, 1)

def main(separator='\t'):
# input comes from STDIN (standard input)
# groupby groups multiple word-count pairs by word,
# and creates an iterator that returns consecutive keys and their group:
#   current_word - string containing a word (the key)
#   group - iterator yielding all ["<current_word>", "<count>"] items
for current_word, group in groupby(data, itemgetter(0)):
try:
total_count = sum(int(count) for current_word, count in group)
print "%s%s%d" % (current_word, separator, total_count)
except ValueError:
# count was not a number, so silently discard this item
pass

if __name__ == "__main__":
main()

### Running the Job


-file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
-file /home/hduser/reducer.py   -reducer /home/hduser/reducer.py \
-input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

# The "Foo" of Big Data - Part 2

Welcome to Part 2 of this epic Big Data and Natural Language Processing perspective series. Here is the intro and part one if you missed any of them.

Domain knowledge is incredibly important, particularly in the context of stochastic methodologies, and particularly in NLP. Not all language, text, or domains have the same requirements, and there is no way to make a universal model for it. Consider how the language of doctors and lawyers may be outside our experience in the language of computer science or data science. Even more generally, regions tends to have specialized dialects or phrases even within the same language. As an anthropological rule, groups tend to specialize particular language features to communicate more effectively, and attempting to capture all of these specializations leads to ambiguity.

This leads to an interesting hypothesis: the foo of big data is to combine specialized domain knowledge with an interesting data set.

Further, given that domain knowledge and an interesting data set or sets:

1. Form hypothesis (a possible data product)
2. Mix in NLP techniques and machine learning tools
3. Perform computation and test hypothesis
4. Add to data set and domain knowledge
5. Iterate

If this sounds like the scientific method, you’re right! This is why it’s cool to hire PhDs again; in the context of Big Data, science is needed to create products. We’re not building bridges, we’re testing hypotheses, and this is the future of business.

But this alone is not why Big Data is a buzzword. The magic of big data is that we can iterate through the foo extremely rapidly and  multiple hypotheses can be tested via distributed computing techniques in weeks instead of years or ever shorter time periods. There is currently a surplus of data and domain knowledge, so everyone is interested in getting their own piece of data real estate, that is, getting their domain knowledge and their data set. The demand is rising to meet the surplus, and as a result we’re making a lot of money. Money means buzzwords. This is the magic of Big Data!

# Big Data and Natural Language Processing - Part 1

We hope you enjoyed the introduction to this series, part 1 is below.

“The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited.” -Ferdinand de Saussure

Language is dynamic - trying to create rules to capture the full scope of language (e.g. grammars) fails because of how rapidly language changes. Instead, it is much easier to learn from as many examples as possible and guess the likelihood of the meaning of language; this, after all, is what humans do. Therefore Natural Language Processing and Computational Linguistics are stochastic methodologies, and a subset of artificial intelligence that benefits from Machine Learning techniques.

Machine Learning has many flavors, and most of them attempt to get at the long tail -- e.g. the low frequency events where the most relevant analysis occurs. To capture these events without resorting to some sort of comprehensive smoothing, more data is required, indeed the more data the better. I have yet to observe a machine learning discipline that complained of having too much data. (Generally speaking they complain of having too much modeling -- overfit). Therefore the stochastic approach of NLP needs Big Data.

The flip side of the coin is not as straightforward. We know there are many massive natural language data sets on the web and elsewhere. Consider tweets, reviews, job listings, emails, etc. These data sets fulfil the three V’s of Big Data: velocity, variety, and volume. But do these data sets require comprehensive natural language processing to produce interesting data products?

The answer is not yet. Hadoop and other tools already have build in text processing support. There are many approaches being applied to these data sets, particularly inverted indices, co-location scores, even N-Gram modeling. However, these approaches are not true NLP -- they are simply search. They leverage string and lightweight syntactic analysis to perform frequency analyses.

We have not yet exhausted all opportunities to perform these frequency analyses -- many interesting results, particularly in clustering, classification and authorship forensics, have been explored. However, these approaches will soon start to fail to produce the more interesting results that users are coming to expect. Products like machine translation, sentence generation, text summarization, and more meaningful text recommendation will require strong semantic methodologies, and eventually Big Data will come to require NLP, it’s just not there yet.

# Natural Language Processing and Big Data: Using NLTK and Hadoop - Talk Overview

My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children’s literature. Our approach started as a natural language processing problem -- designed to pull out language features to train our algorithms, and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk done at the April Data Science DC meetup.

Think of this post as the Cliff Notes of the talk and the upcoming series of posts so you don’t have to read every word ... but trust me, it's worth it.

Related to the interaction between Big Data and NLP:

• Natural Language Processing needs Big Data
• Big Data doesn’t need NLP... yet.

Related to using Hadoop and NLTK:

• The combination of NLTK and Hadoop is perfect for prepossessing raw text
• More semantic analysis tend to be graph problems that Map Reduce isn’t great at computing.

• The foo of Big Data is the ability to take domain knowledge and a data set (or sets) and iterate quickly through hypotheses using available tools (NLP)
• The magic of big data is that there is currently a surplus of both data and knowledge and our tools are working, so it’s easy to come up with a data product (until demand meets supply).

I'll go over each of these points in detail as I did in my presentation, so stay tuned for the longer version [editor: so long that it has been broken up into multiple posts]

# A Survey of Stochastic and Gazetteer Based Approaches for Named Entity Recognition - Part 2

This is part 2 of a two-part series. Part 1 is here.

# Approaches to Named Entity Recognition

Generally speaking, the most effective named entity recognition systems can be categorized as rule-based, gazetteer and machine learning approaches. Within each of these approaches are a myriad of sub-approaches that combine to varying degrees each of these top-level categorizations. However, because of the research challenge posed by each approach, typically one or the other is focused on in the literature.

Rule-based systems utilize pattern-matching techniques in text as well as heuristics derived either from the morphology or the semantics of the input sequence. They are generally used as classifiers in machine-learning approaches, or as candidate taggers in gazetteers. Some applications can also make effective use of stand-alone rule-based systems, but they are prone to both overreach and skipping over named entities. Rule-based approaches are discussed in (10), (12), (13), and (14).

Gazetteer approaches make use of some external knowledge source to match chunks of the text via some dynamically constructed lexicon or gazette to the names and entities. Gazetteers also further provide a non-local model for resolving multiple names to the same entity. This approach requires either the hand crafting of name lexicons or some dynamic approach to obtaining a gazette from the corpus or another external source. However, gazette based approaches achieve better results for specific domains. Most of the research on this topic focuses on the expansion of the gazetteer to more dynamic lexicons, e.g. the use of Wikipedia or Twitter to construct the gazette. Gazette based approaches are discussed in (15), (16), and (17).

Stochastic approaches fare better across domains, and can perform predictive analysis on entities that are unknown in a gazette. These systems use statistical models and some form of feature identification to make predictions about named entities in text. They can further be supplemented with smoothing for universal coverage. Unfortunately these approaches require large amounts of annotated training data in order to be effective, and they don’t naturally provide a non-local model for entity resolution. Systems implemented with this approach are discussed in (7), (8), (4), (9), and (6).

# 1) Rule Based Annotation

The first conceit in tackling named entity recognition is to look for clues within the structure and grammar of the text that indicate to readers (and therefore hopefully also systems) that the words and tokens refer to some named entity. Aside from the obvious issues with transcription, spelling, and unique formatting – it turns out that pattern matching is fairly successful at detecting those entities, if only in formal corpora like newsprint. Application of a series of rules as a preprocessing step to reduce the complexity of other techniques is widely used and is exemplified in 10, a rule-based NER annotation system that was supplemented with a machine-learning component to facilitate discovery of domain specific names.

Nadeau et al provide a table of word-level features that may indicate named-entities in (19). These features include case – if the word is capitalized or all capital letters (or indeed, includes a mixed case), punctuation, morphology, and the part of speech. Features like these, especially capitalization, form the basis for most heuristics. Word level features can then be combined with contextual features to further identify names.

So called “sure-fire” rules are described in (12), which rely on contextual indicators either prepended or appended to a series of capitalized tokens, whose part of speech is unknown or are generically referenced as nouns. These can take the form of titles – Mr., Mrs., or Dr. to name a few that are commonly prepended to names, or Jr., MD, and III, commonly appended. Corporate designators such as Ltd, LLC, or Bank can similarly identify organizations, and address designators like rue, rd., or strasse can identify particular locations.

These grammatical indicators can then be further extended to a series of rules that rely on part of speech tags or positional patterns. For instance, first names are generally easy to identify via a lexicon of proper names, and a capitalized word that follows a first name is most likely a last name, which tend to be more unique. Other rules require inspection of the context surrounding the possible named entity. In the sentence “... purchased 100 shares of (Tokens+) ...” where (Tokens+) is a series of capitalized words most likely the name of a publically traded company. Alternatively, prepositions and other parts of speech can also help us identify locations—for example, Frederick can be either a name or a location (a city in Western Maryland) but use of the phrase “in Frederick” would indicate that it was a location.

# 2) Learning and Stochastic Approaches

Out of sixteen participants at the CoNLL-2003 shared task, all employed some form of machine learning (1). According to the paper, most used a maximum entropy model, others used a statistical learning method, and still others used Hidden Markov Models. Tjong Kim Sang and De Meulder even go so far as to say that because Maximum Entropy was used in both isolation and combined with other systems, it seems to be a good choice for the named entity recognition task.

As for the performance evaluations, a combined system approach faired the best, namely a Classifier Combination approach as in (7). This approach employed a simple combination of four classifiers with equal voting, namely a Robust Risk Minimization Classifier, a Maximum Entropy Classifier, a Transformation-Based Learning Classifier and a Hidden Markov Model Classifier. However, the approach in (8) used only a Maximum Entropy learner and scored extremely closely to the combined approach, and many of the top scorers used Hidden Markov Models in combination with some other classifier, as in (20), therefore these approaches are the ones this paper will focus on.

## Maximum Entropy Classification

Maximum Entropy estimates probability solely on the basis of the constraints derived from training data in an attempt to simplify the estimation by making as few assumptions as possible; in other words, attempting to select the best probability distribution with the information available to the learner. In (8), this is expressed as the following:

Where p(ci|s,D) is the probability of a classifier given a sequence of tokens and the document that the sequence appears in. Note that D is used to perform NER with global information across the document, a variation on the Maximum Entropy approach that improved Chieu and Ng’s performance over other statistical methods. However, as a general discussion of Maximum Entropy classification, it is not needed here. Dynamic programming algorithms are then used to select word sequences with the highest probabilities.

Classifiers are trained through a series of non-contextual features (previously called word-level features in the rule-based approach), lexical features, morphological features, and global features. For example, non-contextual features include case and zone (the location of the token in the sequence), string information (e.g. the string contains all capital letters and a period), and first word (whether or not the word is at the beginning of a sentence). Lexical features include token frequencies, the existence of the token in the vocabulary, and common names. Global features can be taken from a Gazette or from information from the rest of the document.

## Chunked Tagging with Hidden Markov Models

Hidden Markov Models seem to be naturally applied to named entity recognition annotation because the act of tagging named entities seems closely related to the act of part of speech tagging—and these systems have been successful using HHMs, exemplified in (21). NER, like part of speech tagging, is a classification problem that involves an inherent state sequence representation. However, unlike in part of speech tagging, the state itself is a sequence of states that needs to be coordinated by some sort of chunking. Zhou and Su tackled this problem in (4) through the following, familiar equation:

In order to find the token sequence (G = g1, ... , gn) that maximizes the probability of a tag sequence (T = t1, ... , tn). This allows the chunking program to generate original named entity tags from the original text, and as with other approaches that use Bayes rule, we can calculate P(gi|ti) using n-gram frequencies in a training set, and P(T1n) by similarly counting an annotated training set. Log probabilities are used here to prevent underflow. Hidden Markov Models rely heavily on training data to both acquire sequence boundary features (although many systems in the literature use a combined approach with classifiers similar to those described in the maximum entropy approach). They also must use smoothing to include untrained data.

# 3) Gazetteer Based Approaches

Stochastic methodologies for named entity recognition provide candidates for annotation, and to a certain extent, can list the likelihood of a candidate belonging to a category or subcategory of a named entity. However, this is not a complete solution as machine-learning approaches require knowledge in order to propose candidacy for untrained tokens, and further, after candidates have been proposed, knowledge is required to classify them. Additionally, other issues may be resolved by gazettes, for instance non-local dependencies can be listed in gazettes to indicate cross-references between various names indicating the same entity.

Gazettes, therefore, are utilized to supply external knowledge to learners, or to supply unannotated data with a training source. In fact, most of the teams in the CoNLL-2003 Shared Task made use of a gazette (1), and gazettes have been described as the bottleneck in many learning systems (12) due to the elasticity and rapid evolution and expansion of names as unique descriptions for entities. Therefore, research has been directed toward the development of gazettes as lexicons of named entities rather than towards the specific application of gazettes in a named entity system.

This does not mean, however, that gazette-only systems do not exist. However, because of the overhead of gazette-based look-up (many require a query to the Internet), filtering of corpus tokens is required to generate named entity candidates. Systems described in (15), (16), and (19) use a combination of rule-based mechanisms, part of speech tags, and word frequency analysis to propose these candidates, with no machine learning approach.

## Wiki Based Gazetteer Approaches

Looking to the Internet as a source for knowledge that both humans and computers can understand poses its own design challenges. Kazama and Torisawa propose that the advent of open collaborative encyclopedias on the web, like Wikipedia, organize external knowledge for gazette construction in (15). They make the point that since Wikipedia’s articles are intended to be encyclopedic, most articles are about named entities and are more structured than raw text. Therefore, named entity candidates can be resolved by a simple web search to Wikipedia.

The Wikipedia article can then be mined for classification information. Kazama and Torisawa use the first sentence to locate the object of the verb “to be.” Richman and Schone utilize the Wikipedia categories attached to the article for classification in their attempt to mine resources for multilingual named entity recognition (16) in order to avoid the use of semantic parsing. Generally speaking, the beginning of the article on an encyclopedic entry can be used as a contextual signature to categorize the named entity. Further, it can be used to provide non-local dependency analysis.

For example the Wikipedia result for “Mekong Delta” contains the contextual clues “region”, “river”, “sea”, “distributaries”, “Vietnam”, and “water” that could lead to the classification for water based location. Salman Rushdie’s article contains the signature words “novelist” and “essayist” in order to classify this entity as a writer (person). This technique of classification is similar to the word disambiguation techniques used in the Lesk algorithms.

Wikipedia also provides for non-local dependency analysis through two unique features. First, each article concerns only one particular named entity. Therefore any referential names found in the article can be said to identify that particular entity. Further, Wikipedia provides as a tool a page called “disambiguation” that will present the most frequent searches first, along with categorization. For instance, a search for “Cambridge (disambiguation)” will reveal several geographic place names in England, the United States, Canada, and Australia. It will also expose the University of Cambridge as another common use of that name. Mining contextual clues in these Wikipedia articles for a signature, and comparing them to the signature of the input provides not only the correct categorization, but also a source for variations on the name for use in cross-referencing.

Both (15) and (16) scored well on the CoNLL-2003 evaluation scale. Kazama et al. published their results with an F-Score of 88.02, which compares very well to the best performing stochastic methods. They attribute the increase in score from the baseline as a result of improved precision of results rather than improved recall. However, neither of these methods supplied details of the performance cost associated with a Wikipedia gazette based approach.

# 4) Mechanical Turk Approaches

One interesting approach to named-entity knowledge acquisition deserves a small mention here. Lawson et al. made use of Amazon’s Mechanical Turk service to annotate a data set of emails with named entities in (22). The Mechanical Turk is a low cost way of providing human labor to a repetitive task by paying some small amount of money for a small work unit. This paper demonstrates the difficulties in having human annotated training data available for statistical or gazette based analysis.

For even the seemingly simple task of identifying the categories for PERSON, LOCATION, or ORGANIZATION in emails of no more than 60 to 900 characters, Lawson et al. found that they could not pay a flat rate per email, but instead had to set up a reward mechanism in which each discovered entity paid off (otherwise annotators simply checked no-entries to game the system). Although the authors argue that they were able to use inter-annotator agreement to assign a particular number of workers to the same task to reduce costs, it seems that only a portion of the annotators provided significant contribution as 12.5% of the workers completed 74.9% of the annotation! Although the final annotated data set was of a high quality, and a low economic cost, the dataset took more than four months to build using this methodology.

# Conclusion

This paper explored the topic of named entity recognition, particularly as described in the CoNLL-2003 shared task: language-independent named entity recognition of organizations, locations, and people. Several of the pitfalls associated with named entity recognition systems were discussed, including chunking and text representation for multi-word names, inference for resolving name ambiguity, cross-referencing with non-local dependencies, and the use of external knowledge in NER systems even though unique names provide a lexical challenge.

Several approaches to named entity recognition were discussed including rule-based systems, machine-learning systems and gazetteer approaches. Rule based systems provide an advantage of not requiring significant computation, but have low accuracy and recall. These systems are often used in coordination with gazette based and machine learning systems. Stochastic methods focus on probabilistic modeling and can cover new tokens and decide if sequences of tokens belong together, however they require a large training set to operate, as well as a smoothing implementation to capture universal knowledge. Finally, gazetteer, or dynamic lexical generation, was discussed in accordance with named entity recognizers. This approach performs well at discovering non-local dependencies as well as disambiguating multiple name senses. However, candidates must be proposed for gazette look-up, and some overhead is incurred in accessing the gazette.

For now, combinations of machine learning and gazetteer systems, which use rule-based classifiers for features and candidate proposal, are the best performing systems. Further research needs to be performed on multi-stage approaches to better integrate the two methodologies.

# References

(1) Erik F. Tjong Kim Sang and De Meulder Fien, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 142-147. (2) Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson, "1999 Named Entity Recognition Task Definition," MITRE and SAIC, 1999. (3) Lev Ratinov and Dan Roth, "Design Challenges and Misconceptions in Named Entity Recognition," in CoNLL '09 Proceedings of the 13th Conference on Computational Natural Language Learning, Stroudsburg, PA, 2009, pp. 147-155. (4) GuoDong Zhou and Jian Su, "Named Entity Recognition Using an HMM-Based Chunk Tagger," in ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, 2002, pp. 473-480. (5) Jenny Rose Finkel and Christopher D. Manning, "Joint Parsing and Named Entity Recognition," in NAACL '09 Proceedings of Human Language Technologies, Stroudsburg, Pa, 2009, pp. 326-334. (6) Hirotaka Funayama, Tomohide Shibata, and Sadao Kurohashi, "Bottom-Up Named Entity Recognition Using a Two-Stage Machine Learning Method," in MWE '09 Proceedings of the Workship on Multiword Expressions: Identification, Interpretation, Disambiguation, and Applications, Stroudsburg, PA, 2009, pp. 55-62. (7) Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang, "Named Entity Recognition Through Classifier Combination," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 168-171. (8) Hai Leong Chieu and Hwee Tou Ng, "Named Entity Recognition: A Maximum Entropy Approach Using Global Information," in COLING '02 Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, Stroudsburg, PA, 2002, pp. 1-7. (9) Andrew McCallum and Wei Li, "Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction, and Web-Enhanced Lexicons," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 188-191. (10) Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, and Shivakumar Vaithyanathan, "Domain Adaption of Rule-Based Annotators for Named Entity Recognition Tasks," in EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, 2010, pp. 1002-1012. (11) Vijay Krishnan and Christopher D. Manning, "An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition," in ACL-44 Proceedings of the 21st International Conference on Computational Linguistics, Stroudsburg, PA, 2006, pp. 1121- 1128. (12) Andrei Mikheev, Marc Moens, and Claire Grover, "Named Entity Recognition Without Gazetteers," in EACL '99 Proceedings of the 9th Conference on the European Chapter of the Association for Computational Linguistics, Stroudsburg, PA, 1999, pp. 1-8. (13) Silviu Cucerzan and David Yarowsky, "Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence," in Joint SIGDAT Conference on EMNLP and VLC, 1999. (14) Dimitra Farmakioutou et al., "Rule-Based Named Entity Recognition for Greek Financial Texts," in Proceedings of the Workshop on Computational Lexicography and Multimedia Dictionaries, 2000, pp. 75-78. (15) Jun'ichi Kazama and Kentaro Torisawa, "Exploiting Wikipedia as External Knowledge for Named Entity Recognition," in Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007, pp. 698-707. (16) Alexander E. Richman and Patrick Schone, "Mining Wiki Resources for Multilingual Named Entity Recognition," in Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics: Human Language Technologies, Stroudsburg, PA, 2008, pp. 1-9. (17) Alan Ritter, Sam Clark, Mausam, and Oren Etzioni, "Named Entity Recognition in Tweets: An Experimental Study," in EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, 2011, pp. 1524-1534. (18) Nancy Chinchor, "MUC-7 Named Entity Task Definition," in Message Understanding Conference. (19) David Nadeau and Satoshi Sekine. (2008, Jan.) A Survey of Named Entity Recognition and Classification. (Online). http://nlp.cs.nyu.edu/sekine/papers/li07.pdf (20) Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning, "Named Entity Recognition with Character-Level Models," in Proceedings of CoNLL-2003, 2003, pp. 180- 183. (21) Scott M Thede and Mary P Harper, "A Second-Order Hidden Markov Model for Part-of- Speech Tagging," in Proceedings of the 37th annual meeting of ACL, Stradsbourg, PA, 1999, pp. 175-182. (22) Nolan Lawson, Kevin Eustice, Mike Perkowitz, and Meliha Yetisgen-Yildiz, "Annotating Large Email Datasets for Named Entity Recognition with Mechanical Turk," in CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speach and Language Data with Amazon's Mechanical Turk, Stroudsburg, PA, 2010, pp. 71-79.

# Abstract

The task of identifying proper names of people, organizations, locations, or other entities is a subtask of information extraction from natural language documents. This paper presents a survey of techniques and methodologies that are currently being explored to solve this difficult subtask. After a brief review of the challenges of the task, as well as a look at previous conventional approaches, the focus will shift to a comparison of stochastic and gazetteer based approaches. Several machine-learning approaches are identified and explored, as well as a discussion of knowledge acquisition relevant to recognition. This two-part white paper will show that applications that require named entity recognition will be served best by some combination of knowledge- based and non-deterministic approaches.

# Introduction

In school we were taught that a proper noun was “a specific person, place, or thing,” thus extending our definition from a concrete noun. Unfortunately, this seemingly simple mnemonic masks an extremely complex computational linguistic task—the extraction of named entities, e.g. persons, organizations, or locations from corpora (1). More formally, the task of Named Entity Recognition and Classification can be described as the identification of named entities in computer readable text via annotation with categorization tags for information extraction.

Not only is named entity recognition a subtask of information extraction, but it also plays a vital role in reference resolution, other types of disambiguation, and meaning representation in other natural language processing applications. Semantic parsers, part of speech taggers, and thematic meaning representations could all be extended with this type of tagging to provide better results. Other, NER-specific, applications abound including question and answer systems, automatic forwarding, textual entailment, and document and news searching. Even at a surface level, an understanding of the named entities involved in a document provides much richer analytical frameworks and cross-referencing.

Named entities have three top-level categorizations according to DARPA’s Message Understanding Conference: entity names, temporal expressions, and number expressions (2). Because the entity names category describes the unique identifiers of people, locations, geopolitical bodies, events, and organizations, these are usually referred to as named entities and as such, much of the literature discussed in this paper focuses solely on this categorization, although it is easy to imagine extending the proposed systems to cover the full MUC-7 task. Further, the CoNLL-2003 Shared Task, upon which the standard of evaluation for such systems is based, only evaluates the categorization of organizations, persons, locations, and miscellaneous named entities. For example:

(ORG S.E.C.) chief (PER Mary Shapiro) to leave (LOC Washington) in December.

This sentence contains three named entities that demonstrate many of the complications associated with named entity recognition. First, S.E.C. is an acronym for the Securities and Exchange Commission, which is an organization. The two words “Mary Shapiro” indicate a single person, and Washington, in this case, is a location and not a name. Note also that the token "chief" is not included in the person tag, although it very well could be. In this scenario, it is ambiguous if "S.E.C. chief Mary Shapiro" is a single named entity, or if multiple, nested tags would be required.

# Named Entity Recognition Challenges

Some key design decisions in an NER system are proposed in (3) that cover the requirements of NER in the example sentence above:

• Chunking and text representation
• Inference and ambiguity resolution algorithms
• Modeling of Non-Local dependencies
• Implementation of external knowledge resources and gazetteers

Named entities are often not simply singular words, but are chunks of text, e.g. University of Maryland Baltimore County or The Central Bank of Australia. Therefore, some chunking or parsing prediction model is required to predict whether a group of tokens belong in the same entity. Left to right decoding, Viterbi, and beam search algorithms has been employed as chunking algorithms in the literature. Further, some NER systems are comprised primarily of text parsers as in (4), (5), and (6).

Inference refers to the ability of a system to determine that a chunk is actually a named entity, or, sometimes more importantly, to determine the classification of a named entity, especially in places where there is ambiguity. For example “Washington” might refer to either a name or a location. “Galaxy” might refer to a generic noun or the professional major league soccer team. Maximum Entropy Models, Hidden Markov Models and other statistical methods are employed to perform this analysis, usually implemented as a machine-learning system, as for instance in (7), (8), and (9).

Non-local dependency models refer to the ability to identify multiple tokens that should have the same label assignment or cross-reference. It is important to note that case becomes important here—e.g. Bengfort, bengfort, and BENGFORT should all be identified as the same entity, and these would break word-level rule based systems (e.g. find all words that are capitalized). But further, even different chunks should be identified similarly – UMBC vs. University of Maryland Baltimore County or the inclusion of titles in one location that are absent from another as in President Barack Obama vs. Obama. The non-locality of these models refers to the usage of these terms outside the scope of a sequence of tokens that is being analyzed together (usually a sentence), but rather in the entire document or corpus. The papers that address non-local dependencies include (10) and (11), but the focus of this paper will be on solutions that require external knowledge.

Names, because they uniquely identify entities, are a domain not easily captured by even the most expansive lexicons. For example, simply creating a list of the names of all companies formed in the United States would expand dramatically every single year. However, external knowledge and name lexicons are required for many of the approaches and solutions, not just non-local dependency models. Therefore the construction and use of gazetteers and other resources is necessary.

# Evaluation of NERC Systems

Throughout the literature on Named Entity Recognition and Classification systems, two seminal conferences and evaluations are mentioned when evaluating systems: the 1997 Message Understanding Conference (MUC-7) 18 and the CoNLL-2003 Shared Task 1. Especially since 2003, many NERC systems use the CoNLL-2003 evaluation to demonstrate the performance of their work. This evaluation specifies a data set containing a training file, a development file, a test file, and a large amount of unannotated data in two languages, English and German, taken from news articles corpora. All learning methods were trained on the training file, and tested on the test data. The development file could be used for tuning the parameters of the system. The data was preprocessed using a tokenizer, chunker and part of speech-tagger, with each token on a single line, and sentence boundaries represented by a single blank line. Named entities were tagged by hand at the University of Antwerp. Performance was measured using an F-Score with β=1 simplified to the harmonic mean:

Where precision is the ratio of correct results to results returned and recall is the ratio of correct results to the number of results that should have been returned. The baseline result was computed for a system that only identified entities with a unique class in the training data, and for CoNLL-2003 was 59.61 ± 1.2. The highest performing system was (7) with an F-Score of 88.76 ± 0.7, followed closely by (8) with an F-Score of 88.31 ± 0.7. However, the lowest performing systems still exceeded the baseline result.

Stay tuned for part 2 tomorrow!

# References

(1) Erik F. Tjong Kim Sang and De Meulder Fien, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 142-147. (2) Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson, "1999 Named Entity Recognition Task Definition," MITRE and SAIC, 1999. (3) Lev Ratinov and Dan Roth, "Design Challenges and Misconceptions in Named Entity Recognition," in CoNLL '09 Proceedings of the 13th Conference on Computational Natural Language Learning, Stroudsburg, PA, 2009, pp. 147-155. (4) GuoDong Zhou and Jian Su, "Named Entity Recognition Using an HMM-Based Chunk Tagger," in ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, 2002, pp. 473-480. (5) Jenny Rose Finkel and Christopher D. Manning, "Joint Parsing and Named Entity Recognition," in NAACL '09 Proceedings of Human Language Technologies, Stroudsburg, Pa, 2009, pp. 326-334. (6) Hirotaka Funayama, Tomohide Shibata, and Sadao Kurohashi, "Bottom-Up Named Entity Recognition Using a Two-Stage Machine Learning Method," in MWE '09 Proceedings of the Workship on Multiword Expressions: Identification, Interpretation, Disambiguation, and Applications, Stroudsburg, PA, 2009, pp. 55-62. (7) Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang, "Named Entity Recognition Through Classifier Combination," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 168-171. (8) Hai Leong Chieu and Hwee Tou Ng, "Named Entity Recognition: A Maximum Entropy Approach Using Global Information," in COLING '02 Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, Stroudsburg, PA, 2002, pp. 1-7. (9) Andrew McCallum and Wei Li, "Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction, and Web-Enhanced Lexicons," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 188-191. (10) Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, and Shivakumar Vaithyanathan, "Domain Adaption of Rule-Based Annotators for Named Entity Recognition Tasks," in EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, 2010, pp. 1002-1012. (11) Vijay Krishnan and Christopher D. Manning, "An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition," in ACL-44 Proceedings of the 21st International Conference on Computational Linguistics, Stroudsburg, PA, 2006, pp. 1121- 1128.