The task of identifying proper names of people, organizations, locations, or other entities is a subtask of information extraction from natural language documents. This paper surveys techniques and methodologies currently being explored to solve this difficult subtask. After a brief review of the challenges of the task and a look at previous conventional approaches, the focus shifts to a comparison of stochastic and gazetteer-based approaches. Several machine-learning approaches are identified and explored, along with a discussion of knowledge acquisition relevant to recognition. This two-part white paper will show that applications requiring named entity recognition are best served by some combination of knowledge-based and non-deterministic approaches.
In school we were taught that a proper noun is "a specific person, place, or thing," extending the definition of a concrete noun. Unfortunately, this seemingly simple mnemonic masks an extremely complex computational linguistic task: the extraction of named entities, e.g. persons, organizations, or locations, from corpora (1). More formally, the task of Named Entity Recognition and Classification (NERC) can be described as the identification of named entities in computer-readable text via annotation with categorization tags for information extraction.
Not only is named entity recognition a subtask of information extraction, but it also plays a vital role in reference resolution, other types of disambiguation, and meaning representation in other natural language processing applications. Semantic parsers, part-of-speech taggers, and thematic meaning representations could all be extended with this type of tagging to provide better results. Other NER-specific applications abound, including question-answering systems, automatic forwarding, textual entailment, and document and news search. Even at a surface level, an understanding of the named entities involved in a document provides much richer analytical frameworks and cross-referencing.
Named entities have three top-level categorizations according to DARPA's Message Understanding Conference: entity names, temporal expressions, and number expressions (2). Because the entity names category describes the unique identifiers of people, locations, geopolitical bodies, events, and organizations, these are what is usually meant by "named entities," and much of the literature discussed in this paper focuses solely on this categorization, although it is easy to imagine extending the proposed systems to cover the full MUC-7 task. Further, the CoNLL-2003 Shared Task, on which the standard evaluation for such systems is based, only evaluates the categorization of organizations, persons, locations, and miscellaneous named entities. For example:
(ORG S.E.C.) chief (PER Mary Shapiro) to leave (LOC Washington) in December.
This sentence contains three named entities that demonstrate many of the complications associated with named entity recognition. First, S.E.C. is an acronym for the Securities and Exchange Commission, which is an organization. The two words "Mary Shapiro" indicate a single person, and Washington, in this case, is a location and not a person's name. Note also that the token "chief" is not included in the person tag, although it very well could be. In this scenario, it is ambiguous whether "S.E.C. chief Mary Shapiro" is a single named entity or whether multiple, nested tags would be required.
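In practice, such annotations are usually serialized token by token rather than with inline parentheses. As a purely illustrative sketch (the BIO-style tag inventory below follows the CoNLL-2003 categories; the alignment is invented for this example, not drawn from any corpus):

```python
# Token-level representation of the example sentence, one tag per token.
# B- marks the beginning of an entity, I- its continuation, O non-entities.
tokens = ["S.E.C.", "chief", "Mary",  "Shapiro", "to", "leave",
          "Washington", "in", "December", "."]
tags   = ["B-ORG",  "O",     "B-PER", "I-PER",   "O",  "O",
          "B-LOC",  "O",     "O",        "O"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note that under this flat scheme "chief" must be tagged either inside or outside the person entity; the encoding itself cannot express the nested reading.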
Named Entity Recognition Challenges
Reference (3) proposes some key design decisions in an NER system, which cover the requirements illustrated by the example sentence above:
- Chunking and text representation
- Inference and ambiguity resolution algorithms
- Modeling of Non-Local dependencies
- Implementation of external knowledge resources and gazetteers
Named entities are often not simply singular words, but chunks of text, e.g. University of Maryland Baltimore County or The Central Bank of Australia. Therefore, some chunking or parsing prediction model is required to predict whether a group of tokens belongs to the same entity. Left-to-right decoding, Viterbi, and beam search algorithms have been employed as chunking algorithms in the literature. Further, some NER systems consist primarily of text parsers, as in (4), (5), and (6).
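The output side of this chunking step can be made concrete with a small decoder that groups BIO-tagged tokens back into multi-token entities. This is an illustrative Python sketch, not code from any of the cited systems:

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                       # close any open chunk
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)             # continue the open chunk
        else:
            if current:                       # O tag (or bad I-) closes the chunk
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["University", "of", "Maryland", "Baltimore", "County", "opened", "."]
tags   = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
print(bio_to_chunks(tokens, tags))
# [('University of Maryland Baltimore County', 'ORG')]
```

The harder problem, of course, is predicting the tags in the first place; that is where the decoding algorithms named above come in.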
Inference refers to the ability of a system to determine that a chunk is actually a named entity or, sometimes more importantly, to determine the classification of a named entity, especially where there is ambiguity. For example, "Washington" might refer to either a person's name or a location, and "Galaxy" might refer to a generic noun or to the Major League Soccer team. Maximum Entropy Models, Hidden Markov Models, and other statistical methods are employed to perform this analysis, usually implemented as a machine-learning system, as for instance in (7), (8), and (9).
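To make the flavor of this statistical inference concrete, the following is a toy Viterbi decoder over a hand-built HMM. The states, transition, and emission probabilities are invented for this illustration and bear no relation to the actual models in (7), (8), or (9):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely tag sequence under an HMM, computed in log space."""
    # Initialize: start probability times emission of the first observation.
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-12)), [s])
          for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            # Best previous state for reaching state s at position t.
            score, path = max(
                (V[t - 1][prev][0] + math.log(trans_p[prev][s])
                 + math.log(emit_p[s].get(obs[t], 1e-12)), V[t - 1][prev][1])
                for prev in states)
            V[t][s] = (score, path + [s])
    return max(V[-1].values())[1]

# Toy model: lowercased tokens, three tags; all numbers are made up.
states = ["LOC", "PER", "O"]
start = {"LOC": 0.3, "PER": 0.3, "O": 0.4}
trans = {"LOC": {"LOC": 0.2, "PER": 0.1, "O": 0.7},
         "PER": {"LOC": 0.1, "PER": 0.2, "O": 0.7},
         "O":   {"LOC": 0.3, "PER": 0.3, "O": 0.4}}
emit = {"LOC": {"washington": 0.5},
        "PER": {"washington": 0.4, "mary": 0.6},
        "O":   {"visited": 0.5, "in": 0.5}}

print(viterbi(["mary", "visited", "washington"], states, start, trans, emit))
# → ['PER', 'O', 'LOC']
```

Here "washington" is ambiguous on its own, but the context (a sentence-final token after a verb, given these toy probabilities) tips the decoder toward LOC.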
Non-local dependency models refer to the ability to identify multiple tokens that should receive the same label assignment or cross-reference. It is important to note that case becomes important here: Bengfort, bengfort, and BENGFORT should all be identified as the same entity, and such variants break word-level rule-based systems (e.g. "find all words that are capitalized"). Further, even different chunks should be identified as the same entity, as with UMBC vs. University of Maryland Baltimore County, or a title included in one location that is absent from another, as in President Barack Obama vs. Obama. The non-locality of these models refers to the use of these terms outside the scope of the sequence of tokens being analyzed together (usually a sentence), i.e. elsewhere in the document or corpus. Papers that address non-local dependencies include (10) and (11), but the focus of this paper will be on solutions that require external knowledge.
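One naive way to exploit document-level dependencies is a consistency pass that relabels each surface form with its majority label across the whole document, matching case-insensitively. This sketch is far simpler than the two-stage and rule-based approaches in (10) and (11) and is offered only to make the idea concrete:

```python
from collections import Counter, defaultdict

def enforce_document_consistency(labeled_tokens):
    """Relabel each surface form with its majority non-O label across the
    document, matching case-insensitively, so that Bengfort, bengfort,
    and BENGFORT all end up with the same tag."""
    votes = defaultdict(Counter)
    for token, label in labeled_tokens:
        if label != "O":
            votes[token.lower()][label] += 1
    return [(token, votes[token.lower()].most_common(1)[0][0]
             if token.lower() in votes else label)
            for token, label in labeled_tokens]

doc = [("Bengfort", "PER"), ("wrote", "O"), ("this", "O"),
       ("bengfort", "O"), ("BENGFORT", "PER")]
print(enforce_document_consistency(doc))
# The lowercased "bengfort", initially missed as O, is relabeled PER.
```

A real system would vote over chunks rather than single tokens and would weight confident predictions more heavily, but the document-wide scope is the essential point.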
Names, because they uniquely identify entities, form a domain not easily captured by even the most expansive lexicon. A list of the names of every company formed in the United States, for example, would grow dramatically every year. Yet external knowledge and name lexicons are required for many of the approaches and solutions, not just non-local dependency models, so the construction and use of gazetteers and other resources is necessary.
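A gazetteer is typically consulted with a greedy longest-match lookup over token spans, so that a multi-word name is preferred over its shorter prefixes. A minimal sketch, with gazetteer entries invented for the example:

```python
def gazetteer_match(tokens, gazetteer, max_len=5):
    """Greedy longest-match lookup of token spans against a gazetteer
    mapping entity strings to their categories."""
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, shrinking to a single token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in gazetteer:
                matches.append((span, gazetteer[span]))
                i += n          # skip past the matched span
                break
        else:
            i += 1              # no match starting here; advance one token
    return matches

gaz = {"University of Maryland Baltimore County": "ORG", "Washington": "LOC"}
tokens = "She left the University of Maryland Baltimore County for Washington".split()
print(gazetteer_match(tokens, gaz))
# [('University of Maryland Baltimore County', 'ORG'), ('Washington', 'LOC')]
```

Production systems usually store the gazetteer in a trie or hash-prefix structure so lookup does not rescan every span, but the matching policy is the same.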
Evaluation of NERC Systems
Throughout the literature on Named Entity Recognition and Classification systems, two seminal conferences and evaluations are mentioned when evaluating systems: the Seventh Message Understanding Conference (MUC-7) (2) and the CoNLL-2003 Shared Task (1). Especially since 2003, many NERC systems use the CoNLL-2003 evaluation to demonstrate the performance of their work. This evaluation specifies a data set containing a training file, a development file, a test file, and a large amount of unannotated data in two languages, English and German, taken from news article corpora. All learning methods were trained on the training file and tested on the test data; the development file could be used for tuning the parameters of the system. The data was preprocessed using a tokenizer, chunker, and part-of-speech tagger, with each token on a single line and sentence boundaries represented by a single blank line. Named entities were tagged by hand at the University of Antwerp. Performance was measured using an F-score with β = 1, which simplifies to the harmonic mean of precision and recall:
F1 = (2 × precision × recall) / (precision + recall)

where precision is the ratio of correct results to results returned, and recall is the ratio of correct results to the number of results that should have been returned. The baseline result was computed for a system that identified only entities with a unique class in the training data; for CoNLL-2003 it was 59.61 ± 1.2. The highest-performing system was (7), with an F-score of 88.76 ± 0.7, followed closely by (8) with an F-score of 88.31 ± 0.7. Even the lowest-performing systems exceeded the baseline result.
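As a concrete check of the metric, the score can be computed directly from counts; the counts below are hypothetical, chosen only to exercise the formula:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (F-score with beta = 1)."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical system: 80 correct entities out of 100 returned,
# against 120 entities in the gold annotation.
precision = 80 / 100   # 0.80
recall = 80 / 120      # ~0.67
print(round(100 * f1(precision, recall), 2))  # 72.73
```

Because the harmonic mean punishes imbalance, a system cannot buy a high F-score by returning everything (high recall, low precision) or by returning almost nothing (high precision, low recall).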
Stay tuned for part 2 tomorrow!
(1) Erik F. Tjong Kim Sang and Fien De Meulder, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition," in CoNLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 142-147.
(2) Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson, "1999 Named Entity Recognition Task Definition," MITRE and SAIC, 1999.
(3) Lev Ratinov and Dan Roth, "Design Challenges and Misconceptions in Named Entity Recognition," in CoNLL '09 Proceedings of the 13th Conference on Computational Natural Language Learning, Stroudsburg, PA, 2009, pp. 147-155.
(4) GuoDong Zhou and Jian Su, "Named Entity Recognition Using an HMM-Based Chunk Tagger," in ACL '02 Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, 2002, pp. 473-480.
(5) Jenny Rose Finkel and Christopher D. Manning, "Joint Parsing and Named Entity Recognition," in NAACL '09 Proceedings of Human Language Technologies, Stroudsburg, PA, 2009, pp. 326-334.
(6) Hirotaka Funayama, Tomohide Shibata, and Sadao Kurohashi, "Bottom-Up Named Entity Recognition Using a Two-Stage Machine Learning Method," in MWE '09 Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation, and Applications, Stroudsburg, PA, 2009, pp. 55-62.
(7) Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang, "Named Entity Recognition Through Classifier Combination," in CoNLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 168-171.
(8) Hai Leong Chieu and Hwee Tou Ng, "Named Entity Recognition: A Maximum Entropy Approach Using Global Information," in COLING '02 Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, Stroudsburg, PA, 2002, pp. 1-7.
(9) Andrew McCallum and Wei Li, "Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction, and Web-Enhanced Lexicons," in CoNLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 188-191.
(10) Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, and Shivakumar Vaithyanathan, "Domain Adaptation of Rule-Based Annotators for Named Entity Recognition Tasks," in EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, 2010, pp. 1002-1012.
(11) Vijay Krishnan and Christopher D. Manning, "An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition," in ACL-44 Proceedings of the 21st International Conference on Computational Linguistics, Stroudsburg, PA, 2006, pp. 1121-1128.