White Paper

Selling Data Science: Validation

We are all familiar with the phrase "We cannot see the forest for the trees," and it certainly applies to us as data scientists.  We can become so involved with what we're doing, what we're building, and the details of our work that we lose sight of what our work looks like to other people.  Often we want others to understand just how hard it was to do what we've done, just how much work went into it, and sometimes we're vain enough to want people to know just how smart we are.

So what do we do?  How do we validate one action over another?  Do we build the trees so others can see the forest?  Must others know the details to validate what we've built, or is it enough that they can make use of our work?

We are all made equal by our limitation to 24 hours in a day, and we must choose what we listen to and what we don't, what we focus on and what we don't.  The people who make use of our work must do the same.  A classic philosophical thought experiment asks, "If a tree falls in the woods and no one is around to hear it, does it make a sound?"  If we explain every detail of our work and no one gives the time to listen, will anyone understand?  To what will people give their time?

Let's suppose that we can successfully communicate all the challenges we faced and overcame in building our magnificent ideas (as if anyone would sit still that long).  What then?  Thomas Edison is famous for saying, "I have not failed. I've just found 10,000 ways that won't work," but today we buy lightbulbs that work; who remembers the details of all the ways he failed?  "It may be important for people who are studying the thermodynamic effects of electrical currents through materials."  Fine, it's important for that person to know the difference, but for the rest of us it's still not important.  We experiment, we fail, we overcome, and thereby validate our work so that others don't have to.

Better to teach a man to fish than to provide for him forever, but there are an infinite number of ways to fish successfully.  Some approaches may be nuanced in their differences, but others may be so wildly different that they're unrecognizable, unbelievable, and invite incredulity.  The catch is (no pun intended) that methods are valid because they yield measurable results.

It's important to catch fish, but success is neither consistent nor guaranteed, so groups of people may fish together and share their bounty so that everyone is fed.  What if someone starts using an unrecognizable and unbelievable method of fishing?  Will the others accept this "risk" and share their fish with those who won't use the "right" fishing technique, their technique?  Even if it works the first time, that may simply be a fluke, they say, and we certainly can't waste any more resources "risking" hungry bellies, now can we?

So does validation lie in the method or the results?  If you're going hungry you might try a new technique, or you might have faith in what's worked until the bitter end.  If a few people can catch plenty of fish for the rest, let the others experiment.  Maybe you're better at making boats, so both you and the fishermen prosper.  Perhaps there's someone else willing to share the risk because they see your vision, your combined efforts giving you both a better chance at validation.

If we go along with what others are comfortable with, they'll provide fish.  If we have enough fish for a while, we can experiment and potentially catch more fish in the long run.  Others may see the value in our experiments and provide us fish for a while until we start catching fish.  In the end we need fish, and if others aren't willing to give us fish we have to catch our own, using whatever method yields results.

BinaryPig: Scalable Static Binary Analysis Over Hadoop from Endgame

Below is a white paper previously published by Endgame, reproduced with permission. "Endgame uses data science to bring clarity to the cyber domain, allowing you to sense, discover, and act in real time."

Authors: Zachary Hanif, Telvis Calhoun, and Jason Trost {zach, tcalhoun, jtrost}@endgame.com

I. Abstract

Malware collection and analysis are critical to the modern Internet security industry. The rising tide of 'unique' malware samples, and the difficulty of storing and analyzing those samples in a scalable fashion, has been frequently decried in academic forums and industry conferences. Given the increasing volume of samples discovered in the wild, the need for an easily deployed, scalable architecture is becoming increasingly pronounced. Over the past 2.5 years Endgame received 20M samples of malware, equating to roughly 9.5 TB of binary data. In this, we're not alone. McAfee reports that it currently receives roughly 100,000 malware samples per day and received roughly 10M samples in the last quarter of 2012 [1]. Its total corpus is estimated to be about 100M samples. VirusTotal receives between 300k and 600k unique files per day, and of those roughly one-third to one-half are positively identified as malware [2]. Effectively combating and triaging malware is now a Big Data problem. Most malware analysis scripts are unprepared to process collections of malware once they surpass the 100GB scale, let alone terabytes. This paper describes the architecture of a system the authors built to overcome these challenges, one that uses Hadoop [13] and Apache Pig [14] to let analysts leverage pre-existing tools to analyze terabytes of malware at scale. We demonstrate this system by analyzing ~20M malware samples weighing in at 9.5 TB of binary data.

II. Introduction

Malware authors utilize multiple methods of obfuscating their executables to evade detection by automated anti-virus systems. Many malware authors utilize packing, compression, and encryption tools in an effort to defeat automated static analysis; authors continually pack and repack their samples and check them against automated scanning engines. These samples are shared between interested parties, thereby inflating the number of samples that require analysis. Such samples are relatively low value: many of them are never released into the wild, and moreover, since their core functionality has not been altered, analyzing them represents needless redundancy. For samples that are in the wild, autonomously spreading malware often carries algorithms that alter key elements of its executable payload before transmission and execution on a newly compromised system, in an effort to evade detection by network security sensors and other simpler protections. Honeypot systems such as Amun [3] and Dionaea [4] therefore collect huge numbers of samples that appear unique when compared using cryptographic file checksums, yet represent functionally identical software. Further complicating the issue are collected samples that represent false positives: samples that have mistakenly triggered a heuristic, or benign samples that were submitted to an online scanning engine.

BinaryPig hopes to provide a solution to the scalability problem represented by this influx of malware. It promotes rapid iteration and exploration through processing non-trivial amounts of malware at scale using easily developed scripts. It was also built to take advantage of pre-existing malware analysis scripts.

III. Prior Work

While there is no lack of published descriptions of malware analytic models and feature sets, to our knowledge there are no salient papers that discuss the architecture behind feature and data extraction from a static perspective [7], [8], [9]. Distributed data processing raises engineering concerns that are generally absent from centralized processing infrastructure, yet the majority of papers we explored while building this system focus on the results of derived analytics. We assume that the systems which coordinated the extraction, results and sample storage, and other distributed processing concerns were considered too simplistic, too specialized, or too ungainly for publication. Because those papers have nonetheless been published, the authors believe that while systems similar in function to BinaryPig exist, our offering represents something novel in the form of an extensible, free, open-source framework [10].

The most prominent papers on data extraction infrastructure come from a tangential area of research: malware dynamic analysis. There have been a number of excellent discussions of various methods of dynamic analysis at relatively large scales [8], [9], as well as work that lends itself to simply adding new analytical methods to an existing framework [18]. However, none of this work takes advantage of architecture that is likely to exist within an organization that deals with large sums of data; the extraction and storage infrastructure that drives many recently published papers appears to be unique to each of the research institutions.

IV. Design

We needed a system to address the scalability, reliability, and maintainability concerns of large-scale malware processing. We determined that Apache Hadoop met many of these requirements, but it had some shortcomings that we had to work around.

Developing native MapReduce code can be time-consuming even for highly experienced users. An early version of this framework was developed using native MapReduce code and, while functional, was difficult to use and difficult to extend. As a result, we opted to use the Apache Pig DSL to provide a more accessible user and developer experience.

During the development of BinaryPig, Hadoop's issues with storing and processing large numbers of small files became apparent: attempting to store and iterate over malware samples directly resulted in memory issues on the NameNode and unacceptable performance decreases because task startup overhead dominated job execution times. To resolve these issues, we created an ingest tool for compressing, packaging, and uploading malware binaries into Hadoop. The ingest tool combines thousands of small binary files into one or more large Hadoop SequenceFiles (each file typically ranging from 10 GB to 100 GB in size), an internal data storage format that enables Hadoop to take advantage of sequential disk read speeds as well as split these files for parallel processing [15]. The SequenceFiles created by BinaryPig are a collection of key/value pairs where the key is the ID of the binary file (typically its MD5 checksum) and the value is the contents of the file as raw bytes. To ensure that the malware data was properly distributed across computing nodes in the cluster and resilient to failures, the Hadoop Distributed File System (HDFS) was utilized [19]; HDFS provides many nice features for managing files and data, including automatic data replication, failover, and optimizations for batch processing.
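
To make the record layout concrete, here is a minimal Python sketch of the key/value pairing described above: the MD5 checksum as the key and the file's raw bytes as the value. This is only an illustration of the record structure; the actual BinaryPig ingest tool writes Hadoop SequenceFiles (typically via Hadoop's Java APIs), which this sketch does not do.

```python
import hashlib
import os
import sys

def iter_binary_records(sample_dir):
    """Yield (md5_hex, raw_bytes) pairs for every file under sample_dir.

    This mirrors the key/value layout described above: the key is the
    sample's MD5 checksum and the value is the file's raw bytes.
    """
    for root, _dirs, files in os.walk(sample_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                data = fh.read()
            yield hashlib.md5(data).hexdigest(), data

if __name__ == "__main__":
    total = 0
    for key, value in iter_binary_records(sys.argv[1]):
        total += len(value)
        print(f"{key}\t{len(value)} bytes")
    print(f"packaged {total} bytes of samples", file=sys.stderr)
```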

Within Pig, analysis scripts and their supporting libraries are uploaded to Hadoop's DistributedCache [16] before job execution and removed from the processing nodes afterwards. This functionality lets users quickly publish analytical scripts and libraries to the processing nodes without the usual burden of verifying that a deployment operated as expected, or the time associated with performing one manually.

We developed a series of Pig loader functions that read SequenceFiles, copy binaries to the local filesystem, and enable scripts to process them. These load functions were designed to let analysis scripts read the files dropped onto the local filesystem, which allows BinaryPig to leverage preexisting malware analysis scripts and tools. In essence, we are using HDFS to distribute our malware binaries across the cluster and using MapReduce to push the processing to the node where the data is stored. This prevents the needless data copying seen in many legacy cluster computing environments. The framework expects analysis scripts to write their output to stdout and any error messages to stderr (as most legacy malware analysis scripts already do). Ideally the data written is structured (JSON, XML, etc.), but the system gladly handles unstructured data too. Data emitted from the system is stored in HDFS for additional batch processing as needed. We also typically load this data into ElasticSearch indices [17]. This distributed index is used for manual exploration by individual users, as well as automated extraction by analytical processes. Through use of the Wonderdog library [12], we have been able to load data from the MapReduce jobs into ElasticSearch in a real-time fashion while it continues to serve queries.
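
The stdout/stderr contract described above is easy to illustrate. Below is a hypothetical analysis script, not one shipped with BinaryPig, that follows that contract: it takes the local file path dropped by the loader as an argument, writes a JSON result to stdout, and reports failures on stderr.

```python
#!/usr/bin/env python
"""Hypothetical analysis script following the stdout/stderr contract above."""
import hashlib
import json
import sys

def analyze(path):
    """Compute a few hashes for the binary dropped onto the local filesystem."""
    with open(path, "rb") as fh:
        data = fh.read()
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
    }

if __name__ == "__main__":
    try:
        # Structured results go to stdout so the framework can collect them.
        print(json.dumps(analyze(sys.argv[1])))
    except Exception as exc:
        # Errors go to stderr, as the framework expects.
        print(f"analysis failed for {sys.argv[1]}: {exc}", file=sys.stderr)
        sys.exit(1)
```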

[Figure 1]

We implemented optimizations in the backend processing system, including using the /dev/shm virtual device for writing malware binaries. This device functions as a filesystem and allows files to be read and written in memory managed by the OS. This allowed us to avoid a great deal of disk I/O when writing files and creating processes that did not lend themselves to accepting serialized files through stdin. Another optimization we implemented for some processing was creating long-running analytic daemons in lieu of scripts. This drastically increased performance for some important processing tasks: scripts are great for rapid iteration, but we found that script interpreter startup times were dominating task execution times. The long-running daemons listen on a socket and read local file paths over the connection; they then analyze the respective files and return the results over the same socket connection.
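
A minimal sketch of such a daemon is shown below, assuming a newline-delimited protocol (one local file path per line in, one JSON result per line out); the actual protocol, port, and analysis logic used by BinaryPig's daemons may differ.

```python
import hashlib
import json
import socketserver

class AnalysisHandler(socketserver.StreamRequestHandler):
    """Read newline-delimited local file paths; return one JSON result per line."""

    def handle(self):
        for raw in self.rfile:
            path = raw.decode().strip()
            if not path:
                continue
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
                result = {"path": path, "md5": hashlib.md5(data).hexdigest()}
            except OSError as exc:
                result = {"path": path, "error": str(exc)}
            self.wfile.write((json.dumps(result) + "\n").encode())

if __name__ == "__main__":
    # Bind to localhost only: the daemon serves tasks running on the same node.
    with socketserver.TCPServer(("127.0.0.1", 9999), AnalysisHandler) as server:
        server.serve_forever()
```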

[Figure 2]

V. Preliminary Results

The initial testing and development of BinaryPig was performed on a 16-node Hadoop cluster, onto which we loaded 20 million malware samples (~9.5 TB of data). Both development and execution of new analytics are demonstrably faster, and executing them against historical sample sets for comparison and learning, something that was not feasible before, is exceptionally valuable. A full pass over the 20 million sample historical set was clocked at 5 hours when calculating the MD5, SHA1, SHA256, and SHA512 hashes for each sample. Operation speed is driven primarily by the number of computational cores that can be dedicated to the process. Additionally, the ability to store, operate on, and analyze samples on a multi-use cluster is a marked improvement over requiring a dedicated cluster and storage environment.

VI. Ongoing Work

BinaryPig continues to be developed. There is currently an effort to migrate existing analytics to the new architecture in order to complete a malware census for use in developing predictive models. These results are continually developed and are intended for publication after validation. Additional input and output functionality between the Pig and analytics layers is being explored, as there is a need to retrieve emitted binary data, without an intermediate encoding and serialization step, for icon, image, and other resource extraction. Additionally, we feel that error emission over the stderr interface is not optimal for production job management and subtle error catching, and we are looking to expand the published examples with scripts that demonstrate better practices.

VII. References

[1] Higgins. “McAfee: Close To 100K New Malware Samples Per Day In Q2.” http://www.darkreading.com/attacks-breaches/mcafee-close-to-100k-new-malware-samples/240006702 [2] “Virus Total File Statistics.” https://www.virustotal.com/en/statistics/ as of 4/9/2013 [3] Amun, http://amunhoney.sourceforge.net/ [4] Dionaea, http://dionaea.carnivore.it/ [5] Chau, Duen Horng, et al. "Polonium: Tera-scale graph mining for malware detection." Proceedings of the second workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2010), Washington, DC. Vol. 25. 2010. [6] Perdisci, Roberto, Andrea Lanzi, and Wenke Lee. "Classification of packed executables for accurate computer virus detection." Pattern Recognition Letters 29.14 (2008): 1941-1946. [7] Neugschwandtner, Matthias, et al. "Forecast: skimming off the malware cream." Proceedings of the 27th Annual Computer Security Applications Conference. ACM, 2011. [8] Rieck, Konrad, et al. "Automatic analysis of malware behavior using machine learning." Journal of Computer Security 19.4 (2011): 639-668. [9] MTRACE, http://media.blackhat.com/bh-eu-12/Royal/bh-eu-12-Royal-Entrapment-WP.pdf [10] Firdausi, Ivan, et al. "Analysis of machine learning techniques used in behavior-based malware detection." Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on. IEEE, 2010. [11] Jang, Jiyong, David Brumley, and Shobha Venkataraman. "Bitshred: feature hashing malware for scalable triage and semantic analysis." Proceedings of the 18th ACM conference on Computer and communications security. ACM, 2011. [12] Wonderdog. InfoChimps. https://github.com/infochimps-labs/wonderdog [13] Apache Hadoop. Apache Foundation. http://hadoop.apache.org/ [14] Apache Pig. Apache Foundation. http://pig.apache.org/ [15] Sequence Files Format. Hadoop Wiki. http://wiki.apache.org/hadoop/SequenceFile [16] DistributedCache. Apache Foundation. http://hadoop.apache.org/docs/stable/mapred_tutorial.html#DistributedCache [17] ElasticSearch. http://www.elasticsearch.org/ [18] MASTIFF. http://sourceforge.net/projects/mastiff/ [19] Borthakur. HDFS Architecture Guide. http://hadoop.apache.org/docs/stable/hdfs_design.pdf

A Survey of Stochastic and Gazetteer Based Approaches for Named Entity Recognition - Part 2

This is part 2 of a two-part series. Part 1 is here.

Approaches to Named Entity Recognition

Generally speaking, the most effective named entity recognition systems can be categorized as rule-based, gazetteer-based, or machine learning approaches. Within each are myriad sub-approaches that combine the top-level categorizations to varying degrees; however, because of the distinct research challenges each poses, the literature typically focuses on one or the other.

Rule-based systems utilize pattern-matching techniques in text as well as heuristics derived either from the morphology or the semantics of the input sequence. They are generally used as classifiers in machine-learning approaches, or as candidate taggers in gazetteers. Some applications can also make effective use of stand-alone rule-based systems, but they are prone to both overreach and skipping over named entities. Rule-based approaches are discussed in (10), (12), (13), and (14).

Gazetteer approaches use an external knowledge source to match chunks of the text to names and entities via a dynamically constructed lexicon, or gazette. Gazetteers also provide a non-local model for resolving multiple names to the same entity. This approach requires either hand-crafted name lexicons or some dynamic means of obtaining a gazette from the corpus or another external source. However, gazette-based approaches achieve better results for specific domains. Most of the research on this topic focuses on expanding the gazetteer into more dynamic lexicons, e.g. using Wikipedia or Twitter to construct the gazette. Gazette-based approaches are discussed in (15), (16), and (17).

Stochastic approaches fare better across domains, and can perform predictive analysis on entities that are unknown in a gazette. These systems use statistical models and some form of feature identification to make predictions about named entities in text. They can further be supplemented with smoothing for universal coverage. Unfortunately these approaches require large amounts of annotated training data in order to be effective, and they don’t naturally provide a non-local model for entity resolution. Systems implemented with this approach are discussed in (7), (8), (4), (9), and (6).

1) Rule Based Annotation

The first conceit in tackling named entity recognition is to look for clues within the structure and grammar of the text that indicate to readers (and therefore, hopefully, also to systems) that the words and tokens refer to some named entity. Aside from the obvious issues with transcription, spelling, and unique formatting, it turns out that pattern matching is fairly successful at detecting those entities, if only in formal corpora like newsprint. Applying a series of rules as a preprocessing step to reduce the complexity of other techniques is widely used and is exemplified in (10), a rule-based NER annotation system supplemented with a machine-learning component to facilitate discovery of domain-specific names.

Nadeau and Sekine provide a table of word-level features that may indicate named entities in (19). These features include case (whether the word is capitalized, all capitals, or mixed case), punctuation, morphology, and part of speech. Features like these, especially capitalization, form the basis for most heuristics. Word-level features can then be combined with contextual features to further identify names.

So-called "sure-fire" rules are described in (12); they rely on contextual indicators either prepended or appended to a series of capitalized tokens whose part of speech is unknown or generically tagged as nouns. These can take the form of titles such as Mr., Mrs., or Dr., commonly prepended to names, or Jr., MD, and III, commonly appended. Corporate designators such as Ltd, LLC, or Bank can similarly identify organizations, and address designators like rue, rd., or strasse can identify particular locations.

These grammatical indicators can then be extended to a series of rules that rely on part-of-speech tags or positional patterns. For instance, first names are generally easy to identify via a lexicon of proper names, and a capitalized word that follows a first name is most likely a last name, which tends to be more unique. Other rules require inspection of the context surrounding the possible named entity. In the sentence "... purchased 100 shares of (Tokens+) ...", where (Tokens+) is a series of capitalized words, the tokens are most likely the name of a publicly traded company. Alternatively, prepositions and other parts of speech can help identify locations; for example, Frederick can be either a name or a location (a city in western Maryland), but use of the phrase "in Frederick" would indicate a location.
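
A toy illustration of these word-level and contextual rules, using a couple of hand-written regular expressions, is shown below; the title and designator lists here are illustrative, not drawn from the cited systems.

```python
import re

# Illustrative "sure-fire" rule: a title immediately followed by capitalized
# tokens is very likely a person name.
TITLES = r"(?:Mr\.|Mrs\.|Ms\.|Dr\.)"
PERSON_RULE = re.compile(rf"{TITLES}\s+(?:[A-Z][a-z]+\s?)+")

# Capitalized tokens ending in a corporate designator are likely organizations.
CORP_DESIGNATORS = r"(?:Ltd\.?|LLC|Inc\.?|Bank)"
ORG_RULE = re.compile(rf"(?:[A-Z][a-z]+\s)+{CORP_DESIGNATORS}")

def candidate_entities(text):
    """Return (label, span) candidates found by the hand-written rules."""
    candidates = [("PER", m.group(0).strip()) for m in PERSON_RULE.finditer(text)]
    candidates += [("ORG", m.group(0).strip()) for m in ORG_RULE.finditer(text)]
    return candidates

print(candidate_entities("Dr. Jane Smith joined Acme Widgets Inc. last year."))
# [('PER', 'Dr. Jane Smith'), ('ORG', 'Acme Widgets Inc.')]
```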

2) Learning and Stochastic Approaches

Out of sixteen participants in the CoNLL-2003 shared task, all employed some form of machine learning (1). According to the paper, most used a maximum entropy model, others used statistical learning methods, and still others used Hidden Markov Models. Tjong Kim Sang and De Meulder even go so far as to say that because Maximum Entropy was used both in isolation and combined with other systems, it seems to be a good choice for the named entity recognition task.

As for the performance evaluations, a combined-system approach fared the best, namely the Classifier Combination approach in (7). This approach employed a simple combination of four classifiers with equal voting: a Robust Risk Minimization classifier, a Maximum Entropy classifier, a Transformation-Based Learning classifier, and a Hidden Markov Model classifier. However, the approach in (8) used only a Maximum Entropy learner and scored extremely close to the combined approach, and many of the top scorers used Hidden Markov Models in combination with some other classifier, as in (20); these approaches are therefore the ones this paper will focus on.

Maximum Entropy Classification

Maximum Entropy estimates probability solely on the basis of the constraints derived from training data, in an attempt to simplify estimation by making as few assumptions as possible; in other words, it attempts to select the best probability distribution given the information available to the learner. In (8), this is expressed as the following:

$$p(c_i \mid s, D) \;=\; \frac{1}{Z(s, D)} \prod_{j} \alpha_j^{\,f_j(c_i,\, s,\, D)}$$

where p(ci|s,D) is the probability of class ci given a sequence of tokens s and the document D in which the sequence appears, Z(s,D) is a normalizing constant, and each αj is the weight learned for a binary feature fj. Note that D is used to perform NER with global information across the document, a variation on the Maximum Entropy approach that improved Chieu and Ng's performance over other statistical methods; for a general discussion of Maximum Entropy classification, however, the details are not needed here. Dynamic programming algorithms are then used to select the word sequences with the highest probabilities.

Classifiers are trained on a series of non-contextual features (called word-level features in the rule-based approach above), lexical features, morphological features, and global features. For example, non-contextual features include case and zone (the location of the token in the sequence), string information (e.g. the string contains all capital letters and a period), and first word (whether or not the token begins a sentence). Lexical features include token frequencies, the existence of the token in the vocabulary, and common names. Global features can be taken from a gazette or from information in the rest of the document.
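
As a rough illustration, a sketch of a word-level feature extractor is shown below; the feature names and definitions are simplified stand-ins, not the exact feature set used in (8).

```python
def word_features(token, index, sentence):
    """Toy word-level feature extractor; names and definitions are illustrative."""
    return {
        "init_caps": token[:1].isupper(),
        "all_caps": token.isupper(),
        "mixed_case": any(c.isupper() for c in token[1:]) and not token.isupper(),
        "contains_digit": any(c.isdigit() for c in token),
        "contains_period": "." in token,
        "first_word": index == 0,
        # "zone" here is just the token's relative position in the sentence.
        "zone": index / max(len(sentence) - 1, 1),
    }

sentence = ["S.E.C.", "chief", "Mary", "Shapiro", "to", "leave", "Washington"]
for i, tok in enumerate(sentence):
    print(tok, word_features(tok, i, sentence))
```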

Chunked Tagging with Hidden Markov Models

Hidden Markov Models seem naturally suited to named entity annotation because tagging named entities is closely related to part-of-speech tagging, and those systems have been successful using HMMs, as exemplified in (21). NER, like part-of-speech tagging, is a classification problem that involves an inherent state sequence representation. Unlike part-of-speech tagging, however, a label can span a sequence of tokens that must be coordinated by some sort of chunking. Zhou and Su tackled this problem in (4) through the following, familiar equation:

$$\hat{T} \;=\; \operatorname*{arg\,max}_{T}\; \Bigl[\, \log P(T) \;+\; \sum_{i=1}^{n} \log P(g_i \mid t_i) \Bigr]$$

in order to find the tag sequence (T = t1, ..., tn) that maximizes the probability given the token sequence (G = g1, ..., gn). This allows the chunking program to generate named entity tags from the original text, and as with other approaches that use Bayes' rule, we can estimate P(gi|ti) from n-gram frequencies in a training set and P(T) by similarly counting over an annotated training set. Log probabilities are used to prevent underflow. Hidden Markov Models rely heavily on training data to acquire sequence boundary features (although many systems in the literature combine them with classifiers similar to those described in the maximum entropy approach), and they must use smoothing to cover unseen data.
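
For readers unfamiliar with the decoding step, the sketch below shows log-space Viterbi decoding over toy transition and emission tables; it illustrates the general HMM tagging machinery rather than the specific chunk tagger of (4).

```python
import math

def viterbi(tokens, tags, log_trans, log_emit):
    """Log-space Viterbi decoding for a toy HMM tagger.

    log_trans[(prev_tag, tag)] and log_emit[(tag, token)] stand in for
    probabilities estimated from an annotated training set; missing entries
    fall back to a large negative constant (a crude stand-in for smoothing),
    and a start-state prior is omitted for brevity.
    """
    FLOOR = -1e9
    best = [{t: log_emit.get((t, tokens[0]), FLOOR) for t in tags}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for t in tags:
            scores = {
                p: best[i - 1][p]
                + log_trans.get((p, t), FLOOR)
                + log_emit.get((t, tokens[i]), FLOOR)
                for p in tags
            }
            prev = max(scores, key=scores.get)
            best[i][t], back[i][t] = scores[prev], prev
    # Trace back the highest-scoring tag sequence.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["O", "PER"]
log_emit = {("PER", "Mary"): math.log(0.9), ("O", "Mary"): math.log(0.1),
            ("O", "left"): math.log(0.9), ("PER", "left"): math.log(0.1)}
log_trans = {(a, b): math.log(0.5) for a in tags for b in tags}
print(viterbi(["Mary", "left"], tags, log_trans, log_emit))  # ['PER', 'O']
```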

3) Gazetteer Based Approaches

Stochastic methodologies for named entity recognition provide candidates for annotation and, to a certain extent, can estimate the likelihood of a candidate belonging to a category or subcategory of named entity. However, this is not a complete solution: machine-learning approaches require knowledge in order to propose candidacy for untrained tokens, and further knowledge is required to classify candidates once they have been proposed. Additionally, other issues may be resolved by gazettes; for instance, non-local dependencies can be listed in gazettes to indicate cross-references between various names referring to the same entity.

Gazettes, therefore, are utilized to supply external knowledge to learners, or to supply unannotated data with a training source. In fact, most of the teams in the CoNLL-2003 Shared Task made use of a gazette (1), and gazettes have been described as the bottleneck in many learning systems (12) due to the elasticity and rapid evolution and expansion of names as unique descriptions for entities. Therefore, research has been directed toward the development of gazettes as lexicons of named entities rather than towards the specific application of gazettes in a named entity system.

This does not mean, however, that gazette-only systems do not exist. Because of the overhead of gazette-based look-up (many require a query to the Internet), filtering of corpus tokens is required to generate named entity candidates. Systems described in (15), (16), and (19) use a combination of rule-based mechanisms, part-of-speech tags, and word frequency analysis to propose these candidates, with no machine learning involved.

Wiki Based Gazetteer Approaches

Looking to the Internet as a source of knowledge that both humans and computers can understand poses its own design challenges. In (15), Kazama and Torisawa propose that open collaborative encyclopedias on the web, like Wikipedia, can organize external knowledge for gazette construction. They make the point that since Wikipedia's articles are intended to be encyclopedic, most articles are about named entities and are more structured than raw text. Named entity candidates can therefore be resolved by a simple web search against Wikipedia.

The Wikipedia article can then be mined for classification information. Kazama and Torisawa use the first sentence to locate the object of the verb “to be.” Richman and Schone utilize the Wikipedia categories attached to the article for classification in their attempt to mine resources for multilingual named entity recognition (16) in order to avoid the use of semantic parsing. Generally speaking, the beginning of the article on an encyclopedic entry can be used as a contextual signature to categorize the named entity. Further, it can be used to provide non-local dependency analysis.

For example, the Wikipedia result for "Mekong Delta" contains the contextual clues "region", "river", "sea", "distributaries", "Vietnam", and "water", which could lead to classification as a water-based location. Salman Rushdie's article contains the signature words "novelist" and "essayist", classifying this entity as a writer (person). This technique of classification is similar to the word-sense disambiguation technique used in the Lesk algorithm.
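
A minimal sketch of this signature-overlap idea is shown below; the signature word lists and article text are made-up stand-ins, not taken from the cited systems.

```python
def classify_by_signature(article_text, signatures):
    """Pick the category whose signature words overlap most with the article intro."""
    words = set(article_text.lower().split())
    scores = {cat: len(words & sig) for cat, sig in signatures.items()}
    return max(scores, key=scores.get)

signatures = {
    "LOCATION": {"region", "river", "sea", "vietnam", "water", "city"},
    "PERSON": {"novelist", "essayist", "born", "writer"},
}
intro = ("The Mekong Delta is the region in southwestern Vietnam "
         "where the Mekong River approaches and empties into the sea")
print(classify_by_signature(intro, signatures))  # LOCATION
```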

Wikipedia also provides for non-local dependency analysis through two unique features. First, each article concerns only one particular named entity, so any referential names found in the article can be said to identify that entity. Second, Wikipedia provides "disambiguation" pages that present the most frequent uses of a name first, along with categorization. For instance, a search for "Cambridge (disambiguation)" reveals several geographic place names in England, the United States, Canada, and Australia, and also exposes the University of Cambridge as another common use of the name. Mining contextual clues in these Wikipedia articles for a signature, and comparing them to the signature of the input, provides not only the correct categorization but also a source of variations on the name for use in cross-referencing.

Both (15) and (16) scored well on the CoNLL-2003 evaluation scale. Kazama and Torisawa published results with an F-score of 88.02, which compares very well to the best-performing stochastic methods. They attribute the increase over the baseline to improved precision rather than improved recall. However, neither method reports the performance cost associated with a Wikipedia-based gazette approach.

4) Mechanical Turk Approaches

One interesting approach to named-entity knowledge acquisition deserves a small mention here. Lawson et al. made use of Amazon's Mechanical Turk service to annotate a data set of emails with named entities in (22). Mechanical Turk is a low-cost way of applying human labor to a repetitive task by paying a small amount of money for each small unit of work. This paper demonstrates the difficulty of producing human-annotated training data for statistical or gazette-based analysis.

Even for the seemingly simple task of identifying PERSON, LOCATION, or ORGANIZATION in emails of 60 to 900 characters, Lawson et al. found that they could not pay a flat rate per email, but instead had to set up a reward mechanism that paid per discovered entity (otherwise annotators simply marked "no entities" to game the system). Although the authors argue that they were able to use inter-annotator agreement to assign a particular number of workers to the same task and reduce costs, it seems that only a portion of the annotators provided significant contributions, as 12.5% of the workers completed 74.9% of the annotation! Although the final annotated data set was of high quality and low economic cost, it took more than four months to build using this methodology.

Conclusion

This paper explored the topic of named entity recognition, particularly as described in the CoNLL-2003 shared task: language-independent named entity recognition of organizations, locations, and people. Several of the pitfalls associated with named entity recognition systems were discussed, including chunking and text representation for multi-word names, inference for resolving name ambiguity, cross-referencing with non-local dependencies, and the use of external knowledge in NER systems given the lexical challenge that unique names pose.

Several approaches to named entity recognition were discussed, including rule-based, machine learning, and gazetteer approaches. Rule-based systems have the advantage of not requiring significant computation, but they suffer from low accuracy and recall; they are often used in coordination with gazette-based and machine learning systems. Stochastic methods focus on probabilistic modeling; they can cover new tokens and decide whether sequences of tokens belong together, but they require a large training set as well as a smoothing implementation to capture universal knowledge. Finally, gazetteer approaches, or dynamic lexicon generation, were discussed in the context of named entity recognizers. This approach performs well at discovering non-local dependencies and at disambiguating multiple name senses; however, candidates must be proposed for gazette look-up, and some overhead is incurred in accessing the gazette.

For now, combinations of machine learning and gazetteer systems, which use rule-based classifiers for features and candidate proposal, are the best performing systems. Further research needs to be performed on multi-stage approaches to better integrate the two methodologies.

References

(1) Erik F. Tjong Kim Sang and De Meulder Fien, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 142-147. (2) Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson, "1999 Named Entity Recognition Task Definition," MITRE and SAIC, 1999. (3) Lev Ratinov and Dan Roth, "Design Challenges and Misconceptions in Named Entity Recognition," in CoNLL '09 Proceedings of the 13th Conference on Computational Natural Language Learning, Stroudsburg, PA, 2009, pp. 147-155. (4) GuoDong Zhou and Jian Su, "Named Entity Recognition Using an HMM-Based Chunk Tagger," in ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, 2002, pp. 473-480. (5) Jenny Rose Finkel and Christopher D. Manning, "Joint Parsing and Named Entity Recognition," in NAACL '09 Proceedings of Human Language Technologies, Stroudsburg, Pa, 2009, pp. 326-334. (6) Hirotaka Funayama, Tomohide Shibata, and Sadao Kurohashi, "Bottom-Up Named Entity Recognition Using a Two-Stage Machine Learning Method," in MWE '09 Proceedings of the Workship on Multiword Expressions: Identification, Interpretation, Disambiguation, and Applications, Stroudsburg, PA, 2009, pp. 55-62. (7) Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang, "Named Entity Recognition Through Classifier Combination," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 168-171. (8) Hai Leong Chieu and Hwee Tou Ng, "Named Entity Recognition: A Maximum Entropy Approach Using Global Information," in COLING '02 Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, Stroudsburg, PA, 2002, pp. 1-7. (9) Andrew McCallum and Wei Li, "Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction, and Web-Enhanced Lexicons," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 188-191. (10) Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, and Shivakumar Vaithyanathan, "Domain Adaption of Rule-Based Annotators for Named Entity Recognition Tasks," in EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, 2010, pp. 1002-1012. (11) Vijay Krishnan and Christopher D. Manning, "An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition," in ACL-44 Proceedings of the 21st International Conference on Computational Linguistics, Stroudsburg, PA, 2006, pp. 1121- 1128. (12) Andrei Mikheev, Marc Moens, and Claire Grover, "Named Entity Recognition Without Gazetteers," in EACL '99 Proceedings of the 9th Conference on the European Chapter of the Association for Computational Linguistics, Stroudsburg, PA, 1999, pp. 1-8. (13) Silviu Cucerzan and David Yarowsky, "Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence," in Joint SIGDAT Conference on EMNLP and VLC, 1999. (14) Dimitra Farmakioutou et al., "Rule-Based Named Entity Recognition for Greek Financial Texts," in Proceedings of the Workshop on Computational Lexicography and Multimedia Dictionaries, 2000, pp. 75-78. 
(15) Jun'ichi Kazama and Kentaro Torisawa, "Exploiting Wikipedia as External Knowledge for Named Entity Recognition," in Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007, pp. 698-707. (16) Alexander E. Richman and Patrick Schone, "Mining Wiki Resources for Multilingual Named Entity Recognition," in Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics: Human Language Technologies, Stroudsburg, PA, 2008, pp. 1-9. (17) Alan Ritter, Sam Clark, Mausam, and Oren Etzioni, "Named Entity Recognition in Tweets: An Experimental Study," in EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, 2011, pp. 1524-1534. (18) Nancy Chinchor, "MUC-7 Named Entity Task Definition," in Message Understanding Conference. (19) David Nadeau and Satoshi Sekine. (2008, Jan.) A Survey of Named Entity Recognition and Classification. (Online). http://nlp.cs.nyu.edu/sekine/papers/li07.pdf (20) Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning, "Named Entity Recognition with Character-Level Models," in Proceedings of CoNLL-2003, 2003, pp. 180-183. (21) Scott M Thede and Mary P Harper, "A Second-Order Hidden Markov Model for Part-of-Speech Tagging," in Proceedings of the 37th Annual Meeting of the ACL, Stroudsburg, PA, 1999, pp. 175-182. (22) Nolan Lawson, Kevin Eustice, Mike Perkowitz, and Meliha Yetisgen-Yildiz, "Annotating Large Email Datasets for Named Entity Recognition with Mechanical Turk," in CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, Stroudsburg, PA, 2010, pp. 71-79.

An Introduction to Named Entity Recognition in Natural Language Processing - Part 1

Abstract

The task of identifying proper names of people, organizations, locations, or other entities is a subtask of information extraction from natural language documents. This paper presents a survey of techniques and methodologies currently being explored to solve this difficult subtask. After a brief review of the challenges of the task, as well as a look at previous conventional approaches, the focus will shift to a comparison of stochastic and gazetteer based approaches. Several machine-learning approaches are identified and explored, along with a discussion of knowledge acquisition relevant to recognition. This two-part white paper will show that applications that require named entity recognition will be served best by some combination of knowledge-based and non-deterministic approaches.

Introduction

In school we were taught that a proper noun is "a specific person, place, or thing," thus extending our definition of a concrete noun. Unfortunately, this seemingly simple mnemonic masks an extremely complex computational linguistic task: the extraction of named entities, e.g. persons, organizations, or locations, from corpora (1). More formally, the task of Named Entity Recognition and Classification can be described as the identification of named entities in computer-readable text via annotation with categorization tags for information extraction.

Not only is named entity recognition a subtask of information extraction, but it also plays a vital role in reference resolution, other types of disambiguation, and meaning representation in other natural language processing applications. Semantic parsers, part-of-speech taggers, and thematic meaning representations could all be extended with this type of tagging to provide better results. Other, NER-specific applications abound, including question answering systems, automatic forwarding, textual entailment, and document and news searching. Even at a surface level, an understanding of the named entities involved in a document provides a much richer analytical framework and cross-referencing.

Named entities have three top-level categorizations according to DARPA’s Message Understanding Conference: entity names, temporal expressions, and number expressions (2). Because the entity names category describes the unique identifiers of people, locations, geopolitical bodies, events, and organizations, these are usually referred to as named entities and as such, much of the literature discussed in this paper focuses solely on this categorization, although it is easy to imagine extending the proposed systems to cover the full MUC-7 task. Further, the CoNLL-2003 Shared Task, upon which the standard of evaluation for such systems is based, only evaluates the categorization of organizations, persons, locations, and miscellaneous named entities. For example:

(ORG S.E.C.) chief (PER Mary Shapiro) to leave (LOC Washington) in December.

This sentence contains three named entities that demonstrate many of the complications associated with named entity recognition. First, S.E.C. is an acronym for the Securities and Exchange Commission, which is an organization. The two words "Mary Shapiro" indicate a single person, and Washington, in this case, is a location and not a name. Note also that the token "chief" is not included in the person tag, although it very well could be. In this scenario, it is ambiguous whether "S.E.C. chief Mary Shapiro" is a single named entity or whether multiple, nested tags would be required.
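
One common way to encode such annotations is token-per-line IOB tagging, sketched below for the example sentence; the exact tag scheme varies between corpora, so this is illustrative rather than the precise CoNLL-2003 encoding.

```python
# Token-per-line representation of the example sentence in an IOB-style scheme.
tagged = [
    ("S.E.C.", "B-ORG"),
    ("chief", "O"),
    ("Mary", "B-PER"),
    ("Shapiro", "I-PER"),
    ("to", "O"),
    ("leave", "O"),
    ("Washington", "B-LOC"),
    ("in", "O"),
    ("December", "O"),
    (".", "O"),
]
for token, tag in tagged:
    print(f"{token}\t{tag}")
```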

Named Entity Recognition Challenges

Some key design decisions in an NER system are proposed in (3) that cover the requirements of NER in the example sentence above:

  • Chunking and text representation
  • Inference and ambiguity resolution algorithms
  • Modeling of Non-Local dependencies
  • Implementation of external knowledge resources and gazetteers

Named entities are often not simply singular words, but chunks of text, e.g. University of Maryland Baltimore County or The Central Bank of Australia. Therefore, some chunking or parsing prediction model is required to predict whether a group of tokens belongs to the same entity. Left-to-right decoding, Viterbi, and beam search algorithms have been employed as chunking algorithms in the literature. Further, some NER systems are composed primarily of text parsers, as in (4), (5), and (6).

Inference refers to the ability of a system to determine that a chunk is actually a named entity or, sometimes more importantly, to determine the classification of a named entity, especially where there is ambiguity. For example, "Washington" might refer to either a name or a location. "Galaxy" might refer to a generic noun or to the professional Major League Soccer team. Maximum Entropy Models, Hidden Markov Models, and other statistical methods are employed to perform this analysis, usually implemented as machine-learning systems, as for instance in (7), (8), and (9).

Non-local dependency models refer to the ability to identify multiple tokens that should receive the same label assignment or cross-reference. It is important to note that case matters here; e.g. Bengfort, bengfort, and BENGFORT should all be identified as the same entity, and such variants would break word-level rule-based systems (e.g. find all words that are capitalized). Further, even different chunks should be identified similarly, such as UMBC vs. University of Maryland Baltimore County, or the inclusion of titles in one location that are absent from another, as in President Barack Obama vs. Obama. The non-locality of these models refers to the usage of these terms outside the scope of the sequence of tokens being analyzed together (usually a sentence), i.e. in the entire document or corpus. Papers that address non-local dependencies include (10) and (11), but the focus of this paper will be on solutions that require external knowledge.
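
As a small illustration of the case problem, the sketch below groups surface forms by a normalized key; anything beyond case folding (acronyms like UMBC, dropped titles) would require the external knowledge discussed next.

```python
from collections import defaultdict

def cluster_mentions(mentions):
    """Group surface forms that differ only by case under one normalized key."""
    clusters = defaultdict(list)
    for m in mentions:
        clusters[m.lower()].append(m)
    return dict(clusters)

print(cluster_mentions(["Bengfort", "bengfort", "BENGFORT", "Obama", "obama"]))
# {'bengfort': ['Bengfort', 'bengfort', 'BENGFORT'], 'obama': ['Obama', 'obama']}
```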

Names, because they uniquely identify entities, form a domain not easily captured by even the most expansive lexicons. For example, a list of the names of all companies formed in the United States would expand dramatically every single year. However, external knowledge and name lexicons are required for many of the approaches and solutions, not just non-local dependency models. Therefore the construction and use of gazetteers and other resources is necessary.

Evaluation of NERC Systems

Throughout the literature on Named Entity Recognition and Classification systems, two seminal conferences and evaluations are mentioned when evaluating systems: the 1997 Message Understanding Conference (MUC-7) (18) and the CoNLL-2003 Shared Task (1). Especially since 2003, many NERC systems use the CoNLL-2003 evaluation to demonstrate the performance of their work. This evaluation specifies a data set containing a training file, a development file, a test file, and a large amount of unannotated data in two languages, English and German, taken from news article corpora. All learning methods were trained on the training file and tested on the test data; the development file could be used for tuning the parameters of the system. The data was preprocessed using a tokenizer, chunker, and part-of-speech tagger, with each token on a single line and sentence boundaries represented by a single blank line. Named entities were tagged by hand at the University of Antwerp. Performance was measured using an F-score with β = 1, which simplifies to the harmonic mean:

$$F_{\beta=1} \;=\; \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where precision is the ratio of correct results to results returned and recall is the ratio of correct results to the number of results that should have been returned. The baseline result was computed for a system that only identified entities with a unique class in the training data, and for CoNLL-2003 it was 59.61 ± 1.2. The highest performing system was (7), with an F-score of 88.76 ± 0.7, followed closely by (8) with an F-score of 88.31 ± 0.7. Even the lowest performing systems exceeded the baseline result.
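
For concreteness, the computation behind this score looks like the following; the counts are toy values, not results from the shared task.

```python
def f1(correct, returned, expected):
    """Harmonic-mean F-score (beta = 1) from raw counts, as defined above."""
    precision = correct / returned
    recall = correct / expected
    return 2 * precision * recall / (precision + recall)

# Toy counts only, not the CoNLL-2003 results.
print(round(f1(correct=80, returned=100, expected=95), 4))  # 0.8205
```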

Stay tuned for part 2 tomorrow!

References

(1) Erik F. Tjong Kim Sang and De Meulder Fien, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 142-147. (2) Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson, "1999 Named Entity Recognition Task Definition," MITRE and SAIC, 1999. (3) Lev Ratinov and Dan Roth, "Design Challenges and Misconceptions in Named Entity Recognition," in CoNLL '09 Proceedings of the 13th Conference on Computational Natural Language Learning, Stroudsburg, PA, 2009, pp. 147-155. (4) GuoDong Zhou and Jian Su, "Named Entity Recognition Using an HMM-Based Chunk Tagger," in ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, 2002, pp. 473-480. (5) Jenny Rose Finkel and Christopher D. Manning, "Joint Parsing and Named Entity Recognition," in NAACL '09 Proceedings of Human Language Technologies, Stroudsburg, Pa, 2009, pp. 326-334. (6) Hirotaka Funayama, Tomohide Shibata, and Sadao Kurohashi, "Bottom-Up Named Entity Recognition Using a Two-Stage Machine Learning Method," in MWE '09 Proceedings of the Workship on Multiword Expressions: Identification, Interpretation, Disambiguation, and Applications, Stroudsburg, PA, 2009, pp. 55-62. (7) Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang, "Named Entity Recognition Through Classifier Combination," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 168-171. (8) Hai Leong Chieu and Hwee Tou Ng, "Named Entity Recognition: A Maximum Entropy Approach Using Global Information," in COLING '02 Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, Stroudsburg, PA, 2002, pp. 1-7. (9) Andrew McCallum and Wei Li, "Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction, and Web-Enhanced Lexicons," in CONLL '03 Proceedings of the 7th Conference on Natural Language Learning, vol. 4, Stroudsburg, PA, 2003, pp. 188-191. (10) Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, and Shivakumar Vaithyanathan, "Domain Adaption of Rule-Based Annotators for Named Entity Recognition Tasks," in EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, 2010, pp. 1002-1012. (11) Vijay Krishnan and Christopher D. Manning, "An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition," in ACL-44 Proceedings of the 21st International Conference on Computational Linguistics, Stroudsburg, PA, 2006, pp. 1121- 1128.

Applying Aggregated Government Data - Insights into a Data-focused Company

The following blog post is a review of Enigma Technology, Inc., its VC-funded open source government data aggregation platform, and how that platform may be utilized in different business applications.  Enjoy!

Approach

Aggregation of previously unconnected data sources is a new industry in our hyper-connected world.  Enigma has focused on aggregating open source US Government data, and the question is, "What is possible with this new technology?"  Given only the information on the website, this paper explores Enigma's decision-support approach through its three 'Feature' sections: Data Sources, Discover, and Analyze.

Data Sources

The technology to gather data from the open web, either directly or by scraping websites, has reached maturity, and as a result it is largely a bureaucratic exercise to focus aggregation on the specific government data domains they highlight (aircraft, lobbying, financial, spending, and patent).  Enigma focuses on data augmentation, API access, and custom data, which is another way of saying, "We can provide standard insights or give you easy access, and we can apply these principles to whatever data sets you have in mind."  This business model is another 21st-century standard: a standard charge for self-service data applications, and consulting on private data applications.

Discover (Data)

A primary feature of Government is its siloed approach to data, a classic example being the sharing of intelligence information following 9/11.  Juxtaposing previously separate data sets surfaces new correlations and thereby new insights, allowing for exploration that was previously impossible or impractical.  Combined with a powerful UI and self-service tools, Enigma seeks to empower its users to recognize what's most valuable to them, rather than providing any one 'right' answer, an approach broadly adopted in the software-as-a-service (SaaS) industry and widely applied to Government data.

Analyze (Data)

Enigma's goal is to immerse its users in the data in a meaningful way, allowing them to drill down to any detail or rise above the fray with metadata created by its standard functions and operators, à la classic statistics, classification, and data ontology.  Again utilizing a powerful UI and self-service tools, Enigma plans to empower its users to focus on their data of interest (filtering) and to choose which mathematical operations to perform on the data under comparison, all with the goal of integrating previously independent Government data sets.

Business Applications

If the aggregated open source data is directly applicable to your existing business, by all means weigh the ROI of an Enigma subscription.  In most cases, however, applying this data will require more significant discussion and negotiation with potential clients, or Enigma's self-service model will serve only as a demonstration for private or more restricted-access data.  Government organizations are being charged with integrating data across 'silos', and services like Enigma's will provide comprehensive tools for the first step in this process, opening the door to consultation on its application and to services specializing in the chartered goals of that Government data integration.