Data Science MD held its second meeting, "Teaching Machines to Read: Processing Text with NLP," which was also its first Baltimore-based event, at the fantastic Ad.com venue, part of the Under Armour campus at Hull Point. Our group was fortunate to have two excellent speakers: Craig Pfeifer, a Principal Artificial Intelligence Engineer at the MITRE Corp, and Dr. Jesse English, a PhD computer scientist from UMBC who specialized in NLP, machine learning, and machine reading.
Craig discussed his experiences on a project looking at author attribution in 422 Supreme Court decisions from 2006-2008. Using the name of the author as the sole document label (with every author represented in the training set), Craig extracted a large number of features (character n-grams, sentence metrics, etc.) and used Weka's SMO, an implementation of the sequential minimal optimization algorithm for training support vector machines, for classification with good results. Very notable were Craig's comments on the time-consuming nature of extracting text from PDFs (don't try this at home). His slides are here in PDF form.
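For readers who have not worked with stylometric features, here is a minimal Python sketch of the two feature families mentioned above: character n-grams and simple sentence metrics. This is not Craig's code. The function names, the choice of trigrams, and the naive sentence splitter are all assumptions made for illustration.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def sentence_metrics(text):
    """Return sentence count and mean words per sentence (naive punctuation split)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_counts = [len(s.split()) for s in sentences]
    mean_len = sum(word_counts) / len(word_counts) if word_counts else 0.0
    return {"sentence_count": len(sentences), "mean_words_per_sentence": mean_len}


def extract_features(text, n=3):
    """Merge n-gram counts and sentence metrics into one feature dictionary."""
    features = {f"ngram:{gram}": count for gram, count in char_ngrams(text, n).items()}
    features.update(sentence_metrics(text))
    return features
```

Each document would yield one such feature dictionary, labeled with its author, and those dictionaries could then be vectorized and handed to a classifier such as a support vector machine. A real pipeline would likely need a sturdier sentence splitter, since legal citations are full of periods that do not end sentences.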
Dr. Jesse English introduced the audience to a brand new open source Python toolkit for deep semantic analysis of big data: WIMs (Weakly Inferred Meanings). WIMs, humorously capable of "answering questions like a boss," allow users to ask who/what/when/where/why and even how questions of the text. In Dr. English's own words:
A WIM is a structured meaning representation, not unlike a TMR (text meaning representation), with a limited scope in expected coverage. The scope has been limited intentionally for performance reasons – one would use a WIM rather than a TMR when the scope of coverage is sufficient and the cost (in time or development) of a full TMR is too great.
Typically, the production of a full TMR would require a domain-comprehensive syntactic-semantic lexicon and accompanying ontology (as well as a wealth of other related knowledge bases). A compilation of microtheories of meaning analysis would be required to process the text using the knowledge – both of these resources are extremely expensive to produce, and accurate processing of the text rapidly becomes unscalable without introducing domain-dependent algorithmic shortcuts.
By relying on WIMs, rather than a full TMR, the most typically relevant semantic data can still be produced in linear time with off-the-shelf knowledge resources (e.g., WordNet).
To view the presentation, click Deep Semantic Analysis of Big Data (WIMs) for the PDF.