This is a guest post by Charlie Greenbacker and Tommy Jones.
Data comes in many forms. As a data scientist, you might be comfortable working with large amounts of structured data nicely organized in a database or other tabular format, but what do you do if a customer drops 10,000 unstructured text documents in your lap and asks you to analyze them?
Some estimates claim unstructured data accounts for more than 90 percent of the digital universe, much of it in the form of text. Digital publishing, social media, and other forms of electronic communication all contribute to the deluge of text data from which you might seek to derive insights and extract value. Fortunately, many tools and techniques have been developed to facilitate large-scale text analytics. Operating at the intersection of computer science, artificial intelligence, and computational linguistics, Natural Language Processing (NLP) focuses on algorithmically understanding human language.
Interested in getting started with Natural Language Processing but don't know where to begin? On July 9th, a joint meetup co-hosted by Statistical Programming DC, Data Wranglers DC, and DC NLP will feature two introductory talks on the nuts & bolts of working with NLP in Python and R.
The Python programming language is increasingly popular in the data science community for a variety of reasons, including its ease of use and the plethora of open source software libraries available for scientific computing & data analysis. Packages like SciPy, NumPy, scikit-learn, pandas, NetworkX, and others help Python developers perform everything from linear algebra and dimensionality reduction to clustering data and analyzing multigraphs.
Back in the dark ages (10 or more years ago), folks working in NLP usually maintained an assortment of homemade utility programs designed to handle many of the common NLP tasks. Despite our best intentions, most of this code was lousy, brittle, and poorly documented -- hardly a good foundation upon which to build your masterpiece. Over the past several years, however, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. NLTK enables researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.
If you're already familiar with Python, the NLTK library will equip you with many powerful tools for working with text data. The O'Reilly book Natural Language Processing with Python written by Steven Bird, Ewan Klein, and Edward Loper offers an excellent overview of using NLTK for text analytics. Topics include processing raw text, tagging words, document classification, information extraction, and much more. Best of all, the entire contents of this NLTK book are freely available online under a Creative Commons license.
The Python portion of this joint meetup event will cover a handful of the NLP building blocks provided by NLTK, including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. These components will then be assembled to build a very basic document summarization program.
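To give a feel for how such building blocks fit together, here is a minimal sketch of a frequency-based extractive summarizer. It is not the program from the talk: it uses only the Python standard library as a stand-in for NLTK's tokenizers and corpora, it skips stemming, lemmatization, and named entity recognition entirely, and every name in it (`TextExtractor`, `html_to_text`, `tokenize`, `summarize`, the tiny `STOPWORDS` set) is made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "for", "in", "is", "it",
             "of", "on", "or", "that", "the", "to", "with"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup and collapse the whitespace it leaves behind."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def tokenize(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    freqs = Counter(tokenize(text))
    ranked = sorted(
        enumerate(sentences),
        key=lambda pair: sum(freqs[w] for w in tokenize(pair[1])),
        reverse=True,
    )
    # Restore document order among the selected sentences.
    keep = sorted(index for index, _ in ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


page = ("<html><head><style>p {color: red}</style></head><body>"
        "<p>Cats chase mice.</p><p>Cats sleep all day. Dogs bark.</p>"
        "</body></html>")
print(summarize(html_to_text(page)))  # prints "Cats sleep all day."
```

Scoring by raw word frequency favors long sentences, and the regex-based sentence splitter trips over abbreviations. Those are exactly the kinds of rough edges a library like NLTK is meant to smooth over.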
Additional NLP resources in Python
- Natural Language Toolkit for Python (NLTK): http://www.nltk.org/
- Natural Language Processing with Python (book): http://oreilly.com/catalog/9780596516499/ (free online version: http://www.nltk.org/book/)
- Python Text Processing with NLTK 2.0 Cookbook (book): http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book
- Python wrapper for the Stanford CoreNLP Java library: https://pypi.python.org/pypi/corenlp
- guess_language (Python library for language identification): https://bitbucket.org/spirit/guess_language
- MITIE (new C/C++-based NER library from MIT with a Python API): https://github.com/mit-nlp/MITIE
- gensim (topic modeling library for Python): http://radimrehurek.com/gensim/
R is a programming language popular in statistics and machine learning research. R has several advantages in the ML/stat domains. R is optimized for vector operations. This simplifies programming since your code is very close to the math that you're trying to execute. R also has a huge community behind it; packages exist for just about any application you can think of. R has a close relationship with C, C++, and Fortran and there are R packages to execute Java and Python code, increasing its flexibility. Finally, the folks at CRAN are zealous about version control and compatibility, making installing R and subsequent packages a smooth experience.
However, R does have some sharp edges that become obvious when working with any non-trivially-sized linguistic data. R holds all data in your active workspace in RAM. If you are running R on a 32-bit system, R can access at most 4 GB of RAM. There are two implications of this: NLP data need to be stored in memory-efficient objects (more on that later) and (regrettably) there is still a hard limit on how much data you can work on at one time. There are packages, such as `bigmemory`, that are moving to address this, but they are outside the scope of this presentation. You also need to write efficient code; the size of NLP data will punish you for inefficiencies.
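A bit of back-of-the-envelope arithmetic shows why memory-efficient objects matter. The sketch below is plain Python and the idea is language-independent. Apart from the 10,000-document count borrowed from the opening of this post, the sizes are assumptions picked for illustration: a 50,000-term vocabulary, roughly 200 distinct terms per document, 8-byte doubles, and 4-byte integer indices in a compressed sparse column layout (one common sparse format).

```python
# Assumed sizes for illustration only.
N_DOCS = 10_000
VOCAB_SIZE = 50_000        # assumption: distinct terms in the corpus
TERMS_PER_DOC = 200        # assumption: distinct terms in a typical document
BYTES_PER_DOUBLE = 8
BYTES_PER_INT = 4
ADDRESS_SPACE_32BIT = 2 ** 32


def dense_bytes(n_rows, n_cols):
    """Every cell stored as a double, zeros included."""
    return n_rows * n_cols * BYTES_PER_DOUBLE


def sparse_bytes(n_nonzero, n_cols):
    """Compressed sparse column: a value and a row index per non-zero cell,
    plus one pointer per column (and one extra to mark the end)."""
    per_entry = BYTES_PER_DOUBLE + BYTES_PER_INT
    return n_nonzero * per_entry + (n_cols + 1) * BYTES_PER_INT


dense = dense_bytes(N_DOCS, VOCAB_SIZE)
sparse = sparse_bytes(N_DOCS * TERMS_PER_DOC, VOCAB_SIZE)

print(f"dense:  {dense / 1e9:.1f} GB "
      f"({dense / ADDRESS_SPACE_32BIT:.0%} of a 32-bit address space)")
print(f"sparse: {sparse / 1e6:.1f} MB")
# dense:  4.0 GB (93% of a 32-bit address space)
# sparse: 24.2 MB
```

Under these assumptions a dense matrix alone nearly exhausts what a 32-bit process can address, while the sparse version is smaller by a factor of well over a hundred.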
When, then, does R make sense for NLP? Every person and every problem is unique, but I can offer a few suggestions:
1. You are doing statistics/ML research and not developing software.
2. (Similar to 1.) You are a quantitative generalist (and probably good in R already) and NLP is just another feather in your cap.
Sometimes being a data scientist is about developing and tweaking your own algorithms. Sometimes being a data scientist is taking others' algorithms, plugging in your data, and moving on to other areas of the problem. If you are doing more of the former, R is a solid choice. If you are doing more of the latter, R isn't too bad. But I've found that my code often runs faster than some of the pre-packaged code. Your individual mileage may vary.
The second presentation at this meetup will cover the basics of reading documents into R and creating a document-term matrix, then demonstrate some basic document summarization, keyword extraction, and document clustering techniques.
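If you haven't met a document-term matrix before, it is simply a table with one row per document, one column per term, and a count in each cell. The sketch below builds one and uses it for a simple form of keyword extraction. It is written in plain Python rather than with the R packages the talk will cover, the function names (`document_term_matrix`, `top_keywords`) are made up for illustration, and TF-IDF weighting is just one common way to rank keywords, not necessarily the method that will be presented.

```python
import math
import re
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i] maps term -> count in docs[i].

    Each row stores only the terms that occur, so the matrix stays sparse.
    """
    rows = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*rows))
    return vocabulary, rows


def top_keywords(rows, doc_index, k=3):
    """Rank one document's terms by TF-IDF: count * log(N / document frequency)."""
    n_docs = len(rows)
    doc_freq = Counter(term for row in rows for term in row)

    def tfidf(term):
        return rows[doc_index][term] * math.log(n_docs / doc_freq[term])

    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(rows[doc_index], key=lambda term: (-tfidf(term), term))
    return ranked[:k]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog barked at the mailman",
]
vocabulary, rows = document_term_matrix(docs)

# Dense view of the second document's row, one cell per vocabulary term.
print(vocabulary)
print([rows[1][term] for term in vocabulary])
print(top_keywords(rows, 1))  # prints ['chased', 'cat', 'dog']
```

Note that "the" appears twice in the second document but scores zero, because a term found in every document carries no information about any one of them.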
Seats are filling up quickly, so RSVP here now: http://www.meetup.com/stats-prog-dc/events/177772322/