Data Community DC and District Data Labs are hosting a Natural Language Processing with Python workshop on Saturday, April 9th, from 9am to 5pm. Register before March 26th for an early bird discount!
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world: unstructured data that by its very nature holds latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that with Python and the Natural Language Toolkit (NLTK) in particular.
NLTK is an excellent library for machine learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python and NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
WHAT YOU WILL LEARN
In this course we will begin by exploring NLTK through the corpora that ship with it, which will give us a feel for the features and functionality that NLTK offers. This will take up the first part of the course. However, most NLP practitioners want to work on their own corpora, so during the second half of the course we will focus on building a language-aware data product from a specific corpus: a topic identification and document clustering algorithm built on a web crawl of blog sites. The clustering algorithm will start with a simple Lesk K-Means clustering and will then be improved with an LDA analysis.
The following are the one-hour modules that will make up the course.
Part One: Using NLTK
- Introduction to NLTK: code + resources = magic
- The counting of things: concordances, frequency distributions, tokenization
- Tagging and parsing: PoS tagging, NERC, syntactic parsing
- Classifying text: sentiment analysis, document classification
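To give a flavor of the "counting of things" module, here is a minimal sketch of tokenization, a frequency distribution, and a concordance. It deliberately uses only the Python standard library so it runs anywhere; the workshop itself uses NLTK, which provides richer versions of these tools. The `tokenize` and `concordance` helpers and the sample sentence are illustrative assumptions, not course material.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a naive regex tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def concordance(tokens, word, width=2):
    """Return each occurrence of `word` with `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines


# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

# A frequency distribution is a mapping from token to count.
freq = Counter(tokens)

print(freq.most_common(3))
print(concordance(tokens, "sat"))
```

A real tokenizer has to handle punctuation, contractions, and sentence boundaries far more carefully than this regular expression does, which is one reason to reach for a library.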
Part Two: Building an NLP Data Product
- Using the NLTK API to wrap a custom corpus
- Word vectors for K-Means clustering
- LDA for topic analysis
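As a rough illustration of the clustering module, the sketch below groups a few tiny documents with K-Means. It is a simplified stand-in under stated assumptions: documents are represented as plain term-count vectors (one possible reading of "word vectors"), centroids are seeded with the first k documents to keep the run deterministic, and everything is written with the standard library rather than the tooling used in the workshop. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def vectorize(docs):
    """Turn tokenized documents into term-count vectors over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return [[Counter(doc)[term] for term in vocab] for doc in docs]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=10):
    """Assign each vector to one of k clusters and return the list of labels."""
    # Assumption: seed centroids with the first k vectors (deterministic, not random).
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus: two programming documents and two cooking documents.
docs = [
    "python code library".split(),
    "recipe kitchen tomato".split(),
    "python library code module".split(),
    "kitchen recipe tomato knife".split(),
]
labels = kmeans(vectorize(docs), k=2)
print(labels)
```

On a real web crawl, raw counts are usually replaced by weighted or normalized vectors, and the number of clusters has to be chosen empirically.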
Notably not mentioned: morphology, n-gram language models, search, raw text preprocessing, word sense disambiguation, pronoun resolution, language generation, machine translation, textual entailment, question and answer systems, summarization, etc.
After taking this workshop, students will be able to create a Python module that wraps their own corpora and begin to leverage NLTK tools against it. They will also have an understanding of the features and functionality of NLTK, and a working knowledge of how to architect applications that use NLP. Finally, students who complete this course will have built an information extraction system that performs topic analyses on a corpus of documents.
INSTRUCTOR: BENJAMIN BENGFORT
Benjamin Bengfort is a Data Scientist who lives inside the beltway but ignores politics (the normal business of DC), favoring technology instead. He is currently working to finish his PhD at the University of Maryland, where he studies machine learning and distributed computing. His focus is on highly consistent local distributed storage and visual diagnostics for data modeling. The lab next door does have robots and, much to his chagrin, they seem to constantly arm said robots with knives and tools, presumably to pursue culinary accolades. Having seen a robot attempt to slice a tomato, Benjamin prefers his own adventures in the kitchen, where he specializes in fusion French and Guyanese cuisine as well as BBQ of all types. A professional programmer by trade and a Data Scientist by vocation, Benjamin writes on a diverse range of subjects, from Natural Language Processing to Data Science with Python to analytics with Hadoop and Spark.