Data Community DC and District Data Labs are excited to be hosting a Natural Language Analysis with NLTK workshop on October 25th. For more info and to sign up, go to http://bit.ly/1pK0pFN. There’s even an early bird discount if you register before October 3rd!
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world: unstructured data that by its very nature carries latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that with Python and the Natural Language Toolkit (NLTK).
NLTK is an excellent library for machine learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses as you go. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
What You Will Learn
In this course we will begin by exploring NLTK through the corpora that it already comes with, and in this way we will get a feel for the various features and functionality that NLTK has. This will last us the first part of the course. However, most NLP practitioners want to work on their own corpora, so during the second half of the course we will focus on building a language-aware data product from a specific corpus: a topic identification and document clustering algorithm built on a web crawl of blog sites. The clustering will start with a simple Lesk K-Means approach and then improve on it with an LDA analysis.
The course is made up of the following one-hour modules.
Part One: Using NLTK
• Introduction to NLTK: code + resources = magic
• The counting of things: concordances, frequency distributions, tokenization
• Tagging and parsing: part-of-speech (PoS) tagging, named entity recognition and classification (NERC), syntactic parsing
• Classifying text: sentiment analysis, document classification
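As a rough illustration of the "counting of things" module above, here is a minimal sketch of tokenization, a frequency distribution, and a concordance. It uses only the Python standard library as a stand-in, so it is not the NLTK API itself (NLTK offers similar functionality through classes such as `FreqDist`). The helper names `tokenize` and `concordance` and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def concordance(tokens, target, width=2):
    """Return each occurrence of `target` with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines


# Hypothetical sample text, not taken from any course corpus
text = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(text)

# A frequency distribution is a mapping from token to count
freq = Counter(tokens)
print(freq.most_common(3))

# Show every occurrence of "sat" in its surrounding context
for line in concordance(tokens, "sat"):
    print(line)
```

A regular expression tokenizer like this one is deliberately naive; real text usually needs more careful handling of punctuation, contractions, and casing.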
Part Two: Building an NLP Data Product
• Using the NLTK API to wrap a custom corpus
• Word vectors for K-Means clustering
• LDA for topic analysis
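To give a feel for the clustering step in Part Two, the following is a minimal sketch of plain K-Means over term-count vectors, written with the standard library only. It is not the course's implementation: it leaves out the Lesk component and LDA entirely, seeds the centroids with the first k documents (an assumption chosen to keep the run deterministic), and uses four invented toy documents. The function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(docs):
    """Turn tokenized documents into term-count vectors over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([float(counts[term]) for term in vocab])
    return vocab, vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, iterations=10):
    """Assign each vector to one of k clusters and return the cluster labels."""
    # Assumption: seed centroids with the first k vectors (deterministic, not random)
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels


# Invented toy documents standing in for crawled blog posts
docs = [
    "python code python library".split(),
    "election vote senate vote".split(),
    "python library code module".split(),
    "senate election campaign vote".split(),
]
vocab, vectors = vectorize(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

Raw term counts are the simplest possible document vectors; weighting schemes such as TF-IDF and random or smarter centroid seeding are common refinements.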
Notably not mentioned: morphology, n-gram language models, search, raw text preprocessing, word sense disambiguation, pronoun resolution, language generation, machine translation, textual entailment, question and answer systems, summarization, etc.
After taking this workshop, students will be able to create a Python module that wraps their own corpora and begin to leverage NLTK tools against them. They will also have an understanding of the features and functionality of NLTK, and a working knowledge of how to architect applications that use NLP. Finally, students who complete this course will have built an information extraction system that performs topic analyses on a corpus of documents.
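The idea of a module that wraps your own corpus can be sketched roughly as follows. This is a standard-library stand-in, not NLTK code: the class name `SimpleCorpus` is hypothetical, and its `fileids`, `raw`, and `words` methods only mirror the method names found on NLTK's corpus readers. The two tiny text files are invented for the demonstration.

```python
import os
import re
import tempfile


class SimpleCorpus:
    """Hypothetical wrapper around a directory of .txt files."""

    def __init__(self, root):
        self.root = root

    def fileids(self):
        """List the text files that make up the corpus, in sorted order."""
        return sorted(f for f in os.listdir(self.root) if f.endswith(".txt"))

    def raw(self, fileid):
        """Return the unprocessed contents of one file."""
        path = os.path.join(self.root, fileid)
        with open(path, encoding="utf-8") as handle:
            return handle.read()

    def words(self, fileid=None):
        """Return word tokens for one file, or for the whole corpus."""
        ids = [fileid] if fileid else self.fileids()
        tokens = []
        for fid in ids:
            tokens.extend(re.findall(r"\w+", self.raw(fid)))
        return tokens


# Build a throwaway two-document corpus to exercise the wrapper
samples = [("a.txt", "Blogs about Python."), ("b.txt", "Blogs about politics.")]
with tempfile.TemporaryDirectory() as root:
    for name, body in samples:
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(body)
    corpus = SimpleCorpus(root)
    ids = corpus.fileids()
    all_words = corpus.words()
    first_words = corpus.words("a.txt")

print(ids)
print(all_words)
```

Keeping file access behind a small interface like this means downstream counting, tagging, or clustering code does not need to know how the documents are stored on disk.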
Instructor: Benjamin Bengfort
Benjamin is an experienced Data Scientist and Python developer who has worked in the military, industry, and academia for the past eight years. He is currently pursuing his PhD in Computer Science at The University of Maryland, College Park, doing research in Machine Learning and Natural Language Processing. He holds a Master's degree from North Dakota State University, where he taught undergraduate Computer Science courses. He is also adjunct faculty at Georgetown University, where he teaches Data Science and Analytics. He has built many language-aware data products, including classifier systems, language models (both sequential and connectionist), and semantic recognition systems.