workshop

Building Data Apps with Python Workshop Returns on June 6th

Data products are usually software applications that derive their value from data by leveraging the data science pipeline and that generate data through their operation. They aren’t apps with data, nor are they one-time analyses that produce insights; they are operational and interactive. The rise of these types of applications has directly contributed to the rise of the data scientist and the idea that data scientists are professionals “who are better at statistics than any software engineer and better at software engineering than any statistician.”

These applications have largely been built with Python. Python is flexible enough to support very rapid development on many different types of servers, and it has a rich tradition in web applications. Python contributes to every stage of the data science pipeline, including real-time ingestion and the production of APIs, and it is powerful enough to perform machine learning computations. In this class we’ll produce a data product with Python, leveraging every stage of the data science pipeline to produce a book recommender.

Social Network Analysis with Python Workshop on November 22nd

Data Community DC and District Data Labs are hosting a full-day Social Network Analysis with Python workshop on Saturday, November 22nd. For more info and to sign up, go to http://bit.ly/1lWFlLx. Register before October 31st for an early bird discount!

Social networks are not new, even though websites like Facebook and Twitter might make you want to believe they are (and trust me, I’m not talking about Myspace!). Social networks are extremely interesting models for human behavior, and their study dates back to the early twentieth century. However, because of those websites, data scientists have access to much more data than the anthropologists who studied the networks of tribes!

Natural Language Analysis with NLTK on October 25th

Data Community DC and District Data Labs are excited to be hosting a Natural Language Analysis with NLTK workshop on October 25th. For more info and to sign up, go to http://bit.ly/1pK0pFN. There’s even an early bird discount if you register before October 3rd!

Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world: unstructured data that by its very nature holds latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that with Python and the Natural Language Toolkit (NLTK).

High-Performance Computing in R Workshop

Data Community DC and District Data Labs are excited to be hosting a High-Performance Computing with R workshop on June 21st, 2014, taught by Yale professor and R package author Jay Emerson. If you're interested in learning about high-performance computing, including concepts such as memory management, algorithmic efficiency, parallel programming, handling larger-than-RAM matrices, and using shared memory, this is an awesome way to learn!

To reserve a spot, go to http://bit.ly/ddlhpcr.

Overview

This intermediate-level masterclass will introduce you to topics in high-performance computing with R. We will begin by examining a range of related topics including memory management and algorithmic efficiency. Next, we will quickly explore the new parallel package (containing snow and multicore). We will then concentrate on the elegant framework for parallel programming offered by the foreach package and its associated parallel backends. The R package management system, including the C/C++ interface and use of the Rcpp package, will be covered. We will conclude with basic examples of handling larger-than-RAM numeric matrices and use of shared memory. Hands-on exercises will be used throughout.

What will I learn?

Different people approach statistical computing with R in different ways. It can be helpful to work on real data problems and learn something about R “on the fly” while trying to solve a problem. But it is also useful to have a more organized, formal presentation without the distraction of a complicated applied problem. This course offers four distinct modules which adopt both approaches and offer some overlap across the modules, helping to reinforce the key concepts. This is an active-learning class where attendees will benefit from working along with the instructor. Roughly, the modules include:

An intensive review of the core language syntax and data structures for working with and exploring data. Functions; conditionals; arguments; loops; subsetting; manipulating and cleaning data; efficiency considerations and best practices, including loops and vector operations, memory overhead, and optimizing performance.

Motivating parallel programming with an eye on programming efficiency: a case study. Processing, manipulating, and conducting a basic analysis of 100-200 MB of raw microarray data provides an excellent challenge on standard laptops. It is large enough to be mildly annoying, yet small enough that we can make progress and see the benefits of programming efficiency and parallel programming.

Topics in high-performance computing with R, including packages parallel and foreach. Hands-on examples will help reinforce key concepts and techniques.

Authoring R packages, including an introduction to the C/C++ interface and the use of Rcpp for high-performance computing. Participants will build a toy package including calls to C/C++ functions.

Is this class right for me?

This class will be a good fit for you if you are comfortable working in R and are familiar with R's core data structures (vectors, matrices, lists, and data frames). You are comfortable with for loops and preferably aware of R's apply-family of functions. Ideally you will have written a few functions on your own. You have some experience working with R, but are ready to take it to the next level. Or, you may have considerable experience with other programming languages but are interested in quickly getting up to speed in the areas covered by this masterclass.

After this workshop, what will I be able to do?

You will be in a better position to code efficiently with R, perhaps avoiding the need, in some cases, to resort to C/C++ or parallel programming. But you will be able to implement so-called embarrassingly parallel algorithms in R when the need arises, and you'll be ready to exploit R's C/C++ interface in several ways. You'll be in a position to author your own R package that can include C/C++ code.

All participants will receive electronic copies of all slides, data sets, exercises, and R scripts used in the course.

What do I need to bring?

You will need your laptop with the latest version of R. I recommend use of the RStudio IDE, but it is not necessary. A few add-on packages will be used in the workshop, including Rcpp and foreach. As a complement to foreach you should also install doMC (Linux or MacOS only) and doSNOW (all platforms). If you want to work along with the C/C++ interface segment, some extra preparation will be required. Rcpp and use of the C/C++ interface require compilers and extra tools; the folks at RStudio have a nice page that summarizes the requirements. Please note that these requirements may not be trivial (particularly on Windows) and need to be completed prior to the workshop if you intend to compile C/C++ code and use Rcpp during the workshop.

Instructor

John W. Emerson (Jay) is Director of Graduate Studies in the Department of Statistics at Yale University. He teaches a range of graduate and undergraduate courses as well as workshops, tutorials, and short courses at all levels around the world. His interests are in computational statistics and graphics, and his applied work ranges from topics in sports statistics to bioinformatics, environmental statistics, and Big Data challenges.

He is the author of several R packages including bcp (for Bayesian change point analysis), bigmemory and sister packages (towards a scalable solution for statistical computing with massive data), and gpairs (for generalized pairs plots). His teaching style is engaging and his workshops are active, hands-on learning experiences.

You can reserve your spot by going to http://bit.ly/ddlhpcr.

Building Data Apps with Python

Data Community DC and District Data Labs are excited to be offering a Building Data Apps with Python workshop on April 19th, 2014. Python is one of the most popular programming languages for data analysis, so a basic working knowledge of the language is important for tackling more complex topics in data science and natural language processing. The purpose of this one-day course is to introduce the development process in Python using a project-based, hands-on approach.

This course is focused on Python development in a data context for those who aren’t familiar with Python. Other courses like Python Data Analysis focus on data analytics with Python, not on Python development itself.

The main workshop will run from 11am - 6pm with an hour break for lunch around 1pm. For those who are new to programming, there will be an optional introductory session from 9am - 11am aimed at getting you comfortable enough with Python development to follow along in the main session.

Introductory Session: Python for New Programmers (9am - 11am)

The morning session will teach the fundamentals of Python to those who are new to programming. Learners will be grouped with a TA to ensure their success in the second session. The goal of this session is to ensure that students can demonstrate basic concepts in a classroom environment through successful completion of hands-on exercises. This beginning session will cover the following basic topics and exercises:

Topics:

  • Variables
  • Expressions
  • Conditionality
  • Loops
  • Executing Programs
  • Object-Oriented Programming
  • Functions
  • Classes

Exercises:

  • Write a function to determine if input is even or odd
  • Read data from a file
  • Count the words/lines in a file
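
As a rough idea of what the exercises above involve, here is a minimal sketch. The function names `is_even` and `count_words_and_lines`, and the sample file it writes, are illustrative assumptions and not the workshop's actual materials.

```python
import os
import tempfile


def is_even(number):
    """Return True if the integer is even, False if it is odd."""
    return number % 2 == 0


def count_words_and_lines(path):
    """Read a text file and return a (words, lines) tuple."""
    words = 0
    lines = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            lines += 1
            # split() with no arguments splits on any run of whitespace
            words += len(line.split())
    return words, lines


# Write a small sample file so the example is self-contained (made-up text).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")
    sample_path = tmp.name

word_count, line_count = count_words_and_lines(sample_path)
os.remove(sample_path)

print(is_even(4), is_even(7))
print(word_count, line_count)
```

Reading the file line by line, rather than loading it all at once, keeps the same code usable on files that are larger than memory.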

At the end of this session, students should be familiar enough with programming concepts in Python to be able to follow along in the second session. They will have acquired a learning cohort in their classmates and instructors to help them learn Python more thoroughly in the future, and they will have observed Python development in action.

Main Session: Building a Python Application (11am - 6pm)

The afternoon session will focus on Python application development for those who already know how to program and are familiar with Python. In particular, we’ll build a data application from beginning to end in a workshop fashion. This course is intended as a prerequisite for all other DDL courses that use Python.

The following topics will be covered:

  • Basic project structure
  • virtualenv & virtualenvwrapper
  • Building requirements outside the stdlib
  • Testing with nose
  • Ingesting data with requests
  • Munging data into SQLite Databases
  • Some simple computations in Python
  • Reporting data with JSON
  • Data visualization with Jinja2 and Highcharts

We will build a Python application using the data science workflow: using Python to ingest, munge, compute, report, and even visualize. This is a basic, standard workflow that is repeatable and paves the way for more advanced courses using numerical and statistical packages in Python like Pandas and NumPy. In particular, we’ll fetch data from Data.gov, transform it and store it in a SQLite database, then do some simple computation. Then we will use Python to push our analyses out in JSON format and provide a simple reporting technique with Jinja2 and charting using Highcharts.
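
To make the ingest, munge, compute, and report stages a little more concrete, here is a minimal sketch using only the standard library. It is not the workshop's application: the inline CSV stands in for data that would be fetched from Data.gov, and the table name `readings`, its columns, and all of the function names are hypothetical. The visualization step with Jinja2 and Highcharts is left out.

```python
import csv
import io
import json
import sqlite3

# Stand-in for data that would be fetched from Data.gov; the columns and
# values here are made up for illustration.
RAW_CSV = """state,year,value
DC,2012,10
DC,2013,14
MD,2012,20
MD,2013,26
"""


def ingest(raw_text):
    """Parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def munge(rows, conn):
    """Convert types and store the rows in a SQLite table."""
    conn.execute("CREATE TABLE readings (state TEXT, year INTEGER, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?)",
        [(r["state"], int(r["year"]), float(r["value"])) for r in rows],
    )
    conn.commit()


def compute(conn):
    """Compute the average value per state."""
    cursor = conn.execute(
        "SELECT state, AVG(value) FROM readings GROUP BY state ORDER BY state"
    )
    return {state: average for state, average in cursor}


def report(results):
    """Serialize the results as a JSON string."""
    return json.dumps(results, sort_keys=True)


conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
munge(ingest(RAW_CSV), conn)
averages = compute(conn)
report_json = report(averages)
conn.close()

print(report_json)
```

Keeping each stage in its own function means the in-memory stand-ins could later be swapped for a real HTTP fetch or an on-disk database without touching the other stages.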

For more information and to reserve a spot, go to http://bit.ly/1m0y5ws.

Hope to see you there!