deep learning

TensorFlow's DC Introduction

Hello, TensorFlow!, a post introducing the basic workings of the TensorFlow deep learning framework, is now up in O'Reilly's Data, AI, and Learning sections, and it is a product of the local data community.

Aaron Schumacher, one of the Data Science DC organizers and an employee of Arlington-based Deep Learning Analytics, wrote the article with the support of many local reviewers, including feedback from members of the DC Machine Learning Journal Club.

Aaron will be giving a talk on the material of Hello, TensorFlow! on Wednesday, June 29, as part of the Deep Dive into TensorFlow Meetup hosted at Sapient in Arlington. It should be a great opportunity to explore and discuss this new and exciting tool!
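For readers who want a taste before reading the article or attending the talk, here is the classic minimal example in the graph-then-session style of early TensorFlow. This is a generic sketch of that workflow, not an excerpt from Aaron's post.

```python
# Minimal "Hello, TensorFlow!" in the graph-then-session style of TensorFlow 1.x.
import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')  # define a constant node in the graph
with tf.Session() as sess:                 # a Session actually executes the graph
    print(sess.run(hello))                 # prints the greeting
```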

Announcing Discussion Lists! First up: Deep Learning

Data Community DC is pleased to announce a new service for the area data community: topic-specific discussion lists! We hope to extend the successes of our Meetups and workshops by giving groups of local people with similar interests a way to maintain contact and have ongoing discussions. Our first discussion list will be on the topic of Deep Learning.

Below is a guest post from John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup.

A while back, there was this blog post about Deep Learning. At the end, we asked readers about their interest in hands-on Deep Learning tutorials.

ELEVEN

The results are in, and the survey went to 11. As in all data science, context matters, and this eleven is decidedly less inspiring than Nigel Tufnel's eleven. That said, ten of the eleven respondents wanted a hands-on Deep Learning tutorial, and eight said they would register for a tutorial even if it required hardware approval or enrollment in a hardware tutorial. But interest in practical, hands-on Deep Learning workshops appears to be highly nonuniform: one respondent said they'd drive hundreds of miles to attend, yet of the 3,000+ data scientists in DC's data and analytics community, presumably mostly local, only eleven responded at all.

In short, the survey was a bust.

So it's still not clear what the area data community wants out of Deep Learning, if anything. But since April I've gotten plenty of questions from plenty of people about Deep Learning, on everything from hardware to parameter tuning, so I know there's more interest than the survey captured. Since many of these questions are probably shared, a discussion list might help us figure out how we can best help the most members get started in Deep Learning.

So how about a Deep Learning discussion list? If you’re a local and want to talk about Deep Learning, sign up here:

https://groups.google.com/a/datacommunitydc.org/d/forum/deeplearning

For the record, this discussion list was Harlan's original suggestion. If you're looking to take away any rules of thumb here, a simple one is "just agree with whatever Harlan says." Tommy Jones and I will run this discussion list for now. To be clear, this list caters to the specific Deep Learning interests of data enthusiasts in the DC area. For a bigger community, there's always deeplearning.net, the Deep Learning Google+ page, and the individual mailing lists and git repos for specific Deep Learning codebases, like Caffe, pylearn2, and Torch7.

In the meantime, I was happy to see some Deep Learning interest at DC NLP’s Open Mic night by Christo Kirov. And NLP data scientists need not watch Deep Learning developments from the sidelines anymore; some recent motivating results in the NLP space have been summarized in a tutorial by Richard Socher. I’m not qualified to say whether these are the kind of historic breakthroughs we’ve recently seen in speech recognition and object recognition, but it’s worth taking a look at what's happening out there.

Where are the Deep Learning Courses?

This is a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup.

Why aren't there more Deep Learning talks, tutorials, or workshops in DC2?

It's been about two months since my Deep Learning talk at Artisphere for DC2. Again, thanks to the organizers (especially Harlan Harris and Sean Gonzalez) and the sponsors (especially Arlington Economic Development). We had a great turnout and a lot of good questions that night. Both at the talk and at other Meetups since, I've been encouraged by the tidal wave of interest from teaching organizations and prospective students alike.

First, some preemptive answers to the "FAQ" downstream of the talk:

  • Mary Galvin wrote a blog review of this event.
  • Yes, the slides are available.
  • Yes, corresponding audio is also available (thanks Geoff Moes).
  • A recently "reconstructed" talk combining the slides and audio is also now available!
  • Where else can I learn more about Deep Learning as a data scientist? (This may be a request to teach, a question about how to do something in Deep Learning, a question about theory, or a request to do an internship. They're all basically the same thing.)
It's this last question that's the focus of this blog post. Lots of people have asked, and there are some answers out there already, but if people in the DC MSA are really interested, there could be more. At the end of this post is a survey: if you want more Deep Learning, let DC2 know what you want and together we'll figure out what we can make happen.

There actually was a class...

Aaron Schumacher and Tommy Shen invited me to come talk in April for General Assemb.ly's Data Science course. I did teach one Deep Learning module for them. That module was a slightly longer version of the talk I gave at Artisphere, combined with one abbreviated hands-on module on unsupervised feature learning based on Stanford's tutorial. It didn't help that the tutorial was written in Octave and the class had mostly been using Python up to that point. Though feedback was generally positive for the Deep Learning module, some students wondered if they could get a little more hands-on and focus on specifics. And I empathize with them. I've spent real money on Deep Learning tutorials that I thought could have been much more useful had they been more hands-on.

Though I've appreciated all the invitations to teach courses, workshops, or lectures, except for the General Assemb.ly course, I've turned down all the invitations to teach something more on Deep Learning. This is not because the data science community here in DC is already expert in Deep Learning or because it's not worth teaching. Quite the opposite. I've not committed to teach more Deep Learning mostly because of these three reasons:

  1. There are already significant Deep Learning tutorial resources out there,
  2. There are significant front-end investments that neophytes need to make for any workshop or tutorial to be valuable to both the class and the instructor, and
  3. I haven't found a teaching model in the DC MSA that convinces me teaching a “traditional” class in the formal sense is a better investment of time than instruction through project-based learning on research work contracted through my company.

Resources to learn Deep Learning

There are already many freely available resources to learn the theory of Deep Learning, and it's made even more accessible by many of the very lucid authors who participate in this community. My talk was cherry-picked from a number of these materials and news stories. Here are some representative links that can connect you to much of the mainstream literature and discussion in Deep Learning:

  • The tutorials link on the DeepLearning.net page
  • NYU's Deep Learning course material
  • Yann LeCun's overview of Deep Learning with Marc'Aurelio Ranzato
  • Geoff Hinton's Coursera course on Neural Networks
  • A book on Deep Learning from the Microsoft Speech Group
  • A reading list from Carnegie Mellon with student notes on many of the papers
  • A Google+ page on Deep Learning

This is the first reason I don't think it's all that valuable for DC to have more of its own Deep Learning “academic” tutorials. And by “academic” I mean tutorials that don't end with students leaving the class successfully implementing systems that learn representations to do amazing things with those learned features. I'm happy to give tutorials in that “academic” direction or shape them based on my own biases, but I doubt I'd improve on what's already out there. I've been doing machine learning for 15 years, so I start with some background to deeply appreciate Deep Learning, but I've only been doing Deep Learning for two years now. And my expertise is self-taught. And I never did a post-doc with Geoff Hinton, Yann LeCun or Yoshua Bengio. I'm still learning, myself.

The investments to go from 0 to Deep Learning

It's a joy to teach motivated students who come equipped with all the prerequisites for really mastering a subject. That said, teaching a less equipped, uninvested and/or unmotivated studentry is often an exercise in joint suffering for both students and instructor.

I believe the requests to have a Deep Learning course, tutorial, workshop, or another talk are all well-intentioned... Except for Sean Gonzalez—it creeps me out how much he wants a workshop. But I think most of this eager interest in tutorials overlooks just how much preparation a student needs to get a good return on their time and tuition. And if they're not getting a good return, what's the point? The last thing I want to do is give the DC2 community a tutorial on "the Past" of neural nets. Here are what I consider some practical prerequisites for folks to really get something out of a hands-on tutorial:

  • An understanding of machine learning, including
    • optimization and stochastic gradient descent
    • hyperparameter tuning
    • bagging
    • at least a passing understanding of neural nets
  • A pretty good grasp of Python, including
    • a working knowledge of how to configure different packages
    • some appreciation for Theano (warts and all)
    • a good understanding of data preparation
  • Some recent CUDA-capable NVIDIA GPU hardware* configured for your machine
    • CUDA drivers
    • NVIDIA's CUDA examples compiled

*Hardware isn't strictly a prerequisite, but I don't know how you can get beyond toy problems on a CPU. (A minimal check that the GPU pieces are in place is sketched below.)
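If you want a quick way to confirm that the last set of prerequisites is actually satisfied, the sketch below follows Theano's standard GPU test: compile a trivial function and check whether its graph was placed on GPU ops. It assumes a working Theano install and configured CUDA drivers.

```python
# Sanity check: is Theano actually compiling to the GPU?
# Run as: THEANO_FLAGS=device=gpu,floatX=float32 python check_gpu.py
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
f = theano.function([x], T.exp(x))  # a trivial elementwise function

data = np.random.rand(1000, 1000).astype(theano.config.floatX)
f(data)

# If any op in the compiled graph is a Gpu* op, the GPU is being used.
used_gpu = any(type(node.op).__name__.startswith('Gpu')
               for node in f.maker.fgraph.toposort())
print('device:', theano.config.device)
print('running on GPU:', used_gpu)
```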

Resources like the ones above are great for getting a student up to speed on the “academic” issues of understanding deep learning, but that only scratches the surface. Once students know what can be done, if they’re anything like me, they want to be able to do it. And at that point, students need a pretty deep understanding of not just the theory, but of both hardware and software to really make some contributions in Deep Learning. Or even apply it to their problem.

Starting with the hardware, let's say, for the sake of argument, that you work for the government or are for some other arbitrary reason forced to buy Dell hardware. You begin your journey justifying the $4,000 purchase for a machine that might be semi-functional as a Deep Learning platform when there's a $2,500 guideline in your department. Individual Dell workstations are like Deep Learning kryptonite, so even if someone in the n layers of approval bureaucracy somehow approved it, it's still the beginning of a frustrating story with an unhappy ending. Or let's say you build your own machine. Now add "building a machine," for a minimum of about $1,500, to the prerequisites. But to really get a return in the sweet spot of those components, you probably want to spend at least $2,500. Now the prerequisites include a dollar investment in addition to talent and tuition! Or let's say you're just going to upgrade the three-year-old machine you already have for this new capability. Oh, you only have a 500W power supply? Lucky you! You're going shopping! Oh, your machine has an ATI graphics card? I'm sure it's just a little bit of glue code to repurpose CUDA calls to OpenCL calls for that hardware. Or let's say you actually have an NVIDIA card (at least as recent as a GTX 580) and want to develop in virtual machines, so you need PCI pass-through to reach the CUDA cores. Lucky you! You have some more reading to do! Pray DenverCoder9's made a summary post in the past 11 years.

“But I run everything in the cloud on EC2,” you say! It's $0.65/hour for G2 instances. And those are the cheap GPU instances. Back of the envelope, it took a week of churning through 1.2 million training images with CUDA convnets (optimized for speed) to produce a breakthrough result. At $0.65/hour, you get maybe 20 or 30 tries doing that before it would have made more sense to have built your own machine. This isn't a crazy way to learn, but any psychological disincentive to experimentation, even $0.65/hour, seems like an unnecessary distraction. I also can't endorse the idea of “dabbling” in Deep Learning; it seems akin to “dabbling” in having children—you either make the commitment or you don't.
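To make that back-of-the-envelope explicit, here is the arithmetic behind the "20 or 30 tries" figure. The numbers are just the assumptions from the paragraph above, not a pricing recommendation.

```python
# Rough EC2-vs-own-machine tradeoff, using the figures cited above.
hours_per_run = 7 * 24          # ~one week of training per full experiment
cost_per_hour = 0.65            # g2 GPU instance, USD/hour
cost_per_run = hours_per_run * cost_per_hour
print(cost_per_run)             # ~109 dollars per full training run

own_machine_cost = 2500.0       # rough cost of a capable home-built box
print(own_machine_cost / cost_per_run)  # ~23 runs before the box pays for itself
```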

At this point, I'm not aware of an "import deeplearning" package in Python that can fit a nine-layer sparse autoencoder, with invisible CUDA calls to your GPU, to 10 million images at the IPython command line. Though people are trying. That's an extreme example, but in general you need a flexible, stable codebase to even experiment at a useful scale—and that's really what we data scientists should be doing. Toys are fine and all, but if scaling up means a qualitatively different solution, why learn the toy? And that means getting acquainted with the pros and cons of the various codebases out there. Or writing your own, which... Good luck!

DC Metro-area teaching models

I start from the premise that no good teacher in the history of teaching has ever been rewarded appropriately with pay for their contributions, and that most teaching rewards are personal. I accept that premise, and it's all I really ever expect from teaching. I do, however, believe teaching is becoming even less attractive to good teachers every year, at every stage of lifelong learning. Traditional post-secondary instructional models are clearly collapsing. Brick-and-mortar university degrees often trap graduates in debt at the same time the universities have outsourced their actual teaching mission to low-cost adjunct staff and diverted funds to marketing curricula rather than teaching them. For-profit institutions are even worse. Compensation for a career in public education has never been particularly attractive, but there have always been teachers who love to teach, are good at it, and do it anyway. However, new narrow, metric-based approaches that hold teachers responsible for the students they're dealt rather than the quality of their teaching can be demoralizing for even the most self-possessed teachers. These developments threaten to reduce that pool of quality teachers to a sparse band of marginalized die-hards.

But enough of my view of "teaching" as most people blindly suggest I do it. The formal and informal teaching options in the DC MSA mirror these broader developments. I run a company with active contracts, and however much I might love teaching and would like to see a well-trained crop of deep learning experts in the region, the investment doesn't add up. So I continue to mentor colleagues and partners through contracted research projects.

I don't know all the models for teaching and haven't spent a lot of time understanding them, but none seem to make sense to me in terms of time invested to teach students—partly because many of them really can't get at the hardware part of the list of prerequisites above. This is my vague understanding of compensation models generally available in the online space*:

  • Udemy – produce and own a "digital asset" of the course content and sell tuition and advertising as a MOOC. I have no experience with Udemy, but some people seemed happy to have made $20,000 in a month. Thanks to Valerie at Feastie for suggesting this option.
  • Statistics.com – Typically a few thousand for four sessions that Statistics.com then sells; I believe this must be a “work for hire” copyright model for the digital asset that Statistics.com buys from the instructor. I assume it's something akin to commissioned art, that once you create, you no longer own. [Editor’s note: Statistics.com is a sponsor of Data Science DC. The arrangement that John describes is similar to our understanding too.]
  • Myngle – Sell lots of online lessons for typically less than a 30% share.

And this is my understanding of compensation models locally available in the DC MSA*:

  • General Assemb.ly – Between 15% and 20% of tuition (where tuition may be $4,000 per student for a semester class).
  • District Data Labs Workshop – Splits total workshop tuition or profit 50% with the instructor—which may be the best deal I've heard, but 50% is a lot to pay for advertising and logistics. [Editor's note: These are the workshops that Data Community DC runs with our partner DDL.]
  • Give a lecture – typically a one time lecture with a modest honorarium ($100s) that may include travel. I've given these kinds of lectures at GMU and Marymount.
  • Adjunct at a local university – This is often a very labor- and commute-intensive investment and pays no better (with no benefits) than a few thousand dollars. Georgetown will pay about $200 per contact hour with students. Assuming three hours of out-of-classroom commitment for every hour in class, this probably ends up somewhere in the $50-per-hour range. All that said, this was the suggestion of a respected entrepreneur in the DC region.
  • Tenure-track position at a local university – As an Assistant Professor, you will typically have to forego being anything but a glorified post-doc until your tenure review. And good luck convincing this crowd they need you enough to hire you with tenure.

*These are what I understand to be the approximate options, and if you got a worse or better deal, please understand that I might be wrong about these specific figures. I'm not wrong, though, that none of these is "market rate" for an experienced data scientist in the DC MSA.

Currently, all of my teaching happens through hands-on internships and project-based learning at my company, where I know the students (i.e., my colleagues, coworkers, subcontractors, and partners) are motivated and I know they have sufficient resources to succeed (including hardware). When I "teach," I typically do it for free, and I try hard to avoid organizations that create asymmetrical relationships with their instructors or sell instructor time as their primary "product" while compensating the instructor at a steep discount. Though polemical, Mike Selik's "The End of Kaggle" summarized the same issue of cut-rate data science. I'd love to hear of a good model where students could really get the three practical prerequisites for Deep Learning, and of how I could help make that happen here in DC2 short of making "teaching" my primary vocation. If there's a viable model for that out there, please let me know. And if you still think you'd like to learn more about Deep Learning through DC2, please help us understand what you'd want out of it and whether you'd be able to bring your own hardware.

[Embedded survey form]

Weekly Round-Up: Open Data Order, Data Discovery, Andrew Ng, and Connected Devices

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have four fascinating articles, ranging in topic from Open Data to connected devices. In this week's round-up:

  • Open Data Order Could Save Lives, Energy Costs And Make Cool Apps
  • Four Types of Discovery Technology
  • Andrew Ng and the Quest for the New AI
  • Our Connected Future

Open Data Order Could Save Lives, Energy Costs And Make Cool Apps

This is a TechCrunch article about President Obama's recent Open Data Order, an executive order intended to make more government agency data openly available for analysis. The article goes on to talk about some of the ways open data has been used in the past and has a link to Project Open Data's Github page where you can find more details.

Four Types of Discovery Technology

This Smart Data Collective post talks about the value of discovery in data analytics and business. The author claims there are four types of discovery for business analytics (event discovery, data discovery, information discovery, and visual discovery) and goes into some detail explaining each one and the differences between them.

Andrew Ng and the Quest for the New AI

This is an interesting Wired piece about Andrew Ng, best known as the Stanford machine learning professor who also co-founded Coursera. The article talks about Ng's background and interest in artificial intelligence as well as some of the deep learning projects he is working on. It goes on to explain a little about what deep learning is and how it may evolve in the future.

Our Connected Future

Our final piece this week is a GigaOM article about connected devices and how they will become more prevalent in the future. The article highlights some very interesting devices, explains what they do, and describes how they are being used. The article also talks about the data that can be collected from connected devices such as these and different ways that this data can be used.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups