This is a guest post by Catherine Madden (@catmule), a lifelong doodler who realized a few years ago that doodling, sketching, and visual facilitation can be immensely useful in a professional environment. The post consists of her notes from the most recent Data Visualization DC Meetup. Catherine works as the lead designer for the Analytics Visualization Studio at Deloitte Consulting, designing user experiences and visual interfaces for visual analytics prototypes. She prefers Paper by Fifty Three and their Pencil stylus for digital note taking. (Click on the image to open full size.)
Below is a guest post and infographic from GovTribe, a DC startup that creates products that turn open government data into useful and understandable information. The hōrd iPhone app by GovTribe lets you understand the world of government contracting in real time.
Our latest release, hōrd 2.0, has way more data than our initial release. We spent the last six months building a completely new approach for consuming, processing, and making sense of government data from multiple sources. The iPhone app now provides insight and capability not available anywhere else. Our efforts have also given us pretty robust visibility into how the government behaves and where it allocates its resources. So we thought we'd share.
This post is the first in a series that GovTribe plans to publish. Our purpose is to find some signal in all that noisy data, and to provide some clear, interesting, and maybe even useful information about the world of federal government contracting.
We thought a good place to start, with just over a month left in fiscal year 2013, was a look back at what's been happening since October 1, 2012. Soon to come: Agency Insight. In this series we'll take a deeper look at individual agency activity. Stay tuned - and feedback is always appreciated.
These tweets represent two years of social media messages for and by the DC tech and startup community. Naturally, we thought visualizing this was a task for Data Community DC. I built two word clouds to visualize who and what people tweet about. Here's how I built them:
- Using the Python csv library, I imported the data from a csv file.
- I scrubbed each tweet/status using regular expressions and the NLTK stopword corpus (I added some of my own stopwords after examining the data).
- I used NLTK to create a frequency distribution of words and two-word phrases in the tweets/statuses.
- I fed this data to a Python word cloud library.
I also filtered out just the handles to see who gets mentioned the most.
- #DCtech is "great" and "awesome".
- We love our meetups.
- We love our journalists.
- @1776dc, @corbett3000, and @inthecapital are the cool kids in school.
Here's the code:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, csv, re, nltk
from nltk.corpus import stopwords
sys.path.append("./word_cloud")
from wordcloud import make_wordcloud
import numpy as np

SW = stopwords.words('english')
SW.extend(["rt", "via", "amp", "cc", "get", "us",
           "got", "way", "mt", "10", "gt"])

words = []

def tokenize(update):
    # Strip punctuation, lowercase, then drop links, one-letter words, and stopwords
    update = re.sub(r'\.|:|;|&|\/|#|!|"|,|-|\?|\(|▸', '', update)
    tokens = update.lower().split()
    tokens = [word for word in tokens
              if word[:4] != "http" and len(word) > 1 and word not in SW]
    bigrams = [' '.join(tokens[i:i+2]) for i in range(len(tokens) - 1)]
    return tokens + bigrams

with open("tweet_data.csv", "r") as h:
    csvreader = csv.reader(h)
    for r in csvreader:
        words.extend(tokenize(r[0]))  # tweet text is in the first column

MINCOUNT = 5
freq = nltk.FreqDist(words)
h1 = open("freq.txt", "w")
h2 = open("handles.txt", "w")
handles = []
handlecounts = []
for k, i in freq.items():
    if i > MINCOUNT:
        h1.write(k + " " + str(i) + "\n")
        if k[0] == '@' and len(k.split()) == 1:
            h2.write(k + " " + str(i) + "\n")
            handles.append(k)
            handlecounts.append(i)
h1.close()
h2.close()

# word clouds of the top terms and the top handles
NUM = 300
w, h = 1200, 600
make_wordcloud(np.array(freq.keys()[:NUM]), np.array(freq.values()[:NUM]),
               "tweets.png", width=w, height=h)
make_wordcloud(np.array(handles[:NUM]), np.array(handlecounts[:NUM]),
               "handles.png", width=w, height=h)
```
Visualizing geographic data is a task many of us face in our jobs as data scientists. Often we must visualize vast amounts of data (tens of thousands to millions of data points), render it in the browser in real time to reach the widest possible audience, and do it all with free and/or open software. Luckily for us, Google offered a series of fascinating talks at this year's (2013) IO that show one particular way of solving this problem. Even better, Google discusses all aspects of the problem: from cleaning the data at scale using legacy C++ code, to providing low-latency yet web-scale data storage, to rendering efficiently in the browser. Not surprisingly, Google's approach leans heavily on Google's own technology stack, but we won't hold that against them.
The first talk walks through an overview of where the data comes from and the collection of Google cloud services that compose the system architecture responsible for cleaning, storing, and serving the data fast enough to do real time queries. This video is very useful for understanding how the different technology layers (browser, database, virtual instances, etc) can efficiently interact.
Description: Tens of thousands of ships report their position at least once every 5 minutes, 24 hours a day. Visualizing that quantity of data and serving it out to large numbers of people takes lots of power both in the browser and on the server. This session will explore the use of Maps, App Engine, Go, Compute Engine, BigQuery, Big Store, and WebGL to do massive data visualization.
Description: Much if not most of the world's data has a geographic component. Data visualizations with a geographic component are some of the most popular on the web. This session will explore the principles of data visualization and how you can use HTML5 - particularly WebGL - to supplement Google Maps visualizations.

https://www.youtube.com/watch?feature=player_embedded&v=aZJnI6hxr-c
As a bit of background, Brendan leverages a number of technologies that you might not be familiar with, including three.js and WebGL. Three.js is a nice wrapper for WebGL (among other things) and can greatly simplify the process of getting up and running with 3D in the browser. From the excellent tutorial here:
I have used Three.js for some of my experiments, and it does a really great job of abstracting away the headaches of getting going with 3D in the browser. With it you can create cameras, objects, lights, materials and more, and you have a choice of renderer, which means you can decide if you want your scene to be drawn using HTML 5's canvas, WebGL or SVG. And since it's open source you could even get involved with the project. But right now I'll focus on what I've learned by playing with it as an engine, and talk you through some of the basics.
WebGL is one mechanism for rendering three-dimensional data in the browser and is based on OpenGL ES 2.0. Wikipedia describes it as a JavaScript API for rendering interactive 3D graphics within any compatible web browser without the use of plug-ins.
Infographics are an incredibly popular way to visualize data and a valuable marketing tool. If you need proof of that statement, check out this infographic on infographics. Infographics garner orders of magnitude more social shares than regular blog posts. When I set out to create an infographic for Feastie Analytics on "How to Make Your Recipe go Viral on Pinterest", I took a look at several online tools, including visual.ly, infogr.am, Piktochart, and easel.ly. My conclusion: it's exhausting to figure out which one to use, and then to learn it. I tried a few and wasn't blown away by the user interfaces (too much to learn) or the results (the chart designs didn't thrill me). They all offer themes to work with, but not many -- one offered only about six themes unless I paid a monthly subscription fee, and it doesn't make sense to pay a monthly subscription for something I use only occasionally. Plus, working from a limited set of themes that dozens of other infographics are already based on, with the tool's logo at the bottom, could dilute my branding and thus my marketing effect. On top of all that, I was afraid I would get started, invest the time, and then realize I couldn't finish because an essential feature was missing.
Then I realized I already had the perfect tool for creating an infographic. It has a drag and drop interface that I already know, the ability to add any icons or photos, basic drawing tools and visual effects, alignment tools, and most importantly, chart building tools all built in. It's Keynote. Keynote may not be intended to create the types of infographics that are popular on the web, but when you think about it, a slide presentation is a lot like an infographic. By creating a one slide presentation with custom dimensions, you can easily use it to create great infographics for the web. Here are the steps I used to build mine:
- Set a custom slide size. From the inspector, go to the 'Document' tab and under 'Slide Size', choose 'Custom Size', then enter your dimensions. I made mine 640px wide because that's the width of my blog post template and 4000px long because that's the maximum length allowed by Keynote. Note that if you change the dimensions after adding objects to the slide, Keynote will rearrange your stuff in a failed attempt to be helpful. I recommend using the maximum length allowed. You can always crop any excess out later.
- Design a "theme". Visual style and pieces of flair are what separate an infographic from just a boring old set of charts. Choose an interesting background and some kind of visual element that separates the sections of the infographic. These visual elements should reflect your subject matter. I used a stock photo of a corkboard as my background. Then I created rectangles using the Keynote shape tool and the picture frame effect to look like sheets of paper with shadows. Finally, I used another stock photo of a pushpin to create the illusion that the papers had been pinned to a pinboard. Cut, paste, repeat.
- Fill in your sections. Import various graphics and use the built-in Keynote chart tools to your heart's content. Customize the look of your charts with colors and textures that fit in nicely with your theme. A note of warning: the funky dimensions of your slide will confuse Keynote, causing it to create awkwardly sized charts. It helps to have another Keynote presentation open as a scratch space where you create your charts before copying them over to your infographic. Another tip: make the type and visuals large so that the infographic is still readable when it's posted on Pinterest. The new width for Pinterest pins is around 240px, so if you chose the same dimensions I did, your infographic will need to work at about 40% of its original size. Don't forget to include a section with your name or the name of your company and the URL of your website.
- Export as an image. From the file menu, choose 'Export', then click on 'Images'. I recommend using a PNG to avoid funky JPEG artifacts.
- Crop it. If you had excess space at the bottom of your slide, you'll need to crop that out using an image editing tool such as GIMP. While you're at it, you may want to create a "title only" image to include above the fold in your blog post.
You're done! Now you have a great infographic to share on your blog and across all social media channels and you didn't have to learn any awkward new tools to create it.
Netflix recently used their own data to drive the creation of the hit series 'House of Cards'. A similar approach can be applied to other forms of media to create content that is highly likely to become popular or even go viral through social media channels.
I examined the data set collected by Feastie Analytics to determine the features of recipes that make them the most likely to go viral on Pinterest. Some of the results are in the infographic below (originally published here). The data set includes 109,000 recipes published after Jan 1, 2011 on over 1200 different food blogs. Each recipe is tagged by its ingredients, meal course, dish title, and publication date. For each recipe, I have a recent total pin count. I also have the dimensions of a representative photo from the original blog post.
The first thing that I examined is the distribution of pins by recipe. What I found is that the distribution of pins by recipe is much like the distribution of wealth in the United States -- the top 1% have orders of magnitude more than the bottom 90%. The top 0.1% has another order of magnitude more than the top 1%! Many of the most pinned recipes are from popular blogs that regularly have highly pinned recipes, but a surprising number are from smaller or newer blogs. A single viral photo can drive hundreds of thousands of new visitors to a site that has never seen that level of traffic before.
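The heavy-tailed shape described above is easy to check once you have pin counts in hand. Here's a minimal sketch of the share-of-total calculation, using randomly generated heavy-tailed data as a stand-in for the real (non-public) Feastie pin counts:

```python
import numpy as np

# Hypothetical pin counts drawn from a heavy-tailed (Pareto) distribution --
# an illustrative stand-in for the real data set of 109,000 recipes.
rng = np.random.default_rng(0)
pins = rng.pareto(1.2, size=109_000) * 10

pins_sorted = np.sort(pins)[::-1]  # most-pinned recipes first
total = pins_sorted.sum()
top1 = pins_sorted[: len(pins) // 100].sum() / total      # share held by top 1%
bottom90 = pins_sorted[len(pins) // 10 :].sum() / total   # share held by bottom 90%

print(f"top 1% share:     {top1:.0%}")
print(f"bottom 90% share: {bottom90:.0%}")
```

With a tail this heavy, the top 1% of recipes ends up with a larger share of all pins than the bottom 90% combined, mirroring the wealth-distribution analogy.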
For the purposes of this analysis, I defined "going viral" as reaching the top 5% of recipes -- having a pin count over 2964 pins. Then, I calculated how much more (or less) likely a recipe is to go viral depending on its meal course, keywords in the dish title, ingredients, day of the week, and the aspect ratio of the photo.
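The "how much more likely" comparison boils down to a simple lift ratio: the viral rate within a category divided by the overall viral rate. A toy sketch of that calculation (the recipes, pin counts, and the `lift` helper below are invented for illustration, not the actual analysis code):

```python
# Toy recipe table of (course, pin count) pairs -- made-up data.
recipes = [
    ("dessert", 5000), ("dessert", 100), ("dessert", 40), ("dessert", 3200),
    ("appetizer", 6000), ("appetizer", 3000), ("appetizer", 50),
    ("entree", 200), ("entree", 80), ("entree", 4000),
]
VIRAL = 2964  # the top-5% threshold from the analysis

def lift(course):
    """How much more (or less) likely a course is to go viral than average."""
    subset = [p for c, p in recipes if c == course]
    p_course = sum(p > VIRAL for p in subset) / len(subset)
    p_overall = sum(p > VIRAL for _, p in recipes) / len(recipes)
    return p_course / p_overall

print(lift("appetizer"))  # > 1 means more likely than the average recipe
```

The same ratio applies unchanged to title keywords, ingredients, publication day, or photo aspect ratio.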
Some of the results are surprising and some are expected. Many people would expect desserts to be the most likely to go viral on Pinterest. In reality, desserts are published most often but are not the most likely to go viral. Appetizers have the best probability of going viral, perhaps because they are published less frequently yet are in relatively high demand. The popularity of cheese, chocolate, and other sweets among the top dishes and ingredients is not surprising. What is somewhat surprising is the strong showing of healthier ingredients such as quinoa, spinach, and black beans. The fact that Sunday is the second-best day to publish is also surprising, as most publishers avoid weekends. However, traffic to recipe sites spikes on Sundays, so it makes sense that recipes published then have an advantage. Finally, it's no surprise that images with tall orientations are more likely to go viral on Pinterest, considering how much more space the Pinterest design gives them. But now we can put a number on just how much of an advantage portrait-oriented photos have: they are approximately twice as likely to go viral as the average photo.
Hungry yet? What other forms of content would you like to see this approach applied to?
Check back tomorrow for a tutorial on how to create an infographic with Keynote.
Clustering is about recognizing associations between data points, which we can easily visualize using different force-directed graph layouts (Fruchterman-Reingold, circle, etc.). Exploring data is about understanding how different associations change the overall structure of the data corpus. With hundreds of data fields and no specific rules on how fields may or may not be related, it is up to the user to declare an association and verify their instincts through the resulting data viz. As data scientists, we are often expected to have the answer, so when we present our work the audience may not be willing to question it. This is where the value of RStudio Shiny becomes clear. Just as Salman Khan, of Khan Academy, recognized that his nephews would rather listen to his lectures on YouTube, where they are free to rewind and fast-forward without being rude, our audiences may want to experiment with the data associations and the overall structure themselves. Shiny lets data scientists make the clustering process interactive -- an alternative to boring PowerPoint presentations that frees the audience to ask their questions. People remember better when they are part of the process, and our ultimate goal is to make an impression.
Data Science DC had an event a while back on clustering where Dr. Abhijit Dasgupta presented on unsupervised clustering. The approaches outlined in the good doctor's presentation derive structure from rules relating the data points in the set. However, we can also let the user declare or repudiate associations: marking data fields as associative, using them to filter associations between other fields, or excluding them entirely.
This is important because when looking for patterns in the data, if we compare everything to everything else we may get the proverbial 'hairball' cluster, where everything is mutually connected. This is useless if we're trying to find structure for a decision algorithm, where separation and distinction are key.
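One simple way to avoid the hairball is to prune weak associations before layout, keeping only edges whose weight clears a user-declared threshold. A sketch of the idea in plain Python (the field names, weights, and the `prune`/`components` helpers are all invented for illustration; the post's actual implementation is in R/Shiny):

```python
# Weighted associations between hypothetical data fields.
edges = {
    ("age", "income"): 0.72, ("age", "zipcode"): 0.05,
    ("income", "zipcode"): 0.61, ("height", "weight"): 0.85,
    ("height", "income"): 0.04, ("weight", "age"): 0.03,
}

def prune(edges, threshold):
    """Keep only associations at or above the declared threshold."""
    return {pair: w for pair, w in edges.items() if w >= threshold}

def components(edges):
    """Count connected components among the nodes of the kept edges (union-find)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(n) for n in parent})

print(components(edges))              # every field connected: the hairball
print(components(prune(edges, 0.5)))  # weak edges dropped: distinct clusters emerge
```

Raising or lowering the threshold interactively is exactly the kind of control a Shiny slider can hand to the audience.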
RStudio Shiny gives you the power to easily build interactive cluster-exploration visualizations -- web apps -- in R. Shiny uses reactive functions to pass inputs and outputs between the ui.R and server.R files in the application directory. Programming a new app takes a little getting used to, because ordinary sequential programming in R is different from reactive web programming in R; for instance, assigning a value to the output structure in server.R doesn't necessarily mean it's available to pass to a reactive function a few lines down. To keep things simple, you have to use the right type of reactive function on the server.R side, or div function on the ui.R side, but the structure is simple and the rest of your R coding remains exactly the same. Shiny Server gives you the power to host your web app in the cloud, but be warned that large applications on Amazon EC2 micro instances may run very slowly -- which is understandable and part of Amazon's business model: they want you to upgrade now that you know the potential they offer.