#DCtech Tweets Visualized in 60 Minutes

[Word cloud: tweets with the hashtag #dctech]

Yesterday, Peter Corbett of iStrategyLabs posted a data set of 65,000 tweets and Facebook statuses tagged #dctech and challenged the community to visualize it.

These tweets represent two years of social media messages for and by the DC tech and startup community. Naturally, we thought visualizing this was a task for Data Community DC. I built two word clouds to visualize who and what people tweet about. Here's how I built them:

  1. Using Python's built-in csv module, I imported the data from a CSV file.
  2. I scrubbed each tweet/status using regular expressions and the NLTK stopword corpus (I added some stopwords of my own after examining the data).
  3. I used NLTK to build a frequency distribution of the words and two-word phrases in the tweets/statuses (see the worked example after this list).
  4. I fed these frequencies to a Python word cloud library.
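
To make steps 2 and 3 concrete, here's roughly what the scrubbing produces for one (invented) tweet. This is a minimal sketch, assuming the NLTK stopword corpus has already been downloaded:

import re
from nltk.corpus import stopwords

SW = set(stopwords.words('english'))

tweet = "Great #dctech meetup tonight!"
# Strip punctuation, lowercase, and split into words.
cleaned = re.sub(r'[.:;&/#!",?()-]', '', tweet).lower().split()
tokens = [w for w in cleaned
          if len(w) > 1 and w not in SW and not w.startswith("http")]
bigrams = [' '.join(tokens[i:i+2]) for i in range(len(tokens) - 1)]
print(tokens + bigrams)
# ['great', 'dctech', 'meetup', 'tonight',
#  'great dctech', 'dctech meetup', 'meetup tonight']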

I also filtered out just the handles to see who gets mentioned the most.

[Word cloud: who's mentioned the most in #dctech]

Some takeaways:

  1. #DCtech is "great" and "awesome".
  2. We love our meetups.
  3. We love our journalists.
  4. @1776dc, @corbett3000, and @inthecapital are the cool kids in school.

Here's the code:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import sys, csv, re, nltk
from nltk.corpus import stopwords
sys.path.append("./word_cloud")  # local checkout of the word_cloud library
from wordcloud import make_wordcloud
import numpy as np

# English stopwords plus Twitter-specific noise ("rt", "via", etc.)
# found while eyeballing the data.
SW = stopwords.words('english')
SW.extend(["rt", "via", "amp", "cc", "get", "us",
    "got", "way", "mt", "10", "gt"])
words = []

def tokenize(update):
    # Strip punctuation, lowercase, and drop URLs, single characters,
    # and stopwords; return unigrams plus two-word phrases (bigrams).
    update = re.sub(r'\.|:|;|&|/|#|!|"|,|-|\?|\(|\)', '', update)
    words = update.lower().split()
    words = [word for word in words if (word[:4] != "http"
        and len(word) > 1 and word not in SW)]
    bigrams = [' '.join(words[i:i+2])
        for i in range(len(words) - 1)]
    return words + bigrams
    
# Column 5 of the export holds the tweet/status text.
with open("tweet_data.csv", "r") as h:
    csvreader = csv.reader(h)
    for r in csvreader:
        words.extend(tokenize(r[4]))
        
MINCOUNT = 5
freq = nltk.FreqDist(words)
h1 = open("freq.txt", "w")
h2 = open("handles.txt", "w")
handles = []
handlecounts = []

# Write every term above the frequency floor to freq.txt; single
# tokens starting with '@' are Twitter handles and also go to handles.txt.
for k, i in freq.items():
    if i > MINCOUNT:
        h1.write(k + " " + str(i) + "\n")
        if k[0] == '@' and len(k.split()) == 1:
            h2.write(k + " " + str(i) + "\n")
            handles.append(k)
            handlecounts.append(i)

h1.close()
h2.close()

# Word cloud of the top terms. NLTK 2's FreqDist keeps its keys
# sorted by decreasing frequency, so the slice takes the top NUM.
NUM = 300
w = 1200
h = 600
make_wordcloud(np.array(freq.keys()[:NUM]),
    np.array(freq.values()[:NUM]),
    "tweets.png", width=w, height=h)
make_wordcloud(np.array(handles[:NUM]),
    np.array(handlecounts[:NUM]),
    "handles.png", width=w, height=h)