These tweets represent two years of social media messages for and by the DC tech and startup community. Naturally, we thought visualizing this was a task for Data Community DC. I built two word clouds to visualize who and what people tweet about. Here's how I built them:
- Using Python's csv module, I imported the data from a CSV file.
- I scrubbed each tweet/status using regular expressions and the NLTK stopword corpus (I added some of my own stopwords after examining the data).
- I used NLTK to create a frequency distribution of words and two-word phrases in the tweets/statuses.
- I fed this data to a Python word cloud library.
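To make the scrubbing and counting steps concrete, here is a minimal sketch on two made-up tweets. It uses collections.Counter from the standard library in place of NLTK's frequency distribution, and a tiny hand-written stopword set in place of the NLTK corpus. Those substitutions, the sample tweets, and the names scrub, STOPWORDS, and SAMPLE_TWEETS are assumptions for illustration, not the code used for the clouds.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in for the NLTK English stopword corpus.
STOPWORDS = {"the", "a", "is", "at", "to", "rt", "via", "amp"}

# Hypothetical tweets, made up for this example.
SAMPLE_TWEETS = [
    "RT @1776dc: Great meetup tonight! #DCtech http://example.com/abc",
    "Great meetup at @1776dc, awesome talks #dctech",
]


def scrub(tweet):
    """Return the tweet's cleaned words followed by its two-word phrases."""
    # Strip punctuation, then lowercase and split on whitespace.
    cleaned = re.sub(r'[.:;&/#!",\-?(]', "", tweet)
    words = [
        w for w in cleaned.lower().split()
        if not w.startswith("http") and len(w) > 1 and w not in STOPWORDS
    ]
    # Adjacent pairs of the surviving words form the two-word phrases.
    bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return words + bigrams


# Count every word and phrase across all tweets.
counts = Counter()
for tweet in SAMPLE_TWEETS:
    counts.update(scrub(tweet))

print(counts.most_common(5))
```

Because stopwords are dropped before the pairs are built, a phrase can join two words that were not adjacent in the original tweet.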
I also pulled out just the handles to see who gets mentioned the most.
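Pulling out the handles amounts to keeping the single-word entries that start with "@". A minimal sketch, assuming a frequency table of words and two-word phrases; the counts and the helper name top_handles are hypothetical:

```python
from collections import Counter

# Hypothetical frequency table of words and two-word phrases.
freq = Counter({
    "great": 40,
    "meetup": 30,
    "@1776dc": 25,
    "great meetup": 12,
    "@inthecapital": 9,
    "@1776dc great": 4,
})


def top_handles(freq, n):
    """Return the n most frequent single-word entries that start with '@'."""
    handles = [
        (term, count) for term, count in freq.most_common()
        if term.startswith("@") and len(term.split()) == 1
    ]
    return handles[:n]


print(top_handles(freq, 2))  # [('@1776dc', 25), ('@inthecapital', 9)]
```

The single-word check matters because two-word phrases such as "@1776dc great" also start with "@" and would otherwise be counted as handles.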
- #DCtech is "great" and "awesome".
- We love our meetups.
- We love our journalists.
- @1776dc, @corbett3000, and @inthecapital are the cool kids in school.
Here's the code:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import sys, csv, re, nltk
    from nltk.corpus import stopwords
    sys.path.append("./word_cloud")  # local checkout of the word cloud library
    from wordcloud import make_wordcloud
    import numpy as np

    # NLTK English stopwords plus Twitter noise found by examining the data
    SW = set(stopwords.words('english'))
    SW.update(["rt","via","amp","cc","get","us",
               "got","way","mt","10","gt"])

    # Assumption: the tweet text is in the first CSV column; adjust to match your file.
    TEXT_COL = 0

    words = []

    def tokenize(update):
        """Return a tweet's cleaned words followed by its two-word phrases."""
        # strip punctuation, then lowercase and split on whitespace
        update = re.sub(r'[.:;&/#!",\-?(▸]', '', update)
        words = update.lower().split()
        # drop links, single characters, and stopwords
        words = [word for word in words
                 if (word[:4] != "http" and len(word) > 1 and word not in SW)]
        bigrams = [' '.join(words[i:i+2]) for i in range(len(words)-1)]
        return words + bigrams

    with open("tweet_data.csv", newline="", encoding="utf-8") as f:
        for r in csv.reader(f):
            if r:  # skip blank rows
                words.extend(tokenize(r[TEXT_COL]))

    MINCOUNT = 5
    freq = nltk.FreqDist(words)
    ranked = freq.most_common()  # (term, count) pairs, most frequent first

    handles = []
    handlecounts = []
    with open("freq.txt", "w", encoding="utf-8") as h1, \
         open("handles.txt", "w", encoding="utf-8") as h2:
        for k, i in ranked:
            if i > MINCOUNT:
                h1.write(k+" "+str(i)+"\n")
            # a handle is a single token that starts with "@"
            if k[0] == '@' and len(k.split()) == 1:
                h2.write(k+" "+str(i)+"\n")
                handles.append(k)
                handlecounts.append(i)

    # word clouds of the top terms and the top handles
    NUM = 300
    w = 1200
    h = 600
    top = ranked[:NUM]
    make_wordcloud(np.array([k for k, i in top]),
                   np.array([i for k, i in top]),
                   "tweets.png", width=w, height=h)
    make_wordcloud(np.array(handles[:NUM]),
                   np.array(handlecounts[:NUM]),
                   "handles.png", width=w, height=h)