
Event Review: Analyzing Twitter: An End-to-End Data Pipeline Recap

Data Science MD once again returned to the wonderful Ad.com in Baltimore to discuss Analyzing Twitter: An End-to-End Data Pipeline with two experts from Cloudera.

Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data is a perfect use case to demonstrate a complete big data pipeline.

Walking through each component, Joey described what functionality each technology provided to the solution. Flume pulls data from a source and delivers it to a sink. A custom Flume source that interfaces with the Twitter API allows tweets to be retrieved automatically and stored in HDFS in JSON format.

For query and reporting functionality, Hive (an open source project under the Apache Software Foundation) provides a SQL-like interface that creates MapReduce jobs to access the data. Hive applies its schema on read, supports scalar and complex types, and allows custom serializers and deserializers. However, as Joey warned, it is not the same as accessing a relational database. There is no transaction support, and queries can take anywhere from several minutes to hours.

Complex queries can be written to select, group, and perform other calculations on the Twitter data. With a JSON deserializer, the JSON Twitter data stored by Flume can be queried by Hive.
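To make the idea of querying raw JSON concrete, here is a minimal Python sketch of what a group-and-count query does conceptually. It is not Hive and does not reflect how Hive executes queries; the sample records and the field names (user.screen_name, text, retweet_count) are assumptions for illustration, loosely modeled on tweet JSON.

```python
import json
from collections import Counter

# Hypothetical newline-delimited JSON, standing in for files Flume wrote to HDFS.
RAW_TWEETS = """\
{"user": {"screen_name": "alice"}, "text": "big data #hadoop", "retweet_count": 3}
{"user": {"screen_name": "bob"}, "text": "hello #hive #hadoop", "retweet_count": 0}
{"user": {"screen_name": "alice"}, "text": "more #hive", "retweet_count": 5}
"""


def read_tweets(raw):
    """Apply the structure at read time: parse each line only when queried."""
    for line in raw.splitlines():
        if line.strip():
            yield json.loads(line)


def tweets_per_user(raw):
    """Roughly: SELECT user.screen_name, COUNT(*) ... GROUP BY user.screen_name."""
    return Counter(t["user"]["screen_name"] for t in read_tweets(raw))


def hashtag_counts(raw):
    """Count hashtags across all tweets, ignoring case."""
    counts = Counter()
    for t in read_tweets(raw):
        counts.update(w.lower() for w in t["text"].split() if w.startswith("#"))
    return counts
```

Calling tweets_per_user(RAW_TWEETS) groups the sample records by author, and hashtag_counts(RAW_TWEETS) tallies the tags. The stored data stays as plain JSON text; structure is imposed only when a query runs, which is the essence of schema on read.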

While Hive is powerful, Joey explained it can be slow. A tool like Impala can perform queries up to 100 times faster than Hive. Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs.

Lastly, in order to have everything run automatically and repeatedly, Joey introduced Oozie, which allows workflows to be created and managed. By combining these open source tools of the Hadoop ecosystem, a complete Twitter analysis pipeline can be created to provide efficient retrieval, storage, and querying of tweets.

In conclusion, Joey highlighted the three-part series that covers the Twitter pipeline in further detail.

Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C's of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can then be categorized. Clustering allows examples to be grouped together by common features, while collaborative filtering allows new candidates to be suggested.

For tonight's presentation, Sean used clustering for the demonstration.

For clustering to work, each item must be represented as a vector of features that the algorithm can compare.

With two features in the vector, it is possible to visualize how the clustering of items could occur.

Sean then discussed how tweets could be weighted for learning using basic count, inverse document frequencies, and n-grams. To get the data into a format for Mahout to use, a combination of Hive, Hadoop streaming, and the Mahout seq2sparse tool can be used.
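The three weighting ideas can be sketched in a few lines of Python. This is a textbook variant written for illustration, not what the Mahout seq2sparse tool computes internally; the function names and the exact IDF formula, log(N / document frequency), are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_counts(tokens, max_n=2):
    """Basic count weighting over unigrams up to max_n-grams."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def tf_idf(docs, max_n=1):
    """Weight each term by count * log(N / number of docs containing it)."""
    counted = [term_counts(doc, max_n) for doc in docs]
    doc_freq = Counter(term for counts in counted for term in counts)
    n_docs = len(docs)
    return [
        {term: c * math.log(n_docs / doc_freq[term]) for term, c in counts.items()}
        for counts in counted
    ]
```

With this formula, a term that appears in every tweet gets a weight of zero, while a term unique to one tweet gets the highest weight. That is the point of inverse document frequency: common words contribute little to telling tweets apart.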

For clustering to work, a notion of measurable similarity is needed. Sean discussed how Euclidean distance, cosine similarity, and Jaccard distance can be used.
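For reference, the three measures can be written out directly. The sketch below uses the standard textbook definitions in plain Python rather than Mahout's own distance classes; returning 0.0 for all-zero vectors and for two empty sets is a convention chosen here, not something from the talk.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """1 minus the share of items two sets have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(a & b) / len(a | b)
```

Euclidean distance is sensitive to vector length, so a long tweet and a short tweet on the same topic can look far apart. Cosine similarity compares direction only, and Jaccard distance ignores counts entirely and looks at which terms are shared.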

K-Means clustering is a method of cluster analysis supported by Mahout. Several parameters including number of clusters, distance metric, max iterations, and stopping threshold must be specified when running k-means.
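To show where each of those parameters enters the algorithm, here is a minimal single-machine k-means sketch in Python. It is not Mahout's distributed implementation; the function and parameter names are assumptions, and using the first k points as starting centers is a simplification made so the example is deterministic.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points, k, distance=euclidean, max_iterations=10, threshold=1e-4):
    """Return (centers, assignments) from the last assignment step."""
    centers = [list(p) for p in points[:k]]  # simplification: first k points
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(k), key=lambda c: distance(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            new_centers.append(mean(members) if members else centers[c])
        # Stopping threshold: quit once no center moved more than this.
        shift = max(distance(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= threshold:
            break
    return centers, assignments
```

For example, k_means([[0, 0], [0, 1], [10, 10], [10, 11]], k=2) separates the two low points from the two high points even though both starting centers sit in the low group. Results generally depend on the starting centers and on the choice of k, which is one reason a first attempt can produce poor clusters.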

The results of the clustering were not ideal, so Sean next described using canopy clustering.
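Canopy clustering is a cheap, single-pass grouping that is commonly used to pick starting centers for k-means. The sketch below follows the commonly described procedure with a loose threshold and a tight threshold; it is an illustration in plain Python rather than Mahout's implementation, and the names t1 and t2 and the choice of always taking the first remaining point as the next center are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clusters(points, t1, t2, distance=euclidean):
    """Return a list of (center, members) canopies. Requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 (loose threshold) must be greater than t2 (tight)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # deterministic here; often chosen at random
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)  # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: stays a candidate
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

Because a point between the two thresholds joins a canopy but stays in the candidate list, canopies can overlap. The number of canopies falls out of the thresholds rather than being specified up front, which is what makes the method useful for seeding k-means.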

Even then, the clusters were not ideal. Sean concluded that short, casual speech is hard to analyze and that a custom analyzer that does further smoothing and removes certain stop words should be used.
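As a rough idea of what such an analyzer might do, here is a small Python sketch. The talk did not specify the steps, so everything here is an assumption: the stop-word list is hypothetical, and the normalization (lowercasing, dropping URLs and mentions, squeezing repeated characters) is only a guess at what "smoothing" could involve for tweets.

```python
import re

# Hypothetical stop-word list; a real one would be tuned to the tweet corpus.
STOP_WORDS = {"a", "an", "the", "is", "to", "rt", "lol", "and", "of", "in"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more of the same character
TOKEN_RE = re.compile(r"[a-z0-9#']+")


def analyze(tweet, stop_words=STOP_WORDS):
    """Turn a raw tweet into a list of normalized tokens."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop @mentions
    text = REPEAT_RE.sub(r"\1\1", text)  # squeeze runs, e.g. "soooo" -> "soo"
    return [t for t in TOKEN_RE.findall(text) if t not in stop_words]
```

The resulting token lists could then feed the weighting step in place of naive whitespace splitting, so that retweet markers, links, and filler words no longer dominate the feature vectors.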

If you would like to read the complete slides, Joey's can be found here, and Sean's are here.

Uncovering Hidden Social Information Generates Quite a Buzz

We are pleased to have community member Greg Toth present this event review. Greg is a consultant and entrepreneur in the Washington DC area. As a consultant, he helps clients design and build large-scale information systems, process and analyze data, and solve business and technical problems. As an entrepreneur, he connects the dots between what’s possible and what’s needed, and brings people together to pursue new business opportunities. Greg is the president of Tricarta Corporation and the CTO of EIC Data Systems, Inc.

The March 2013 meetup of Data Science DC generated quite a buzz! Well over a hundred data scientists and practitioners gathered in Chevy Chase to hear Prof. Jennifer Golbeck from the Univ. of Maryland give a very interesting – and at times somewhat startling – talk about how hidden information can be uncovered from people’s online social media activities.


Prof. Golbeck develops methods for discovering things about people online.  She opened her talk with a brief example of how bees reveal specific information to their hive’s social network through the characteristics of their “waggle dance.”  The figure eight patterns of the waggle dance convey distance and direction to pollen sources and water to the rest of the hive – which is a large social network.

Facebook Information Sharing

From there the discussion turned to how Facebook’s information sharing defaults have evolved from 2005 through 2010.  In 2005, Facebook’s default settings shared a relatively narrow set of your personal data with friends and other Facebook users.  At this point none of your information was – by default – shared with the entire Internet.

In subsequent years the default settings changed each year, sharing more and more information with a wider and wider audience.  By 2009, several pieces of your information were being shared openly with anyone on the Internet unless you had changed the default settings.  By 2010 the default settings were sharing significant amounts of information with a large swath of other people, including people you don’t even know.

The Facebook sharing information Prof. Golbeck described came from Matt McKeon’s work, which can be found here:  http://mattmckeon.com/facebook-privacy/

This ever-increasing amount of shared information has opened up new avenues for people to find out things about you, and many people may be shocked at what's possible.  Prof. Golbeck gave a live demonstration of a web site called Take This Lollipop, using her own Facebook account.  I won’t spoil things by telling you what it does, but suffice to say it was quite startling.  If this piques your interest, check out www.takethislollipop.com

Predicting Personality Traits

From there the discussion shifted to a research project intended to determine whether it's possible to predict people's personality traits by analyzing what they put on social media.  First, a group of research participants were asked to identify their core personality traits by going through a standardized psychological evaluation.  The Big Five factors that they measured are openness, conscientiousness, extraversion, agreeableness, and neuroticism.

Next the research team gathered information from these people’s Facebook and Twitter accounts, including language features (e.g. words they use in posts), personal information, activities and preferences, internal Facebook stats, and other factors.  Tweets were processed in an application called LIWC, which stands for Linguistic Inquiry and Word Count.  LIWC is a text analysis program that examines a piece of text and the individual words it contains, and computes numeric values for positive and negative emotions as well as several other factors.

The data gathered from Twitter and Facebook was fed into a personality prediction algorithm developed by the research team and implemented using the Weka machine learning toolkit.  Predicted personality trait values from the algorithm were compared to the original Big Five assessment results to evaluate how well the prediction model performed.  Overall, the difference between predicted and measured personality traits was roughly 10 to 12% for Facebook (considered very good) and roughly 12 to 18% for Twitter (not quite as good).  The overall conclusion was that yes, it is possible to predict personality traits by analyzing what people put on social media.
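The review does not say exactly how the difference between predicted and measured traits was calculated. One common way to express such a figure is mean absolute error on scores normalized to a 0 to 1 scale, sketched below in Python. The metric choice is an assumption, and the sample scores are invented purely for illustration; they are not data from the study.

```python
def mean_absolute_error(predicted, measured):
    """Average absolute gap between paired predicted and measured scores."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


# Hypothetical normalized scores for one trait across four participants.
measured_openness = [0.80, 0.55, 0.30, 0.65]
predicted_openness = [0.70, 0.60, 0.45, 0.60]

error = mean_absolute_error(predicted_openness, measured_openness)
print(f"mean absolute error: {error:.1%} of the scale")
```

Under this reading, an error of 10% would mean predictions land, on average, within a tenth of the scale of the score a person received on the standardized assessment.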

Predicting Political Preferences

The second research project was about computing political preference in Twitter audiences.  Originally this project started with the intention of looking at the Twitter feeds of news media outlets and trying to predict media bias.  However, the topic of media bias in general was deemed too problematic and controversial and they decided instead to focus on predicting the political preferences of the media audiences.

The objective was to come up with a method for computing the political orientation of people who followed popular news media outlets on Twitter.  To do this, the team computed the political preference of about 1 million Twitter users by finding which Congresspeople they followed on Twitter, and looking at the liberal to conservative ratings of those Congresspeople.  A key assumption was that people's political preferences will, on average, reflect those of the Congresspeople they follow.

From there, the team looked at 20 different Twitter news outlets and identified who followed each one.  The political preferences of each media outlet's followers were composited together to compute an overall audience political preference factor ranging from heavily conservative to heavily liberal at the two extremes, with moderate ranges in the middle.  The results showed that Fox News had the most conservative audience, NPR Morning Edition had the most liberal audience, and Good Morning America was in the middle with a balanced mix of both conservative and liberal followers.  Further details on the results can be found in the paper here.
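The two-step computation described above, scoring each user from the Congresspeople they follow and then compositing scores per outlet, can be sketched as follows. This is a simplified illustration, not the research team's code: the 0 to 1 rating scale, the use of a plain average at both steps, and all account names and ratings are assumptions.

```python
from statistics import mean

# Hypothetical ratings: 0.0 = most conservative, 1.0 = most liberal.
CONGRESS_RATINGS = {"rep_a": 0.9, "rep_b": 0.8, "rep_c": 0.1, "rep_d": 0.2}

# Hypothetical follow graph: user -> set of accounts followed.
FOLLOWS = {
    "user1": {"rep_a", "rep_b", "outlet_x"},
    "user2": {"rep_c", "outlet_x", "outlet_y"},
    "user3": {"rep_c", "rep_d", "outlet_y"},
    "user4": {"outlet_y"},  # follows no Congressperson, so gets no score
}


def user_preference(followed, ratings=CONGRESS_RATINGS):
    """Average rating of the Congresspeople a user follows, or None."""
    scores = [ratings[a] for a in followed if a in ratings]
    return mean(scores) if scores else None


def audience_preference(outlet, follows=FOLLOWS, ratings=CONGRESS_RATINGS):
    """Average preference of an outlet's followers who have a score."""
    scores = []
    for followed in follows.values():
        if outlet in followed:
            score = user_preference(followed, ratings)
            if score is not None:
                scores.append(score)
    return mean(scores) if scores else None
```

In this toy data, audience_preference("outlet_x") comes out more liberal than audience_preference("outlet_y"). Note that followers who follow no Congressperson contribute nothing, so the result describes only the politically engaged part of each audience.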

Summary & Wrap-up

An awful lot of things about you can be figured out by looking at public information in your social media streams.  Personality traits and political preferences are but two examples.  Sometimes this information can be used for beneficial purposes, such as showing you useful recommendations.  Likewise, a future employer could use this kind of information to form opinions during the hiring process.  People don't always think about this (or necessarily even realize what's possible) when they post things to social media.

Overall Prof. Golbeck’s presentation was well received and generated a number of questions and conversations after the talk.  The key takeaway was that “We know who you are and what you are thinking” and that information can be used for a variety of purposes – in most cases without you even being aware.  The situation was summed up pretty well in one of Prof. Golbeck’s opening slides:

I develop methods for discovering things about people online.

I never want anyone to use those methods on me.

-- Jennifer Golbeck

For those who want to delve deeper, several resources are available:

Commentary

Overall I found this presentation to be very worthwhile and thought-provoking.  Prof. Golbeck was an engaging speaker who was both informative and entertaining.  She provided a number of useful references, links and papers for delving deeper into the topics covered.  The venue and logistics were great and there were plenty of opportunities for networking and talking with colleagues both before and after the presentation.

The topic of predicting people's traits and behaviors is very relevant, particularly in the realm of politics. At least one other Data Science DC meetup held within the last few months focused on how data science was used in the last presidential election and the tremendous impact it had. That trend is sure to continue, fueled by research like this coupled with the availability of data, more sophisticated tools, and the right kinds of data scientists to connect the dots and put it all together.

If you have the time, I would recommend listening to the audio recording and following along with the slide deck. There were many more interesting details in the talk than I could cover here.

My personal opinion is that too few people realize the data footprint they leave when using social media. That footprint has a long memory and can be used for many purposes, including purposes that haven't even been invented yet. Many people seem to think either that the data they put on social media is trivial and doesn't reveal anything, or that no one cares and it's just "personal stuff." But as we've seen in this talk, people can discover a lot more than you may think.
