Data Science MD once again returned to the wonderful Ad.com in Baltimore to discuss Analyzing Twitter: An End-to-End Data Pipeline with two experts from Cloudera.
Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how a key components of relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case to demonstrate a complete big data pipeline.
Walking through each component, Joey described what functionality each technology provided to the solution. Flume is able to pull data from a source and store it into a sink. With a custom Flume source interfacing with the Twitter API, this allows the automatic retrieval of tweets and storage into HDFS using the JSON format.
For query and reporting functionality, Hive (an open source project under the Apache Software Foundation) provides a SQL like interface to create MapReduce jobs to access the data. Hive has a schema on read, supports scalar and complex types, and allows custom serializers and deserializers. However, as Joey warned, it is not the same as accessing a relational database. There is no transaction support, and the queries can take several minutes to hours.
Complex queries can be written to select, group, and perform other calculations on the Twitter data. With a JSON deserializer, the JSON Twitter data stored by Flume can be queried by Hive.
While Hive is powerful, Joey explained it can be slow. A tool like Impala can perform queries up to 100x times faster than Hive. Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs.
Lastly, in order to have everything run automatically and repeatedly, Joey introduced Oozie, which allows workflows to be created and managed. By combining these open source tools of the Hadoop ecosystem, a complete Twitter analysis pipleline can be created to provide efficient retrieval, storage, and querying of tweets.
In conclusion, Joey highlighting the three, part, and series that further covers the Twitter pipeline in detail.
Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C's of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set supervised, and new examples can be categorized. Clustering allows examples to be grouped together with common features, while collaborative filtering allows new candidates to be suggested.
For tonight's presentation, Sean used clustering for the demonstration.
For clustering to work, a vector of features is needed for the algorithm to be able to cluster.
With two features in the vector, it is possible to visualize how the the clustering of items could occur.
Sean then discussed how tweets could be weighted for learning using basic count, inverse document frequencies, and n-grams. To get the data into a format for Mahout to use, a combination of Hive, Hadoop streaming, and the Mahout seq2sparse tool can be used.
For clustering to work, a notion of measurable similarity is needed. Sean discussed how Euclidean distance, cosine similarity, and Jaccard distance can be used.
K-Means clustering is a method of cluster analysis supported by Mahout. Several parameters including number of clusters, distance metric, max iterations, and stopping threshold must be specified when running k-means.
The results of the clustering were not ideal, so Sean next described using canopy clustering.
Even then though, the clusters were not the best, for which Sean concluded that short term casual speech is hard to analyze and that a custom analyzer that does further smoothing and removes certain stop words should be used.
If you would like to read the complete slides, Joey's can be found here, and Sean's are here.