Event Review: Analyzing Twitter: An End-to-End Data Pipeline

Data Science MD once again returned to Baltimore to discuss Analyzing Twitter: An End-to-End Data Pipeline with two experts from Cloudera.

Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data is a perfect use case for demonstrating a complete big data pipeline.

Walking through each component, Joey described the functionality each technology provides to the solution. Flume pulls data from a source and stores it into a sink. With a custom Flume source interfacing with the Twitter API, tweets can be retrieved automatically and stored in HDFS in JSON format.
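Once Flume lands the tweets, each record is a single JSON object. As a minimal sketch of what those records look like to downstream code (the field names follow Twitter's classic REST API tweet format; the record itself is invented):

```python
import json

# One line of the Flume-delivered HDFS file: a single tweet as a JSON object.
# (Invented record; real tweets carry many more fields.)
raw = """{"text": "Learning #Hadoop",
          "user": {"screen_name": "example_user"},
          "entities": {"hashtags": [{"text": "Hadoop"}]}}"""

tweet = json.loads(raw)
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
```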

For query and reporting functionality, Hive (an open source project under the Apache Software Foundation) provides a SQL-like interface that compiles queries into MapReduce jobs to access the data. Hive uses schema-on-read, supports scalar and complex types, and allows custom serializers and deserializers. However, as Joey warned, it is not the same as accessing a relational database: there is no transaction support, and queries can take anywhere from several minutes to hours.

Complex queries can be written to select, group, and perform other calculations on the Twitter data. With a JSON deserializer, the JSON Twitter data stored by Flume can be queried by Hive.
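To make that concrete, here is the logic of such a query as a Python sketch over JSON-per-line data; in Hive the same aggregation would be a SELECT ... GROUP BY against a table defined with the JSON deserializer:

```python
import json
from collections import Counter

# JSON-per-line tweet data, as Flume would land it in HDFS (invented sample).
lines = [
    '{"entities": {"hashtags": [{"text": "Hadoop"}]}}',
    '{"entities": {"hashtags": [{"text": "Hadoop"}, {"text": "Hive"}]}}',
    '{"entities": {"hashtags": []}}',
]

# Equivalent of: SELECT hashtag, COUNT(*) ... GROUP BY hashtag, sorted by count.
counts = Counter(
    h["text"]
    for line in lines
    for h in json.loads(line)["entities"]["hashtags"]
)
```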

While Hive is powerful, Joey explained, it can also be slow. A tool like Impala can perform queries up to 100x faster than Hive. Impala circumvents MapReduce to directly access the data through a specialized distributed query engine, very similar to those found in commercial parallel RDBMSs.

Lastly, in order to have everything run automatically and repeatedly, Joey introduced Oozie, which allows workflows to be created and managed. By combining these open source tools of the Hadoop ecosystem, a complete Twitter analysis pipeline can be created to provide efficient retrieval, storage, and querying of tweets.
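As a toy illustration of the idea behind such a workflow (Oozie itself defines workflows in XML; the step names and callables below are invented):

```python
# Toy illustration of a linear workflow: each named step is a callable that
# can read the results of earlier steps. (Invented names; Oozie uses XML.)
def run_workflow(steps):
    results = {}
    for name, action in steps:
        results[name] = action(results)
    return results

pipeline = [
    ("ingest", lambda r: ["tweet1", "tweet2"]),           # Flume: pull tweets
    ("store",  lambda r: {"hdfs:/tweets": r["ingest"]}),  # land them in HDFS
    ("report", lambda r: len(r["ingest"])),               # Hive: count tweets
]
summary = run_workflow(pipeline)
```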

In conclusion, Joey highlighted the three-part series that covers the Twitter pipeline in further detail.

Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C's of machine learning: classification, clustering, and collaborative filtering. With classification, a model learns from a labeled training set (supervised learning), and new examples can then be categorized. Clustering allows examples with common features to be grouped together, while collaborative filtering allows new candidates to be suggested.

For tonight's presentation, Sean used clustering for the demonstration.

For clustering to work, each item must be represented as a vector of features for the algorithm to operate on. With only two features in the vector, it is easy to visualize how the clustering of items could occur.

Sean then discussed how tweets could be weighted for learning using basic count, inverse document frequencies, and n-grams. To get the data into a format for Mahout to use, a combination of Hive, Hadoop streaming, and the Mahout seq2sparse tool can be used.
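A hand-rolled sketch of that inverse-document-frequency weighting (Mahout's seq2sparse does this at scale; the toy corpus below is invented):

```python
import math
from collections import Counter

# Toy corpus of tokenized tweets (invented for the sketch).
docs = [
    ["big", "data", "rocks"],
    ["big", "data", "tools"],
    ["cats", "rock"],
]

def tf_idf(doc, corpus):
    """Weight each term by its in-document count times the log of
    (number of documents / number of documents containing the term)."""
    n = len(corpus)
    return {
        term: count * math.log(n / sum(term in d for d in corpus))
        for term, count in Counter(doc).items()
    }

weights = tf_idf(docs[0], docs)
# Rare terms ("rocks") now outweigh common ones ("big", "data").
```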

For clustering to work, a notion of measurable similarity is needed. Sean discussed how Euclidean distance, cosine similarity, and Jaccard distance can be used.
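All three measures are short to write down; a minimal sketch:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def jaccard_distance(a, b):
    """1 minus the overlap of two feature sets (e.g. the tokens in two tweets)."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)
```

For example, `euclidean((0, 0), (3, 4))` is 5.0, and two tweets sharing one of three distinct tokens have a Jaccard distance of 2/3.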

K-means clustering is a method of cluster analysis supported by Mahout. Several parameters, including the number of clusters, the distance metric, the maximum number of iterations, and a stopping threshold, must be specified when running k-means.
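A minimal, from-scratch sketch of the k-means loop using those parameters (Mahout runs the same idea as distributed jobs over the tweet vectors; the sample points are invented):

```python
import random

def kmeans(points, k, max_iters=20, threshold=1e-4):
    """Plain k-means on 2-D points: assign each point to its nearest
    centroid, recompute the centroids, and repeat until they stop moving."""
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i]
               for i, c in enumerate(clusters)]
        shift = max(abs(a - x) + abs(b - y)
                    for (a, b), (x, y) in zip(centroids, new))
        centroids = new
        if shift < threshold:  # stopping threshold: centroids stabilized
            break
    return centroids

random.seed(0)
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = sorted(kmeans(pts, 2))
```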

The results of the clustering were not ideal, so Sean next described using canopy clustering.
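Canopy clustering cheaply pre-partitions the data using a loose and a tight distance threshold, and the resulting canopy centers can then seed k-means. A from-scratch sketch (thresholds, points, and the distance function are invented):

```python
def canopy(points, t1, t2, dist):
    """Canopy clustering with loose threshold t1 and tight threshold t2 (t1 > t2).
    Points within t1 of a center join its canopy; points within t2 are removed
    from consideration as future centers or members."""
    canopies = []
    remaining = list(points)
    while remaining:
        center = remaining[0]
        members = [p for p in remaining if dist(center, p) < t1]
        canopies.append((center, members))
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return canopies

# 1-D demo: two obvious groups.
groups = canopy([0.0, 0.5, 5.0, 5.5], t1=2.0, t2=1.0,
                dist=lambda a, b: abs(a - b))
```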

Even then, the clusters were not the best, from which Sean concluded that short, casual speech is hard to analyze, and that a custom analyzer that performs further smoothing and removes certain stop words should be used.

If you would like to read the complete slides, Joey's can be found here, and Sean's are here.

Cloudera Impala Talk Notes - Near Real Time Big Data Querying for the Masses

At the KickOff event for the Data Science MD meetup, we had the amazingly good fortune to have a veteran big data practitioner, Wayne Wheeles, talk about his experiences as a two-week, impromptu Cloudera Impala beta tester and evaluator. Below is a brief snippet about Impala, followed by Wayne's slides intermingled with my notes and some online research. So, what is Cloudera Impala? To put it simply, it is real-time, SQL-like querying for Hadoop, which usually runs in batch mode. For more detail, take a look at what Cloudera says:

After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.

When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness at Github right now.

Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.) The first beta drop includes support for text files and SequenceFiles; SequenceFiles can be compressed as Snappy, GZIP, and BZIP (with Snappy recommended for maximum performance). Support for additional formats including Avro, RCFile, LZO text files, and Doug Cutting’s Trevni columnar format is planned for the production drop.

To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. The result is order-of-magnitude faster performance than Hive, depending on the type of query and configuration. (See FAQ below for more details.) Note that this performance improvement has been confirmed by several large companies that have tested Impala on real-world workloads for several months now.

Talk Notes and Slides

And first we start off with some background on Wayne. Beyond the bullet points, you can often judge someone's proficiency in a particular skill by how naturally they discuss their work. It was a joy to listen to someone with Wayne's technical fluency present.

It is interesting to note that Impala is not yet an Apache Software Foundation project ...

Wayne next answered the question, "Why Impala?"

I would highly recommend that anyone interested in the so-called "big data" space read the Google Dremel paper mentioned below.

In case you would like more of an overview on Impala, check out Cloudera's slides here.

Wayne mentioned how surprisingly easy it was to simply drop Impala into an existing CDH 4.1 installation and have it just work.

And some impressive initial query benchmarks. Note that all of these benchmarks involve only the query itself and not actually loading the data into the cluster.