
Cloudera Impala Talk Notes - Near Real Time Big Data Querying for the Masses

At the kickoff event for the Data Science MD meetup, we had the good fortune to hear a veteran big data practitioner, Wayne Wheeles, talk about his experience as an impromptu Cloudera Impala beta tester and evaluator over two weeks. Below is a brief snippet about Impala, followed by Wayne's slides intermingled with my notes and some online research.

So, what is Cloudera Impala? Put simply, it is real-time, SQL-like querying for Hadoop, which usually runs in batch mode. For more detail, take a look at what Cloudera says:

After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.

When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness at Github right now.

Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.) The first beta drop includes support for text files and SequenceFiles; SequenceFiles can be compressed as Snappy, GZIP, and BZIP (with Snappy recommended for maximum performance). Support for additional formats including Avro, RCFile, LZO text files, and Doug Cutting’s Trevni columnar format is planned for the production drop.

To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. The result is order-of-magnitude faster performance than Hive, depending on the type of query and configuration. (See FAQ below for more details.) Note that this performance improvement has been confirmed by several large companies that have tested Impala on real-world workloads for several months now.
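Stepping out of Cloudera's announcement for a moment: to make the "SELECT, JOIN, and aggregate functions" part concrete, here is a minimal sketch of that shape of query. It is my own illustration, not something from the talk or from Cloudera. It runs against Python's built-in sqlite3 purely as a stand-in (it does not connect to Impala or Hive), and the `users` and `page_views` tables and their columns are made up for the example.

```python
import sqlite3

# In-memory SQLite is only a stand-in here; this does NOT talk to Impala.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, invented for illustration.
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, region TEXT);
    CREATE TABLE page_views (user_id INTEGER, bytes_sent INTEGER);
    INSERT INTO users VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO page_views VALUES (1, 100), (1, 250), (2, 400), (3, 50);
""")

# A SELECT with a JOIN and two aggregate functions, grouped by region.
QUERY = """
    SELECT u.region, COUNT(*) AS views, SUM(v.bytes_sent) AS total_bytes
    FROM page_views v
    JOIN users u ON v.user_id = u.user_id
    GROUP BY u.region
    ORDER BY u.region
"""

rows = conn.execute(QUERY).fetchall()
for region, views, total_bytes in rows:
    print(region, views, total_bytes)
```

The point is only the kind of ad hoc question being asked; the appeal of Impala, as described above, is getting answers to queries like this over data already sitting in Hadoop without waiting on a batch job.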

Talk Notes and Slides





First, some background on Wayne. Beyond the bullet points, you can often judge someone's proficiency in a particular skill by how naturally they discuss their work. It was a joy to listen to someone with Wayne's technical fluency present.




It is interesting to note that Impala is not yet an Apache Software Foundation project ...




Wayne next answered the question, "Why Impala?"






I would highly recommend that anyone interested in the so-called "big data" space read the Google Dremel paper mentioned below.





In case you would like more of an overview on Impala, check out Cloudera's slides here.


[Slides 8 and 9]

Wayne mentioned how surprisingly easy it was to simply drop Impala into an existing CDH 4.1 installation and have it just work.


[Slides 10, 11, and 12]


Finally, some impressive initial query benchmarks. Note that all of these benchmarks time only the query itself, not the loading of the data into the cluster.


[Slides 13, 14, and 15]
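The caveat about load time is worth keeping in mind whenever you read query benchmarks. As a rough sketch of the distinction (again my own illustration, using Python's built-in sqlite3 as a stand-in rather than Impala, with a made-up `events` table and a hypothetical `timed` helper), you can time the load step and the query step separately:

```python
import sqlite3
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# In-memory SQLite stand-in; this does NOT measure Impala.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, value INTEGER)")


def load():
    # Load step: insert 100,000 synthetic rows.
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        ((i, i % 10) for i in range(100_000)),
    )
    conn.commit()


def query():
    # Query step: the only part a query-only benchmark would report.
    return conn.execute(
        "SELECT value, COUNT(*) FROM events GROUP BY value ORDER BY value"
    ).fetchall()


_, load_seconds = timed(load)
result, query_seconds = timed(query)

print(f"load:  {load_seconds:.4f} s")
print(f"query: {query_seconds:.4f} s")
```

A benchmark that reports only the second number tells you how fast the engine answers once the data is in place, which is a fair measure of query latency but not of the end-to-end time to get from raw data to an answer.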