Deconstructing the Trends of Strata 2013

Every conference offers a fascinating snapshot of the field that it covers, and Strata 2013 was no exception.

A word cloud made in R from all of the talk titles at Strata 2013.
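For readers curious about what sits underneath a word cloud of talk titles, here is a minimal sketch of the term-frequency step in Python. The original cloud was made in R; this is not that code. The titles and the stop-word list below are made-up placeholders, not the real Strata 2013 schedule, and the function name term_frequencies is hypothetical.

```python
import re
from collections import Counter

# Hypothetical placeholder titles, NOT the actual Strata 2013 schedule.
SAMPLE_TITLES = [
    "Big Data at Scale",
    "Scaling Hadoop for Big Data",
    "The Future of SQL on Hadoop",
]

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "at", "for", "of", "on", "the", "to", "with"}


def term_frequencies(titles):
    """Count how often each non-stop word appears across all titles."""
    counts = Counter()
    for title in titles:
        # Lowercase the title and keep only runs of letters and digits.
        words = re.findall(r"[a-z0-9]+", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


freqs = term_frequencies(SAMPLE_TITLES)

# The most frequent words are the ones a word cloud would draw largest.
for word, n in freqs.most_common(5):
    print(word, n)
```

A word cloud renderer would then scale each word's font size by its count, so the trends discussed below are essentially the largest words in that picture.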

Big Companies Mean Big Business with Big Data

The most obvious large announcement was Intel's release of its own flavor of Hadoop, the concisely titled Intel Distribution of Apache Hadoop software. When a company as massive as Intel announces its own distribution of Apache Hadoop, people and industries take notice.



Per Intel, key features include:

  • "Up to 30x boost in Hadoop performance with optimizations for Intel® Xeon processors, Intel® SSD storage, and Intel® 10GbE networking
  • Data confidentiality without a performance penalty with encryption and decryption in HDFS enhanced by Intel® AES-NI and role-based access control with cell-level granularity in HBase
  • Multi-site scalability and adaptive data replication in HBase and HDFS
  • Up to 3.5x improvement in Hive query performance
  • Support for statistical analysis with R connector
  • Enables graph analytics with Intel® Graph Builder
  • Enterprise-grade support and services from Intel"

Intel releasing software to help sell its hardware is nothing new; just look at the Intel C and C++ compilers. However, Graph Builder looks quite interesting, and R just took another step forward in displacing all previous statistical tools.

Also, two other new Hadoop distributions premiered this week:

  1. Hortonworks Data Platform (HDP) 1.1 for Windows 
  2. EMC/Greenplum Pivotal HD

which join the early incumbents Cloudera and MapR. Even IBM has its own distribution (InfoSphere BigInsights). While I might have missed one or two distributions, the message is clear. If anyone doubted it before, Hadoop has become the standard and the 800 lb gorillas are getting in on the action.

If you want some more details, check out this great post here.


Night of the Living Dead RDB

I agree with Tim O'Brien (and his aptly named talk, The Future of Relational (or Why You Can't Escape SQL)) that SQL isn't going away despite all of the attention that NoSQL datastores are receiving. Just look at Hive and the number of talks focused on speeding up Hive queries. It would appear that too much work has been done with relational databases, too many queries written, and too many reports standardized for SQL to be replaced.

The final nail in the coffin for the idea that SQL is dying was Google's talk on F1: The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business. The talk's abstract reads:

Many of the services that are critical to Google’s ad business have historically been backed by MySQL. We have recently migrated several of these services to F1, a new RDBMS developed at Google. F1 implements rich relational database features, including a strictly enforced schema, a powerful parallel SQL query engine, general transactions, change tracking and notification, and indexing. F1 is built on top of Spanner, a new globally-distributed synchronously-replicated storage system that scales on standard hardware in Google data centers. Spanner supports efficient externally-consistent distributed transactions and includes useful features like non-blocking read-only transactions and multi-versioned snapshot reads.

The strong consistency properties of F1 and Spanner come at the cost of higher write latencies compared to MySQL. Having successfully migrated a rich customer-facing application suite at the heart of Google’s ad business to F1, we will describe how we restructured schema and applications to largely hide this increased latency from external users. The distributed nature of F1 also allows it to scale easily and to support significantly higher throughput for batch workloads than a traditional RDBMS.

With F1, we have built a novel hybrid system that combines the scalability, fault tolerance, transparent sharding, and cost benefits so far available only in “NoSQL” systems with the usability, familiarity, and transactional guarantees expected from an RDBMS.

If Google is doing it, it must be important.


.IO is the new .LY

On the startup front, it would appear that Libya might have run out of domain names, as almost every startup that I saw has jumped from the .LY domain extension (the country code for Libya) to the .IO domain extension (the country code for the British Indian Ocean Territory).


Old Companies + Old Data = New Value (Maybe)

The final trend that I witnessed arose from conversations with individuals from non-internet Fortune 500 companies. These corporations are watching big data happen and don't want to miss out. The ironic part is that many of these companies have been sitting for years, or even decades, on piles of data that potentially contain untold riches. As these data sets have been under lock and key, little innovation has occurred. I truly hope that these companies find ways of not just using the latest and greatest big data tools but also giving their innovations, or maybe even their data sets, back to the community to keep the data tide rising.

I also heard more than once the complaint that all of this generation's smartest minds are trying to optimize click rates on advertisements. This may be true, but I say it is a far better situation than the best and the brightest going to Wall Street or entering law.