Hadoop as a Data Warehouse Archive

Companies have seen enormous growth in data volume in recent years, both from existing structured data and from new, multi-structured data. Transaction data from online shopping and mobile devices, together with clickstream and social data, now generates more data in a single year than was ever created before. How is a company supposed to track and store all of this data effectively? A traditional data warehouse would have to expand constantly to keep pace, making storage increasingly expensive and time consuming. Businesses have found some relief by using Hadoop to extract and load data into the data warehouse, but as the warehouse fills up, they still have to keep expanding its storage capacity.

Instead, businesses should consider moving that data back into Hadoop, turning Hadoop into a data warehouse archive. There are several advantages to using Hadoop as an archive alongside a traditional data warehouse. Here's a look at a few.
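
Before getting into those advantages, here is a minimal sketch of what the archiving step itself might look like, assuming a warehouse table has already been exported to a local flat file. The NameNode hostname and file paths are hypothetical; the FileSystem calls are standard Hadoop API.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WarehouseArchiver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; point this at your own NameNode.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), conf);

        // A warehouse table exported to a local flat file (path is illustrative).
        Path local = new Path("/exports/orders_2014.csv");
        // Land it under a dated archive directory in HDFS.
        Path archive = new Path("/warehouse-archive/orders/year=2014/orders.csv");

        fs.mkdirs(archive.getParent());
        fs.copyFromLocalFile(local, archive);
        System.out.println("Archived " + local + " to " + archive);
        fs.close();
    }
}
```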

Improved Resilience and Performance

Many of the platforms built around Hadoop have focused on making it more user friendly and have adjusted or added features to help protect data. MapR, for example, removes single points of failure in Hadoop that once made it easy for data to be destroyed or lost. Platforms also often offer data mirroring across clusters to support failover and disaster recovery.

With a good level of data protection and recovery capability, Hadoop platforms become a viable option for the long-term storage of Big Data and other data that would otherwise be archived off the data warehouse.
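
The exact protection mechanisms vary by platform, but as a rough illustration on stock HDFS you could raise the replication factor on archived files so each block keeps an extra copy. The directory path below is hypothetical; setReplication is a standard FileSystem call.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveProtection {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the configuration points at the cluster.
        FileSystem fs = FileSystem.get(new Configuration());
        Path archiveDir = new Path("/warehouse-archive/orders"); // illustrative

        // Keep an extra copy of each block for archived data: replication
        // factor 4 instead of the common default of 3. Top level only here;
        // a real job would recurse into subdirectories.
        for (FileStatus status : fs.listStatus(archiveDir)) {
            if (status.isFile()) {
                fs.setReplication(status.getPath(), (short) 4);
            }
        }
        fs.close();
    }
}
```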

Hadoop also keeps historical data online and accessible, which makes it easier to revisit when new questions come up, and it is dramatically faster and easier than digging through traditional magnetic tapes.
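
As a quick sketch of what "online and accessible" means in practice, here is how archived records might be read back on demand, with no tape restore step. The path is again hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RevisitArchive {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Years-old data is just a file open away (path is illustrative).
        Path old = new Path("/warehouse-archive/orders/year=2009/orders.csv");

        // Stream the file straight from the cluster and peek at a few rows.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(old)))) {
            reader.lines().limit(5).forEach(System.out::println);
        }
        fs.close();
    }
}
```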

Handle More Data for Less Cost

Hadoop’s file system can capture tens of terabytes of data in a day, and it does so at the lowest possible cost thanks to open source economics and commodity hardware. Hadoop can also take on more data simply by adding nodes to the cluster; each new node brings additional storage and compute, so processing keeps pace as volume grows. This is far less expensive than the continuous upgrades a traditional warehouse would require to keep up with such extreme data volumes. On top of that, the tape archives found in traditional data stores can become costly because the data is difficult to retrieve. Not only is the data stored offline, requiring significant time to restore, but the cartridges degrade over time, resulting in costly data loss.
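
One small illustration of that scaling model: the capacity a client sees is the aggregate across all DataNodes, so adding a node grows the archive with no application changes. A sketch using the standard FsStatus call, checking headroom before an archive run:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterCapacityCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Aggregate capacity across all DataNodes; adding a node to the
        // cluster raises these numbers without touching this code.
        FsStatus status = fs.getStatus();
        long tb = 1024L * 1024 * 1024 * 1024;
        System.out.printf("capacity: %d TB, used: %d TB, remaining: %d TB%n",
                status.getCapacity() / tb,
                status.getUsed() / tb,
                status.getRemaining() / tb);
        fs.close();
    }
}
```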

High Availability

Traditional data warehouses have often made it difficult for global businesses to keep all of their data in one place when employees work and log in from locations around the world. Hadoop platforms generally allow remote clients to mount the cluster directly to read and write data. This means that clients and employees work directly against the Hadoop cluster rather than first uploading data to a local or network storage system. In a global business where ETL processing may need to happen several times a day, high availability is essential.
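
How a mount or direct connection works is platform specific (MapR, for instance, exposes the cluster over NFS), but the general pattern looks like the hypothetical sketch below: a remote client writes straight into the cluster rather than staging files locally first. The NameNode address and path are assumptions.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteDirectWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A client anywhere in the world talks straight to the cluster;
        // nothing is staged on local or network storage first.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), conf);

        Path target = new Path("/warehouse-archive/inbound/tokyo-office.csv");
        try (FSDataOutputStream out = fs.create(target)) {
            out.writeBytes("order_id,amount\n1001,250.00\n");
        }
        fs.close();
    }
}
```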

Fewer Tools Needed

Coupled with that increased availability, the ability to access the cluster directly dramatically reduces the number of tools needed to capture data. For example, it reduces the need for log collection tools that may require an agent on every application server. It also eliminates the need to keep up with changing tape formats every couple of years, or risk being unable to restore data stored on obsolete tapes.
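
As a hypothetical example of cutting out a collection agent, an application could write its log lines straight into the cluster. The path layout here is invented for illustration; the write calls are standard Hadoop API.

```java
import java.net.InetAddress;
import java.time.LocalDate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectLogWriter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // One log file per host per day, written straight into the cluster;
        // no collection agent and no tape rotation involved.
        String host = InetAddress.getLocalHost().getHostName();
        Path logFile = new Path("/logs/" + LocalDate.now() + "/" + host + ".log");

        try (FSDataOutputStream out = fs.create(logFile)) {
            out.writeBytes("2015-06-01T12:00:00 INFO application started\n");
        }
        fs.close();
    }
}
```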

Author Bio

Rick Delgado, Freelance Tech Journalist

I've been blessed to have a successful career and have recently taken a step back to pursue my passion for writing. I've started freelance writing, and I love to write about new technologies and how they can help us and our planet.