
Keep it Simple: How to Build a Successful Business Intelligence/Data Warehouse Architecture

Is your data warehouse architecture starting to look like a Rube Goldberg machine held together with duct tape instead of the elegant solution enabling data-driven decision making that you envisioned? If your organization is anything like the ones I’ve worked in, then I suspect it might. Many businesses say they recognize that data is an asset, but when it comes to implementing solutions, the focus on providing business value is quickly lost as technical complexities pile up.


How can you recognize if your data warehouse is getting too complicated?

Does it have multiple layers that capture the same data in just a slightly different way? An organization I worked with determined that they needed four database layers (staging, long-term staging, enterprise data warehouse, and data marts) with significant amounts of duplication. The duplication resulted from each layer not having a clear purpose, but even with more clarity of purpose, this architecture makes adding, changing, and maintaining data harder at every turn.
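To make the cost of that duplication concrete, here is a minimal sketch (using Python's sqlite3, with hypothetical table names) of the same record being copied, essentially unchanged, through all four layers:

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One nearly identical table per layer.
for layer in ("staging", "long_term_staging", "edw", "sales_mart"):
    cur.execute(f"CREATE TABLE {layer}_sales (order_id INT, amount REAL)")

cur.execute("INSERT INTO staging_sales VALUES (1001, 250.0)")

# Every additional layer is one more copy to load, reconcile, and maintain.
cur.execute("INSERT INTO long_term_staging_sales SELECT * FROM staging_sales")
cur.execute("INSERT INTO edw_sales SELECT * FROM long_term_staging_sales")
cur.execute("INSERT INTO sales_mart SELECT * FROM edw_sales")

print(cur.execute("SELECT * FROM sales_mart").fetchall())  # [(1001, 250.0)]

Four copies of the same row means four load jobs to schedule, four schemas to keep in sync, and four places to patch every time the source changes.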

Are you using technologies just because you have used them in the past? Or thought they would be cool to try out? An organization I worked with implemented a fantastically simple data warehouse star schema (http://en.wikipedia.org/wiki/Star_schema) with well under 500 GB of data. Unfortunately, they decided to complicate the data warehouse by adding a semantic layer to support a BI tool and an OLAP cube (which in some ways was a second semantic layer to support BI tools). There is nothing wrong with semantic layers or OLAP cubes. In fact, there are many valid reasons to use them. But if you do not have such a valid reason, they become just another piece of the data architecture that requires maintenance.

Has someone asked for data that “should” be easy to get, but instead will take weeks of dedicated effort to pull together? I frequently encounter requests that sound simple, but the number of underlying systems involved and the lack of consistent data integration practices expands the scope exponentially.
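Coming back to the star schema mentioned above: part of what makes it so appealing is that it answers questions directly. Here is a minimal sketch (again in sqlite3, with hypothetical tables and data) of a fact table joined to two dimensions:

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE dim_date    (date_key INT PRIMARY KEY, year INT, month INT);
CREATE TABLE dim_product (product_key INT PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (date_key INT, product_key INT, amount REAL);

INSERT INTO dim_date    VALUES (20240101, 2024, 1);
INSERT INTO dim_product VALUES (1, 'Widgets');
INSERT INTO fact_sales  VALUES (20240101, 1, 99.50);
""")

# Sales by category and month: one plain join, no semantic layer or cube.
rows = cur.execute("""
    SELECT p.category, d.year, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.category, d.year, d.month
""").fetchall()
print(rows)  # [('Widgets', 2024, 1, 99.5)]

If every question your users ask can be answered with a join like this, each extra layer on top is maintenance without benefit.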

Before I bring up too many bad memories of technical complexities taking over a BI/DW project, I want to get into what to do to avoid making things overcomplicated. The most important thing is to find a way that works for your organization to stay focused on business value.

If you find yourself thinking…

“The data is in staging, but we need to transform it into the operational data store, enterprise data warehouse and update 5 data marts before anyone can access the data.”

or

“I am going to try this technology because I want to learn more about it.”

or

“I keep having to pull together the customer data and it takes 2 weeks just to get an approved list of all customers.”

Stop, drop and roll. Oh wait, you’re not technically on fire, so just stopping should do. Take some time to consider how to reset so that the focus is on providing business value. You might try an approach such as the 5 Whys, which was developed for root cause analysis by Sakichi Toyoda at Toyota. It forces reflection on a specific problem and helps you drill down to its root cause. Why not try it out to see if you can find the root cause of complexity in a BI/DW project? It might just help you reduce or eliminate complexity that had no good reason to exist in the first place.
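To see how this plays out, here is what a 5 Whys exercise might look like for the four-layer architecture described earlier (a hypothetical chain, for illustration):

Why does a simple data request take weeks? Because the data has to be loaded through four database layers before anyone can touch it.
Why does it pass through four layers? Because each layer captures the same data in a slightly different way.
Why does each layer reshape the same data? Because no layer has a clearly defined purpose.
Why does no layer have a clear purpose? Because each one was added to solve a one-off technical problem.
Why was it added that way? Because nobody asked what business value the new layer would provide.

The root cause isn’t the technology at all; it’s the habit of making architectural decisions without asking about business value.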

Another suggestion is to identify areas of complexity from a technical perspective, but don’t stop there. The crucial next step is to determine how the complex technical environment impacts business users. For example, say a technical team identifies two complex ETL processes, one loading sales data and one loading HR data. Both include one-off logic and processes that make it difficult to discern what is going on, so it takes hours to troubleshoot issues that arise. In addition, the performance of both ETL processes has significantly degraded. The business users don’t really care about any of that, but they have been complaining more and more about delays in getting the latest sales figures. When you connect the growing complexity to the delays in getting important data, the business users can productively contribute to a discussion of priority and business value. In this case, sales data would take clear precedence over HR data. Both can be added to a backlog, along with any other areas of complexity identified, and addressed in priority order.

Neither of these is a quick fix, but even slowly chipping away at overly complex areas will yield immediate benefits. Each simplification makes understanding, maintaining and extending the existing system easier.

Bio

Sara Handel, Co-founder of DC Business Intelligentsia, Business Intelligence Lead at Excella Consulting (www.excella.com) - I love working on projects that transform data into clear, meaningful information that can help business leaders shape their strategies and make better decisions. My systems engineering education coupled with my expertise in Business Intelligence (BI) allows me to help clients find ways to maximize their systems' data capabilities from requirements through long-term maintenance. I co-founded the DC Business Intelligentsia meetup to foster a community of people interested in contributing to the better use of data.

Hadoop as a Data Warehouse Archive

Recently, companies have seen huge growth in data volume, both from existing structured data and from new, multi-structured data. Transaction data from online shopping and mobile devices, along with clickstream and social data, is creating more data each year than ever before. How is a company supposed to keep track of and store all of this data effectively? A traditional data warehouse would have to expand constantly to keep up with this stream, making storage increasingly expensive and time-consuming. Businesses have found some relief using Hadoop to extract and load the data into the data warehouse, but as the warehouse fills up, they have had to keep expanding its storage capacity.

Instead, businesses should consider moving the data back into Hadoop, turning Hadoop into a data warehouse archive. There are several advantages to using Hadoop as an archive in conjunction with a traditional data warehouse. Here’s a look at a few.

Improved Resilience and Performance

Many of the platforms designed around Hadoop have focused on making it more user friendly and have adjusted or added features to help protect data. MapR, for example, removes single points of failure in Hadoop that could otherwise lead to data being destroyed or lost. Platforms also often offer data mirroring across clusters to support failover and disaster recovery.

With a good level of data protection and recovery capability, Hadoop platforms become a viable option for the long-term storage of Big Data and of data archived out of the data warehouse.
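As a sketch of what archiving into Hadoop might look like in practice (using pyarrow; the cluster host, port, path, and data are all hypothetical, and the client machine needs the Hadoop native libraries available):

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# A cold partition pulled from the warehouse, e.g. sales closed out years ago.
cold_partition = pa.table({
    "order_id": [1001, 1002],
    "amount": [250.0, 99.5],
    "order_date": ["2009-01-15", "2009-01-16"],
})

# Write it to the Hadoop cluster as Parquet; HDFS becomes the archive tier.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)
pq.write_table(cold_partition, "/archive/sales/year=2009/part-0.parquet",
               filesystem=hdfs)

Once the partition lives in the archive, it can be dropped from the warehouse, freeing expensive warehouse storage while the data stays queryable.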

Hadoop also keeps historical data online and accessible, which makes it easier to revisit when new questions come up and is dramatically faster and easier than going through traditional magnetic tapes.
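Revisiting archived data then looks like an ordinary read rather than a tape restore (same hypothetical cluster and path as above):

import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# The 2009 sales are still online; a new question is just a new query.
archived = pq.read_table("/archive/sales/year=2009", filesystem=hdfs)
print(archived.num_rows, "archived rows available immediately")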

Handle More Data for Less Cost

Hadoop’s file system can capture tens of terabytes of data in a day, and it does so at very low cost thanks to open source economics and commodity hardware. Hadoop can also handle growing volumes simply by adding nodes to the cluster, which adds compute power and keeps processing fast. This is much less expensive than the continuous upgrades required to keep a traditional warehouse ahead of the flood of data. On top of that, the tape archives found in traditional data stores can become costly because the data is difficult to retrieve. Not only is the data stored offline, requiring significant time to restore, but the cartridges are prone to degrade over time, resulting in costly losses of data.

High Availability

Traditional data warehouses often made it difficult for global businesses to maintain all of their data in one place, with employees working and logging in from various locations around the world. Hadoop platforms will generally allow direct access to remote clients that want to mount the cluster to read or write data. This means that clients and employees can work directly on the Hadoop cluster rather than first uploading data to a local or network storage system. In a global business where ETL processing may need to happen several times a day, high availability is very important.

Reduce Tools Needed

Coupled with increased availability, the ability to access the cluster directly dramatically reduces the number of tools needed to capture data. For example, it reduces the need for log collection tools that may require agents on every application server. It also eliminates the need to keep up with changing tape formats every couple of years, or risk being unable to restore data because it is stored on obsolete tapes.

Author Bio

Rick Delgado, Freelance Tech Journalist

I've been blessed to have a successful career and have recently taken a step back to pursue my passion for writing. I've started doing freelance writing, and I love to write about new technologies and how they can help us and our planet.