The Shortcut Guide to Large Scale Data Warehousing and Advanced Analytics

by Mark Scott


As data warehouses grow to 10 TB and beyond, the means by which they are designed and operated, and even the way in which analytics are executed, must change in order to maintain high levels of performance. The Shortcut Guide to Large-Scale Data Warehousing and Advanced Analytics explores the rationale for large-scale data warehouses, as well as the architectural alternatives for building them. It addresses the operational challenges of managing and monitoring these large data warehouses, and takes a fresh look at how to extract meaningful information from a data set of this size.


Chapter 1: Exploiting the Power of Large Enterprise Data Warehouses

For professionals that have been working with databases, data warehouses, and reporting applications, the dramatic and inevitable growth of the data that you must manage comes as no surprise. The amount of information that businesses store grows explosively from year to year. In 2007, more than 280 exabytes of data were generated. That's 10% more than anyone predicted. And enterprises are responsible for storing and securing 85% of that data (Source: orage_first_time/). To make strategic use of that information is the challenge faced by information architects and database engineers in every segment of the economy.

As the amount of data becomes increasingly more difficult to manage, the processes by which we store and retrieve that data changes. And, more significantly, the information that we extract from this data has increasingly more impact on the organizations that we serve. This guide is targeted specifically to address the challenge of large data sets. For our purposes, that means data sets with more than 10TB of data. If you have dealt with these large databases, you know how long it can take to load the data, perform backups, query the data, and provide accurate answers in a timely manner to your organization's key decision makers. You may have a system that works fine at the moment, but you can see the trends and you know that you are nearing the limits of your current architecture. This guide can help you lay a foundation that will allow your data to continue to provide the vital insights into your organization that will help the business thrive.

Chapter 2: Architecture for Large Enterprise Data Warehouses

As data warehouses grow, a number of issues begin to crop up. Time to process data becomes quite lengthy. Tables become so large that the time it takes to query them becomes excessive. Support for in-database analytics overtaxes the available server resources. The service level agreements (SLAs) can no longer be met. As the data warehouse grows to 10TB and beyond, conventional relational database management systems often prove inadequate.

To meet this challenge, the architecture of the data warehouse must be carefully considered. The logical, physical, and security approaches to large volumes of business data are very different from those where the databases are smaller and often less comprehensive.

This chapter will consider the ways in which large scale data warehouses and advanced analytic systems can be designed and implemented. Thinking about the logical and physical architecture, we will look at alternatives to meet the unique needs of large scale data volume. We will consider approaches to finding the correct tools to manage the system, and compare and contrast their relative merits. We will also explore the challenges to securing large data sources and managing business continuity.

Chapter 3: Operational Considerations for Large Enterprise Data Warehouses

Large scale data warehouses typically import and manage large amounts of data. Systems that need to import and store millions of rows each day and then need to report back on them have very unique challenges. This chapter will address the system considerations in meeting those challenges. There are four primary challenges on which we will concentrate:

  • Importing and transforming large amounts of data on a daily basis within tight maintenance windows
  • Dealing with the bandwidth and bottleneck issues on storing and retrieving large amounts of data from even larger data stores
  • Optimizing queries against large data sets to provide optimized results
  • Monitoring systems to assure continued operation

Data Import

Data warehouses grow large as organizations add new data sources and as the data sources themselves grow in the amount of data they host. They may come from a variety of sources: ERP, CRM, manufacturing, HR, and other related systems. Often, they have specific high volume systems that become the major daily contributors, such as credit card transactions or cell phone call logs. Identifying the sources and their issues is the first step to dealing with the challenge.

The second issue deals with the extraction transformation and load (ETL) process. This process can be very resource intensive. You must plan to control utilization of resources on both the source systems and within the data warehouse itself. In order to plan this properly, you need to consider the impact on the entire system and ensure that the source systems and the data warehouse can handle the load imposed by the load process without interfering with the users who rely on the source systems and the data warehouse.

Chapter 4: Analytic Considerations for Large Enterprise Data Warehouses

As the amount of data grows in the data warehouse, so does the need to summarize and use that data in a more manageable fashion. For the purposes of this chapter, analytics will be considered the manipulation of data beyond standard relational database reporting. This will include On‐Line Analytical Processing (OLAP), data mining, and the analysis of unstructured data (such as text).

We will explore the building of analytic data structures on a terabyte or petabyte scale. We will look at techniques that utilize the full power of a large scale data warehousing system to draw insight from this information. We will then discuss the processes for making the data available for use within the enterprise.