What Should The Data Warehouse Become In The Cloud?

By Dan Woods

December 5, 2016

Judging from the momentum from all of the big players, if you have a data warehouse, you will soon be asked if it is time to move to the cloud. This is a big decision. Most companies have been working for many years on their data warehouse and are not going to move it anywhere or change how it works without good reason.

So, the vendors are going to have to make a strong case to justify this move. You will be provided with different reasons depending on who is doing the persuading. Some vendors will suggest you move everything to a new solution in the cloud. Others will suggest you add an outpost in the cloud. Others may suggest you move all or part of your existing data warehouse to the cloud.

As I examined what each of the vendors is doing, and started thinking through what would make sense for the users of the data warehouse, it seemed obvious that we should rethink the role of the data warehouse in the cloud. In other words, let’s toss aside the limits of the past, embrace the power of the future, and ask: What should the data warehouse be in the cloud?

This is a complex question, but based on my research so far, here is my answer. The Platonic ideal of the cloud data warehouse (CDW) should:

Separate Storage and Compute: The cloud is the land of cheap storage and on-demand compute. The CDW should radically separate storing data from the engine that does the computation. This separation allows as much data as possible to be stored and as many different types of engines as needed to be created to process it. It also significantly changes the economics of the data warehouse because you don’t have to build a large on-premises system sized for your peak storage needs.

Create Workload-Specific Engines on Demand: The complexity of a data warehouse is reduced by the cloud’s ability to start up as many different computing engines as needed to handle your workloads. Some of these engines will wake up and stay running, handling on-demand requests or waiting for batch jobs. Others will wake up, process just one workload, and then disappear. The point is that each of these engines is created on separate infrastructure that doesn’t compete with the others. This simplifies the implementation.

Reconstruct the Optimizer Based on the Power of the Cloud: The performance of a data warehouse is determined by the quality of the optimizer that analyzes the SQL queries and determines how they will be executed. In the cloud, you have vast amounts of compute power, memory of various speeds and costs, and massive amounts of low-cost storage. The ideal optimizer of a CDW must be adapted to use these new capabilities. For example, it should be possible to cache a vast amount of query results given the availability of low-cost storage.
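To make the caching example concrete, here is a minimal sketch in Python of a query result cache. It is an illustration under stated assumptions, not how any particular vendor's optimizer works: the `ResultCache` class and the `run_query` callable are hypothetical names, an in-memory dictionary stands in for low-cost cloud storage, and the cache never invalidates entries when the underlying data changes.

```python
import hashlib


class ResultCache:
    """Hypothetical result cache; a dict stands in for cheap cloud storage."""

    def __init__(self, run_query):
        # run_query is an assumed callable that executes SQL on some engine.
        self._run_query = run_query
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(sql):
        # Naive normalization: collapse runs of whitespace so trivially
        # different spellings of the same query share one cache entry.
        # (This would also collapse whitespace inside string literals.)
        normalized = " ".join(sql.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def execute(self, sql):
        key = self._key(sql)
        if key in self._store:
            # Cache hit: return the stored result without touching the engine.
            self.hits += 1
            return self._store[key]
        # Cache miss: run the query once and keep the result for next time.
        self.misses += 1
        result = self._run_query(sql)
        self._store[key] = result
        return result
```

In this sketch, a repeated query is answered from storage rather than recomputed, which is the trade the paragraph describes: spend cheap storage to save compute. A real system would also need to expire cached results when the source tables change.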

Dramatically Simplify Creating and Managing High-Performance Workloads: One characteristic of cloud technology is that it is highly automated. A CDW should not be an airplane dashboard full of knobs and dials that can only be tuned by an expert. These knobs and dials surely must still exist, but they must be controlled by an intelligent system that tunes the compute engine, memory, storage, and network to achieve the desired performance.

Handle All Important Workloads: A CDW should be able to handle massive, long-running batch jobs, smaller batch jobs, and massive numbers of concurrent interactive queries. In a CDW you may start up different computing engines for each of these types of workloads, but all of these workloads should be supported.

Handle Big Data Volume and Variety: In my view, it is far easier to extend a data warehouse to handle various types of unstructured data than it is to make Hadoop into a powerful SQL engine. A CDW should be able to process a large volume of basic unstructured documents and extract structured data from them. The really odd stuff can still happen in Hadoop, but the simpler processing should be able to take place in the CDW.

Execute Queries Across Multiple Repositories: The model of having just one data warehouse is a throwback to a simpler time. In any organization of significant size, there will be a need for multiple repositories. The CDW should participate in federated queries both as a query aggregation point and as a source for queries aggregated by other systems.
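As a rough illustration of the aggregation-point role, the Python sketch below sends the same query to several repositories and merges the partial results. Everything here is an assumption made for the example: in-memory SQLite databases stand in for separate repositories, and the `sales` table, its columns, and the function names are hypothetical.

```python
import sqlite3


def make_repository(rows):
    """Create an in-memory SQLite database as a stand-in for one repository."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def federated_totals(repositories):
    """Push the aggregation down to each repository, then merge the subtotals."""
    totals = {}
    for conn in repositories:
        query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
        for region, subtotal in conn.execute(query):
            # Combine each repository's partial sum at the aggregation point.
            totals[region] = totals.get(region, 0.0) + subtotal
    return totals


warehouse = make_repository([("east", 10.0), ("west", 5.0)])
outpost = make_repository([("east", 2.5)])
print(federated_totals([warehouse, outpost]))  # {'east': 12.5, 'west': 5.0}
```

Each repository does its own grouping and summing, so only small subtotals travel to the aggregation point. Sums merge cleanly this way; other aggregates, such as averages or distinct counts, would need more careful handling.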

Have a Scalable Data Movement and Replication Capability: A CDW should have a strategy for moving data at scale to and from the data warehouse, and for replication and synchronization. This is another key capability needed to support a multi-repository, multi-cloud world.

Include the Whole Product Needed for a Successful Data Warehouse: A data warehouse is just one part of a data supply chain that creates the promised value. A variety of other systems for analytics, master data management, data governance, ETL, and so on all must be part of a complete data warehouse implementation. A CDW should be part of an entire product offering that is assembled from a variety of parts. A CDW should productize the integration between all the needed components, offering choices of components where possible.

There are other dimensions, such as support for streaming data, security, and disaster recovery, that I have not yet analyzed and that should be added to this list. And I’m certain that I will get many suggestions for new dimensions. My goal is to start the conversation about what a CDW should be. Now it’s your move to help expand the vision presented here.


This article was written by Dan Woods from Forbes and was legally licensed through the NewsCred publisher network.