Digital transformations often require real-time business processes driven by data from operational, historical and streaming sources. I’ve noticed that data and analytics leaders often use the terms “data warehouses,” “data lakes” and “data hubs” interchangeably, which can confuse nonspecialists seeking to understand their differences and the roles they play in digitally transformed businesses. Vendors often contribute to this confusion by positioning them as competing approaches, which can lead organizations to deploy the wrong technology for a given use case.
System architects and executives must understand the differences among these architectures, which solutions best fit each use case and how these architectures relate to other data sources — including operational databases, SaaS solutions and streaming data sources — to power the modern digital enterprise. Otherwise, an architecture may be deployed for the wrong tasks, making it difficult or impossible to leverage data across all required systems. Decision makers should also recognize that they can build modern data architectures using proven open-source solutions, enabling digital transformation without large upfront financial commitments.
A data warehouse holds well-defined, structured historical data for the purpose of running fast, repetitive analytical queries. The structured data supports predefined, complex and sometimes long-running queries, commonly using SQL, and is typically used for core business reporting. Data warehouses can be used for dashboards and may support some limited ad hoc queries.
Inserting clean, well-structured data into the data warehouse requires a time-consuming extract, transform and load (ETL) process. While this cleansed and transformed data is considered highly reliable, the time required for ETL means data warehouses are poorly suited to power real-time business processes that depend upon up-to-the-moment operational data.
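As a minimal sketch of the ETL pattern described above (all table names, field names and sample records here are hypothetical), the following extracts raw operational records, cleanses and transforms them, and loads the survivors into a warehouse-style table that can then serve a predefined analytical query:

```python
import sqlite3

# Hypothetical raw records as they might arrive from an operational source.
raw_orders = [
    {"order_id": "1001", "region": " east ", "amount": "250.00"},
    {"order_id": "1002", "region": "WEST",   "amount": "99.50"},
    {"order_id": "1003", "region": "East",   "amount": None},  # fails validation
]

def transform(record):
    """Cleanse one record: normalize casing and whitespace, coerce types."""
    if record["amount"] is None:
        return None  # drop records that fail validation
    return (
        int(record["order_id"]),
        record["region"].strip().lower(),
        float(record["amount"]),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")

# Load only the rows that survive cleansing.
clean_rows = [row for row in (transform(r) for r in raw_orders) if row]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)

# A predefined, repetitive analytical query of the kind a warehouse serves.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
print(totals)  # {'east': 250.0, 'west': 99.5}
```

Even in this toy form, the trade-off is visible: the cleansed data is reliable, but every record must pass through the transform step before it is queryable, which is why warehouse data lags the operational systems.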
Data warehouses can be built using open-source solutions such as Greenplum. However, many companies use Apache Hadoop for their data warehouses, a role I don’t believe it is well suited for, and they run into significant challenges as a result.
A data lake holds structured and unstructured data from multiple sources. Data governance may be weak, and duplicate or conflicting data may be common. Data science teams typically use a data lake to perform exploratory analyses, including data discovery and visualization, as well as machine learning model training. Because the data is unstructured and unfiltered when it enters the data lake, the data used for data science projects typically needs to be cleansed before it is analyzed. Data lakes often hold historical data that no longer resides in operational datastores.
Hadoop, the most common open-source solution for building data lakes, acts as a data source for Apache Spark and other open-source solutions for machine learning and deep learning model training.
A data hub aggregates data from multiple data sources, which may include data warehouses, data lakes, operational datastores, SaaS applications and streaming data sources. The data in the hub is available for use by one or more business applications. Data hubs have been in use for many years in applications such as master data management, which aggregates customer data from multiple systems to identify missing data and correct inconsistencies and inaccuracies across all data sources.
Digital transformations driven by real-time business processes based on combined historical, operational and streaming data often require a special type of data hub called a digital integration hub (DIH). A DIH aggregates defined subsets of data from multiple on-premises and cloud-based systems, including data warehouses, data lakes, on-premises business applications, SaaS applications and streaming data feeds. DIHs built on in-memory data grids (IMDGs), the most common form of DIH, can synchronize changes made to the in-memory data by the connected business applications back to the relevant data sources. IMDGs provide extremely high performance by caching the relevant data in memory and processing queries in parallel. IMDGs are distributed computing solutions offering massive scalability of the in-memory data cache simply by adding nodes to the IMDG cluster.
By caching the relevant data in memory, the DIH provides a high-performance, massively scalable data access layer that can support real-time business processes. The underlying IMDG typically supports a range of APIs, including key-value access and SQL.
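The caching and write-back behavior described above can be sketched in a few lines. This is a simplified toy, not the API of any real IMDG product, and the store names are hypothetical:

```python
class DigitalIntegrationHub:
    """Toy DIH: an in-memory key-value cache over multiple backing stores."""

    def __init__(self, stores):
        self.stores = stores   # store name -> dict standing in for a datastore
        self.cache = {}        # (store, key) -> cached value

    def get(self, store, key):
        """Read through the cache, loading from the backing store on a miss."""
        if (store, key) not in self.cache:
            self.cache[(store, key)] = self.stores[store][key]
        return self.cache[(store, key)]

    def put(self, store, key, value):
        """Write through: update the cache and sync the backing store."""
        self.cache[(store, key)] = value
        self.stores[store][key] = value

# Hypothetical source systems feeding the hub.
crm = {"cust-1": {"name": "Ada"}}
billing = {"cust-1": {"balance": 120.0}}
dih = DigitalIntegrationHub({"crm": crm, "billing": billing})

print(dih.get("crm", "cust-1"))           # loaded from CRM, now cached
dih.put("billing", "cust-1", {"balance": 95.0})
print(billing["cust-1"])                  # {'balance': 95.0} — synced back
```

A production IMDG adds what the toy omits: the cache is partitioned and replicated across cluster nodes, and queries run in parallel against those partitions.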
The most popular open-source in-memory computing platform for DIHs is Apache Ignite. This is what my company’s platform is built on; we also contributed the source code for what is now Apache Ignite to the Apache Software Foundation several years ago. Apache Ignite can be deployed as a high-performance, massively scalable IMDG with a unified API that serves as the DIH cache and enables simple integration of business applications and datastores.
How They Work Together
Financial institutions typically offer several services: core banking, credit cards, mortgages, wealth management, insurance, etc. Each service may have huge amounts of data siloed in operational datastores, data lakes, data warehouses and SaaS applications. Each datastore serves a specific purpose within a given business unit, holding operational data for current business operations and historical data for analytics purposes. However, the institution may benefit from selectively accessing and processing a subset of the data from multiple business units in real time.
A DIH can span all the datastores and aggregate a single customer’s current and historical information to create a real-time, 360-degree view. This 360-degree view can power upselling and cross-selling opportunities across the company’s entire product line through any customer touch point, such as a mobile app or desktop browser, as the customer accesses any of their various accounts. Or the DIH can power a customer’s real-time, single view of all their accounts across all business units.
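A 360-degree view of this kind amounts to pulling one customer’s slice from each silo and merging the results. The business units, field names and customer IDs below are hypothetical:

```python
# Hypothetical per-business-unit datastores, keyed by customer ID.
core_banking = {"cust-42": {"checking_balance": 1850.00}}
credit_cards = {"cust-42": {"card_balance": 430.25, "card_limit": 5000}}
mortgages    = {"cust-99": {"principal": 210000}}  # cust-42 has no mortgage

def customer_360(customer_id, sources):
    """Merge one customer's records from every business unit into one view."""
    view = {"customer_id": customer_id}
    for unit, store in sources.items():
        if customer_id in store:
            view[unit] = store[customer_id]
    return view

sources = {
    "core_banking": core_banking,
    "credit_cards": credit_cards,
    "mortgages": mortgages,
}
view = customer_360("cust-42", sources)
print(view["credit_cards"])  # {'card_balance': 430.25, 'card_limit': 5000}
```

In a real DIH, each lookup would hit the in-memory cache rather than the source system directly, which is what makes serving this view at customer-touchpoint latency practical.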
By understanding the role of these various architectures and how they can be leveraged to create real-time, 360-degree customer views, business leaders can ensure the projects they approve make the most of their data and set the right foundation for their organization’s future. Understanding the potential of proven, open-source solutions to support modern data architectures also enables executives to navigate their digital transformations far more cost-effectively. Third parties offer enterprise-grade support, consulting and, in some cases, commercial versions of these open-source solutions, so executives can be confident they are adequately balancing cost and risk.