The data lake is still quite new, so it’s natural that a few myths and misunderstandings have proliferated across the data management community. To set the record straight, I’d like to bust five myths that I keep hearing.
First, I need to define the data lake so that we’re all on the same page. A data lake is a user-defined method for organizing large volumes of highly diverse data. The data of a lake may be deployed on diverse data management platforms, including Hadoop clusters, relational databases, clouds, or a combination of these. Depending on the platform, a data lake may handle diverse data types, ranging from unstructured to semistructured to structured data.
For most organizations, a data lake supports multiple use cases, including broad data exploration, advanced analytics, data warehouse extensions, and data landing and staging. Lakes may also serve specific departments (marketing, supply chain) or industries (healthcare, logistics).
Let’s do some myth busting.
Myth #1: A data lake is a dumping ground.
Busted! Although it’s true that any database can become a dumping ground, this isn’t what successful early adopters are doing with their data lakes.
Lake owners interviewed by TDWI say that a data lake is a balancing act. Some users are allowed to dump, but others are not. For example, data analysts, data scientists, and some power users need to create “sandboxes” of data in their work; they are allowed to bring data into and out of the lake freely as long as they govern themselves. Most other users must petition the lake steward or curator, who vets incoming data.
Myth #2: Data lakes are only for Internet firms.
Busted! Hadoop and the data lake were pioneered by Internet firms, and we owe them our thanks for those innovations. However, TDWI has found data lakes in production in several mainstream industries, including finance, insurance, telco, pharma, and healthcare. As noted above, some lakes serve departmental operations or analytics.
TDWI has also found multiple forms of analytics operating on lake data, including data mining, text mining, clustering, graph analytics, predictive analytics, and natural language processing. Lake-based analytics supports a number of analytics application types, including risk calculations, customer segmentation, and the detection of fraud, security breaches, and insider trading.
Data lakes serve a wide range of enterprises, and each lake is typically multitenant because it serves multiple business units and use cases.
Myth #3: Data lakes require Hadoop.
Busted! A recent TDWI survey showed that over half of data lakes in production (53 percent) run exclusively on Hadoop. However, Hadoop is not required: a few lakes (5 percent) run entirely on relational database management systems (RDBMSs).
Note that a data lake is a logical data architecture (similar to the logical data warehouse) that can be physically distributed across multiple platforms. This explains why roughly a quarter (24 percent) of data lakes are deployed atop a Hadoop cluster that is integrated with one or more RDBMS instances, and each of those in turn may be on a cloud.
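To make the "logical architecture" idea concrete, here is a minimal sketch (all class and dataset names are hypothetical, not a real product's API) of a lake that presents one catalog to users while its datasets physically live on different platforms:

```python
# Sketch of a *logical* data lake spanning multiple physical platforms.
# In a real deployment the stores would be HDFS, an RDBMS, or cloud
# object storage; here they are illustrative in-memory stand-ins.

class PhysicalStore:
    """One physical platform (e.g., a Hadoop cluster or an RDBMS)."""
    def __init__(self, name):
        self.name = name
        self._datasets = {}

    def put(self, dataset, rows):
        self._datasets[dataset] = rows

    def get(self, dataset):
        return self._datasets[dataset]


class LogicalDataLake:
    """A single logical lake whose datasets live on different platforms."""
    def __init__(self):
        self._catalog = {}  # dataset name -> physical store

    def register(self, dataset, store, rows):
        store.put(dataset, rows)
        self._catalog[dataset] = store

    def read(self, dataset):
        # Callers see one lake; the catalog resolves the physical location.
        return self._catalog[dataset].get(dataset)


hadoop = PhysicalStore("hadoop-cluster")
rdbms = PhysicalStore("warehouse-rdbms")

lake = LogicalDataLake()
lake.register("clickstream_raw", hadoop, [{"user": 1, "page": "/home"}])
lake.register("customer_dim", rdbms, [{"id": 1, "segment": "retail"}])

# The caller never names a platform; the lake resolves it.
print(lake.read("customer_dim"))
```

The point of the sketch is the catalog: users address datasets by name, and the lake, not the user, knows which platform holds each one.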
Myth #4: If we build it, they will come.
Busted! As with everything we do in IT or data management, simply implementing a new data lake doesn’t mean that business and technical users will automatically flock to it. They won’t come unless you have a compelling business case; business users tell TDWI that they want to perform data exploration, prep, and visualization with lake data — and they want it in a self-service fashion.
Furthermore, users won’t stay unless you provide data that is governed, high quality, and trusted, plus a plan for controlled expansion of the lake’s data sets. Finally, users won’t succeed without specific training and help from consultants well versed in data lakes.
Myth #5: All data lakes become swamps.
Busted! It’s true that a data lake may deteriorate into a data swamp, which is an undocumented and disorganized data store that is difficult to navigate, use, trust, and leverage. A data swamp results from a lack of data governance, stewardship, or curation as well as a lack of control over incoming data and access to the lake’s data.
With that in mind, the cure for the swamp is good data governance practice. To prevent a swamp, a data steward should curate the data of a lake, and governance policies should define controls and standards for the lake and its data.
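One way to make curation concrete (a sketch only; the required fields are hypothetical, not a prescribed standard) is to have the lake refuse any dataset that arrives without the minimum stewardship metadata:

```python
# Sketch of swamp prevention: ingestion is rejected when incoming data
# lacks the metadata a steward needs. Field names are illustrative only.

REQUIRED_METADATA = {"owner", "description", "source_system", "refresh_cadence"}


class DataSwampError(Exception):
    pass


class CuratedLake:
    def __init__(self):
        self.datasets = {}

    def ingest(self, name, rows, metadata):
        missing = REQUIRED_METADATA - metadata.keys()
        if missing:
            # Undocumented data is what turns a lake into a swamp.
            raise DataSwampError(
                f"{name} rejected; missing metadata: {sorted(missing)}"
            )
        self.datasets[name] = {"rows": rows, "metadata": metadata}


lake = CuratedLake()

# Documented data is accepted.
lake.ingest(
    "web_sessions",
    rows=[{"session": "a1"}],
    metadata={
        "owner": "marketing-analytics",
        "description": "Raw web session events",
        "source_system": "clickstream",
        "refresh_cadence": "daily",
    },
)

# An undocumented dump is turned away at the gate.
try:
    lake.ingest("mystery_dump", rows=[], metadata={"owner": "unknown"})
except DataSwampError as e:
    print(e)
```

The design choice here mirrors the article's advice: governance is enforced at the point of entry, so the dumping-ground scenario of Myth #1 never gets started.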