The sheer scale of data being captured by the modern enterprise has necessitated a monumental shift in how that data is stored.
From the humble database to the data warehouse, data stores have grown in both scale and complexity to keep pace with the businesses they serve and the data analysis now required to remain competitive. What began as a data stream has swelled into a data river as enterprise businesses harvest reams of data from every conceivable input across every conceivable business function.
To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake.
What’s in a Data Lake?
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.” – James Dixon
Enterprise businesses run on a foundation of tools and functions that produce valuable data, but rarely in a standardized format. While your accounting department uses its preferred billing and invoicing software, your warehouse relies on a completely different inventory management system, and your marketing team favors whichever marketing automation or CRM software it finds most productive. These systems rarely communicate directly with each other, and while they can be cobbled together via integrations to react to business processes or workflows, there is still no standard output format for the data they generate.
Data warehouses do a great job of standardizing data from disparate sources for analysis. The trade-off is that by the time data is loaded into a data warehouse, the decisions about how that data will be used and how it needs to be processed have already been made.
Data lakes, however, are a bigger, dirtier, more unwieldy beast, taking all of the data an enterprise business has access to, whether structured, semi-structured, or unstructured, and storing it in its raw format for later exploration and querying. Remember the data stream/river analogy earlier? Every source of data within your enterprise is a tributary for your data lake, which collects all of your data regardless of form, function, size, or speed. This is particularly useful when capturing event-tracking or IoT data, though the uses of data lakes extend well beyond those scenarios.
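This raw-first, "schema-on-read" pattern can be sketched in a few lines of Python. This is purely illustrative (the `lake`, `ingest`, and `query_json_events` names are invented for the sketch, not any Azure API): payloads land in the lake exactly as received, and structure is applied only when a question is asked of the data.

```python
import json

# Each entry is stored as-is: no cleansing, no upfront schema.
lake = []

def ingest(source, fmt, raw):
    """Store the payload exactly as the tributary produced it."""
    lake.append({"source": source, "format": fmt, "raw": raw})

# Tributaries feed the lake in whatever shape they happen to use.
ingest("crm", "json", '{"customer": "Acme", "event": "signup"}')
ingest("warehouse", "csv", "sku,qty\nA-100,42\nB-200,7")
ingest("iot", "text", "sensor-3 temp=21.5")

def query_json_events(event_name):
    """Apply structure at read time: parse only the JSON payloads."""
    hits = []
    for item in lake:
        if item["format"] == "json":
            record = json.loads(item["raw"])
            if record.get("event") == event_name:
                hits.append(record)
    return hits

print(query_json_events("signup"))  # the CRM record, parsed on demand
```

The contrast with a warehouse is the timing: a warehouse would force all three payloads into one schema at load time, while the lake defers that decision until query time.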
Taking a Dip
Once data has been collected in the lake, organizations can query and analyze it directly, as well as use it as a source for their data warehouse.
Azure Data Lake, for example, includes all the capabilities developers, data scientists, and analysts need to store data of any size, shape, and speed, and to run every type of processing and analytics across platforms and languages. It removes the complexity of ingesting and storing all of your data, makes it faster to get up and running with batch, streaming, and interactive analytics, and works with existing IT investments in identity, management, and security to simplify data management and governance.
Storage, however, is only one component of a data lake; the other is the ability to run analysis on structured, unstructured, relational, and non-relational data to identify areas of opportunity or focus.
Analysis can be performed on data lake contents via Azure’s analytics job service or the HDInsight analytics service.
- Analytics job service: Data lakes are particularly valuable in analytical scenarios where you don’t know what you don’t know. With unfiltered access to raw, pre-transformation data, data scientists, analysts, and machine learning algorithms can process petabytes of data across diverse workload categories such as querying, ETL, analytics, machine learning, machine translation, image processing, and sentiment analysis. Azure’s built-in U-SQL library allows you to write code once and have it automatically parallelized for the scale you need, whether in .NET languages, R, or Python.
- HDInsight: When it comes to big data analysis, the open-source Hadoop framework remains one of the most popular options. With the Microsoft HDInsight platform, open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, HBase, Microsoft ML Server, and more can be applied to your data lake via preconfigured clusters optimized for different big data scenarios.
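Both services express the same underlying pattern: extract fields from raw, untyped records, then aggregate at scale. A minimal stand-in for that pattern in plain Python (a real U-SQL or Spark job would run the same extract-then-aggregate logic in parallel across a cluster; the log lines and names here are invented for illustration):

```python
from collections import Counter

# Raw, untyped log lines as they might sit in the lake.
raw_log = [
    "2024-01-05 login user=ana",
    "2024-01-05 purchase user=ana item=book",
    "2024-01-06 login user=ben",
    "2024-01-06 login user=ana",
]

def extract_event(line):
    """Extract step: pull the event type out of a raw line."""
    return line.split()[1]

# Aggregate step: count events by type, as a GROUP BY would.
event_counts = Counter(extract_event(line) for line in raw_log)
print(event_counts["login"])  # 3
```

The value of the managed services is that this same two-step shape keeps working when "four log lines" becomes petabytes spread across thousands of files.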
Future-Proofing your Data
Data lakes represent a new frontier for businesses. By taking the entire sum of knowledge available to an enterprise and analyzing it in a raw, unfiltered state without expectation, incredible opportunities, insights, and optimizations can be unearthed.
Just like an actual lake, the long-term health of your organizational data lake depends on defending it from pollution: data governance is critical to ensure your data lake doesn’t become a data swamp. Ungoverned or uncatalogued data leaves businesses vulnerable both in terms of data quality (and organizational trust in that data) and in terms of security, regulatory, and compliance risk. At worst, a data lake offers a wealth of data that is impossible to analyze in any meaningful way because of incorrect metadata or cataloging.
For businesses to truly reap the rewards of data lakes, they’ll want a firm internal governance policy used in conjunction with a data catalog (like Azure Data Catalog). A data catalog’s tagging system helps unify data through the creation of a common language covering data and data sets, glossaries, definitions, reports, metrics, dashboards, algorithms, and models. This shared language lets users understand the data in business terms, while also establishing relationships and associations between data sets (once the data reaches the warehousing or relational stage).
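The tagging idea can be sketched in a few lines. This is a hypothetical toy structure, not the Azure Data Catalog API: each registered asset carries business-glossary tags, so users can find data by business term regardless of which system produced it or what format it is in.

```python
# Toy catalog: asset name -> originating source and glossary tags.
catalog = {}

def register(asset, source, tags):
    """Record an asset with its source system and business-term tags."""
    catalog[asset] = {"source": source, "tags": set(tags)}

# Assets from different systems, tagged with shared glossary terms.
register("billing_2024.parquet", "accounting", ["revenue", "invoice"])
register("clickstream_raw.json", "web", ["engagement", "event"])
register("invoices_q1.csv", "accounting", ["revenue", "invoice"])

def find_by_tag(tag):
    """Look assets up by glossary term, not by system-specific name."""
    return sorted(a for a, meta in catalog.items() if tag in meta["tags"])

print(find_by_tag("revenue"))  # both revenue assets, despite formats
```

The point of the common language is visible in the lookup: a `revenue` search surfaces a Parquet file and a CSV file together, something neither source system could do on its own.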
Build your Business Intelligence Infrastructure on a Solid Foundation
By establishing a data lake alongside companion tools that allow for better organization and analysis, like Jet Analytics, your data lake will remain a crystal-clear source of knowledge for your business for many years to come. For more information on organizing your data or running big data workloads effectively, please reach out to our talented team of reporting and analytics experts.