What is a Data Lake?

According to Pentaho CTO James Dixon, who coined the term data lake: “A data lake is more like a body of water in its natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

In technical terms, a data lake is a central location that holds a large amount of data in its native, raw format, as well as a way to organise large volumes of highly diverse data.

It can store structured, semi-structured, or unstructured data, which means data can be kept in a more flexible format for future use.

Why do we need a Data Lake?

Cost effective

Data lakes can store large amounts of data at relatively low cost, making them an ideal place to keep all historical data. Because of their scalability and architectural simplicity, data lakes offer companies more cost-effective storage than traditional systems.

Accessing data whenever we are ready

Because a data lake stores data in its granular, raw form, we can send it through ETL pipelines later. We can defer queries until we understand the data thoroughly, so we don't end up stripping away critical information.

Getting rid of data silos

Storing data in individual, per-team databases creates data silos. A data lake helps remove these silos by giving every department access to the same historical data, so each can understand customers more deeply from a shared view.

Schema on read

Rather than defining the structure of the data while storing it, we can define the structure of the data while reading it, making it possible for us to read the data whichever way we prefer.
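As a minimal, engine-agnostic sketch of schema-on-read: raw JSON lines are stored untouched, and a schema is applied only at read time, so different consumers can project the same records in different ways (the field names and records here are hypothetical).

```python
import io
import json

# Raw events land in the lake as-is; fields vary between records
# and nothing was validated at write time.
raw_events = io.StringIO(
    '{"user": "alice", "action": "login", "ts": "2023-01-01"}\n'
    '{"user": "bob", "action": "purchase", "amount": 42.5}\n'
)

def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the requested fields,
    filling missing ones with None instead of failing on write."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in schema}

# One consumer's view of the raw data; another could ask for
# ["user", "amount"] from the very same files.
for row in read_with_schema(raw_events, ["user", "action"]):
    print(row)
```

The raw files never change; only the projection chosen at read time does, which is what lets the same lake serve readers who were not anticipated when the data was written.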


Data Lake Usage

When properly architected, data lakes help us to:

  • Power data science and machine learning.
  • Centralise, consolidate, and catalogue our data.
  • Quickly and seamlessly integrate diverse data sources and formats.
  • Democratise our data by offering users self-service tools.

Data Lake vs Data warehouse

A data lake and a data warehouse are similar in their basic purpose and objective, which makes them easy to confuse:

  1. Both are storage repositories that consolidate the various data stores in an organisation.
  2. The objective of both is to create a one-stop data store that will feed into various applications.

However, there are fundamental distinctions between the two that make them suitable for different scenarios.

The Challenges of a Data Lake

The main challenge with data lake architectures is that raw data is stored with no oversight of its contents. We can come across various challenges in the journey of building a data lake.

Data Reliability

Without the proper tools in place, data lakes can suffer from data reliability issues that make it difficult for data scientists and analysts to reason about the data. These issues can stem from difficulty combining batch and streaming data, data corruption and other factors.

Query Performance

As the size of the data in a data lake increases, the performance of traditional query engines degrades. Common bottlenecks include metadata management overhead and improper data partitioning.
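To illustrate why partitioning matters, here is a toy sketch (using plain JSON files in a temporary directory, not a real lake format) of Hive-style date partitioning, where a query for one day only reads files under that day's folder instead of scanning everything:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical mini-lake: records are written under dt=YYYY-MM-DD
# folders so a per-day query touches one directory, not every file.
lake = Path(tempfile.mkdtemp())

records = [
    {"dt": "2023-01-01", "clicks": 10},
    {"dt": "2023-01-01", "clicks": 5},
    {"dt": "2023-01-02", "clicks": 7},
]

for i, rec in enumerate(records):
    part = lake / f"dt={rec['dt']}"      # Hive-style partition folder
    part.mkdir(exist_ok=True)
    (part / f"part-{i}.json").write_text(json.dumps(rec))

def total_clicks(day):
    """Partition pruning: only read files under the matching dt= folder."""
    part = lake / f"dt={day}"
    return sum(json.loads(f.read_text())["clicks"] for f in part.glob("*.json"))

print(total_clicks("2023-01-01"))  # reads 2 files, not 3
```

Real engines apply the same idea at scale: a partition column in the path lets the planner skip whole directories, which is why an unpartitioned lake gets slower as it grows.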

Data Governance

Data lakes are hard to secure and govern due to the lack of visibility and ability to delete or update data. These limitations make it very difficult to meet the requirements of regulatory bodies.

For a data lake to make data usable, it needs defined mechanisms to catalogue and secure data. Without these elements, data cannot be found or trusted, resulting in a “data swamp”. Overcoming the challenges of a data lake requires governance, semantic consistency, and access controls.

The Journey to the Data Lake

During the initial phase of a project, a data lake just seems like a data store where different formats of data arrive from different sources at different frequencies. But if not monitored properly, it can drag us in too deep.

There are a few checkpoints that can be followed to avoid pitfalls, regardless of which cloud platform is being used.

Data Source Identification

This step will be needed for every new type of information to be collected. Before collecting the information, we need to understand it first. Ask questions like:

  • Is the data tracked in log files?
  • Is it coming in batches?
  • Is the data generated in an event stream, meaning each activity is sent separately as it happens in the source application?

After identifying the source, determine what data we actually need, communicate this with the data owners, and establish a plan for obtaining the required data.

Data Ingestion
  • For batch data, set up processes to schedule periodic file transfers or batch data extracts.
  • For event data, set up processes to ingest the events – this might be an event endpoint.
  • For log data, determine how long it will be available.
  • Set up the storage location, for example, an AWS account with S3 buckets, to serve as the data lake. 
  • Consider how to deal with production/dev/test environments for source and lake environments.
  • Consider other groups/departments that may be impacted by any new processes established and communicate the changes proactively.
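The batch-transfer step above can be sketched as a small landing process. This is an illustrative stdlib-only example (the `dev` environment, `orders` source name, and directory layout are all assumptions, not a prescribed convention) that copies a batch extract into a date-stamped landing zone, separated by environment:

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical batch-ingestion step: land extracted files under
# <lake>/<env>/<source>/ingest_date=YYYY-MM-DD/ so loads are
# repeatable and dev/prod data never mix.
incoming = Path(tempfile.mkdtemp())   # stands in for the extract drop-off
lake_root = Path(tempfile.mkdtemp())  # stands in for e.g. an S3 bucket

(incoming / "orders.csv").write_text("id,amount\n1,9.99\n")

def land_batch(src_dir, env="dev", source="orders"):
    """Copy every CSV in the batch into a date-stamped landing folder."""
    target = lake_root / env / source / f"ingest_date={date.today().isoformat()}"
    target.mkdir(parents=True, exist_ok=True)
    for f in src_dir.glob("*.csv"):
        shutil.copy2(f, target / f.name)
    return target

landed = land_batch(incoming)
print(sorted(p.name for p in landed.iterdir()))
```

On a real platform the copy would be an S3 upload or equivalent, but the same layout decisions (environment prefix, source name, ingestion date) apply.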

Data Cleanup

Combine data in more meaningful ways to serve upstream reporting/dashboard queries.

  • Identify and locate common identifiers across the incoming data records.
  • Identify mappings between similar but differently named data fields and define logic for any transformations.
  • Manufacture a global set of identifiers to unify the data across systems.
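The field-mapping and global-identifier steps can be sketched as follows. The source systems (`crm`, `billing`), their field names, and the choice of a hashed e-mail as the global key are all hypothetical; the point is renaming source-specific fields to canonical ones and deriving one stable identifier across systems:

```python
import hashlib

# Hypothetical mapping between similar but differently named fields
# coming from two source systems.
FIELD_MAP = {
    "crm":     {"cust_email": "email", "full_name": "name"},
    "billing": {"emailAddress": "email", "customerName": "name"},
}

def normalise(record, source):
    """Rename source-specific fields to the lake's canonical names."""
    mapping = FIELD_MAP[source]
    return {mapping.get(k, k): v for k, v in record.items()}

def global_id(record):
    """Manufacture a stable global identifier from the canonical email."""
    return hashlib.sha1(record["email"].lower().encode()).hexdigest()[:12]

a = normalise({"cust_email": "A@x.com", "full_name": "Ann"}, "crm")
b = normalise({"emailAddress": "a@x.com", "customerName": "Ann"}, "billing")
print(global_id(a) == global_id(b))  # the same customer unifies across systems
```

In practice the unification key might come from a master-data service rather than a hash, but the mapping-table pattern is the same.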

Data Staging

The most important thing to note is that, from the same data lake, different data “marts” can be positioned to serve a variety of upstream use cases.

  • Consider the types of queries that will be needed for the data. This may involve working with different departments.
  • Set up table layouts in the data lake.
  • It may be useful to aggregate metrics at logical boundaries, like usage stats daily, weekly or monthly.
  • For performance reasons, it may be useful to store the same data in different formats based on how it may need to be commonly displayed or accessed.
  • Build out a library of queries that will be useful for dashboards and reports but that could also be used for ad hoc queries.
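The daily-boundary aggregation mentioned above can be sketched like this (the event shape and metrics are made up for illustration): raw usage events are rolled up to one row per day so dashboards hit a small pre-aggregated table instead of the raw records.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw usage events, aggregated at a daily boundary
# to serve dashboard queries cheaply.
events = [
    {"ts": "2023-01-01T09:00:00", "user": "alice"},
    {"ts": "2023-01-01T17:30:00", "user": "bob"},
    {"ts": "2023-01-02T08:15:00", "user": "alice"},
]

def daily_usage(events):
    """Roll raw events up to one row per day: event count and unique users."""
    buckets = defaultdict(lambda: {"events": 0, "users": set()})
    for e in events:
        day = datetime.fromisoformat(e["ts"]).date().isoformat()
        buckets[day]["events"] += 1
        buckets[day]["users"].add(e["user"])
    return {d: {"events": v["events"], "unique_users": len(v["users"])}
            for d, v in sorted(buckets.items())}

print(daily_usage(events))
```

Weekly or monthly rollups follow the same pattern with a coarser bucket key.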

Data Visualisation

Once the data is staged, it can be accessed in various ways by multiple front end business intelligence (BI) tools.

  • For commonly used BI tools such as Tableau, consider setting up workbooks with pre-populated queries or table definitions.
  • Establish a BI environment for ad hoc queries.
  • Evaluate whether queries and visualisations can be stored in Jupyter notebooks, Databricks notebooks, or Google Colab, which enable sharing and reuse. Connect with data science professionals to prototype and validate algorithms.
  • Work with key stakeholders to put together some preliminary dashboards, then run them through user testing to ensure that the views of the data are understandable and useful.
  • Maintain regular communication with the user community in order to determine new requirements for new or extended data sources.
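One common way to give BI tools a pre-populated query, as suggested above, is a database view over a staged table. A minimal sketch using an in-memory SQLite database (the `usage` table and `daily_events` view names are invented for the example):

```python
import sqlite3

# Hypothetical staged table exposed to BI tools through a
# pre-populated view, so dashboard authors don't have to
# rewrite the aggregation every time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (day TEXT, user TEXT, events INTEGER)")
conn.executemany("INSERT INTO usage VALUES (?, ?, ?)", [
    ("2023-01-01", "alice", 3),
    ("2023-01-01", "bob", 1),
    ("2023-01-02", "alice", 2),
])

# The "library query" a dashboard or ad hoc user would point at.
conn.execute("""
    CREATE VIEW daily_events AS
    SELECT day, SUM(events) AS total_events, COUNT(DISTINCT user) AS users
    FROM usage GROUP BY day ORDER BY day
""")

for row in conn.execute("SELECT * FROM daily_events"):
    print(row)
```

A warehouse or lakehouse engine would host the same view centrally, so every BI tool sees one consistent definition of the metric.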

Tools and Technologies