According to Pentaho CTO James Dixon, who coined the term Data Lake – “A data lake is more like a body of water in its natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
In technical terms, a data lake is a central repository that holds large amounts of data in its native, raw format and provides a way to organise large volumes of highly diverse data.
It can store structured, semi-structured, or unstructured data, which means data can be kept in a flexible format for future use.
Data lakes can store large amounts of data at a relatively low cost, making them an ideal solution for retaining all historical data. Because of their scalability and architectural simplicity, data lakes offer companies more cost-effective storage than traditional systems.
Because a data lake stores data in its granular, raw form, we can send that data through ETL pipelines later. We can hold off on transforming and querying it until we understand the data thoroughly, so we don’t end up stripping away critical information, as in the sketch below.
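To make this concrete, here is a minimal PySpark sketch of the “store raw now, transform later” pattern; the bucket paths, column names, and event fields are illustrative assumptions rather than anything prescribed.

```python
# Minimal sketch of landing raw data first and shaping it later.
# Paths, column names, and the events layout are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-elt-sketch").getOrCreate()

# Raw events land in the lake untouched, in whatever format the source emits.
raw_events = spark.read.json("s3://example-lake/raw/events/")

# Only once the data is understood do we shape it for a downstream use case,
# keeping the raw copy intact so nothing critical is stripped away.
curated = (
    raw_events
    .filter(F.col("event_type").isNotNull())
    .withColumn("event_date", F.to_date("event_timestamp"))
    .select("user_id", "event_type", "event_date")
)

curated.write.mode("overwrite").parquet("s3://example-lake/curated/events/")
```

The raw copy stays in the lake untouched, so a different team can later derive a completely different curated view from the same files.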
If we store data in individual databases, we end up creating many data silos. A data lake helps remove these silos and gives every department access to the same historical data, so each one can understand customers more deeply.
Rather than defining the structure of the data when we store it, we can define the structure when we read it (schema-on-read), which lets us read the same data in whichever way we prefer, as the sketch below illustrates.
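As a hedged illustration of schema-on-read, this sketch reads the same hypothetical raw JSON files twice: once letting Spark infer a structure, and once applying only the fields one consumer cares about, decided at read time.

```python
# Schema-on-read sketch: the same raw files, read two different ways.
# File path and field names are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# One consumer lets Spark infer everything from the raw files...
inferred = spark.read.json("s3://example-lake/raw/orders/")

# ...while another applies only the structure it needs, at read time.
orders_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])
typed_orders = spark.read.schema(orders_schema).json("s3://example-lake/raw/orders/")
```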
When properly architected, a data lake helps us deal with:
A data lake and a data warehouse are similar in their basic purpose and objective, which makes them easy to confuse:
However, there are fundamental distinctions between the two that make them suitable for different scenarios.
The main challenge with data lake architectures is that raw data is stored with no oversight of its contents. Several challenges can arise over the life of a data lake.
Without the proper tools in place, data lakes can suffer from data reliability issues that make it difficult for data scientists and analysts to reason about the data. These issues can stem from difficulty in combining batch and streaming data, data corruption, and other factors.
As the amount of data in a data lake grows, the performance of traditional query engines degrades. Common bottlenecks include metadata management and improper data partitioning; the sketch below shows how partitioning addresses one of them.
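As one illustration of how the partitioning bottleneck is usually tackled, this sketch (with assumed paths and an assumed event_date column) rewrites a curated table partitioned by date, so that date-filtered queries prune whole directories instead of scanning every file.

```python
# Sketch of partitioning a lake table to ease a common query bottleneck.
# Paths and the event_date column are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

events = spark.read.parquet("s3://example-lake/curated/events/")

# Rewrite the table partitioned by date so each day lands in its own directory.
(events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-lake/curated/events_by_date/"))

# A date-filtered query now prunes partitions instead of scanning every file.
jan_events = (spark.read
    .parquet("s3://example-lake/curated/events_by_date/")
    .filter("event_date >= '2023-01-01' AND event_date < '2023-02-01'"))
```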
Data lakes are hard to secure and govern due to the lack of visibility into their contents and the limited ability to delete or update data. These limitations make it very difficult to meet the requirements of regulatory bodies.
For a data lake to make data usable, it needs defined mechanisms to catalog and secure that data. Without these elements, data cannot be found or trusted, and the lake turns into a “data swamp.” Overcoming these challenges requires governance, semantic consistency, and access controls.
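One minimal cataloging mechanism, shown here only as a sketch, is to register curated lake locations as named tables in Spark’s catalog so analysts can discover and query them by name rather than by file path; the table name and location below are assumptions, and a real deployment would use a shared metastore or catalog service instead of a single session’s catalog.

```python
# Sketch: register a curated lake path as a named table in Spark's catalog.
# Table name and location are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS curated_events
    USING parquet
    LOCATION 's3://example-lake/curated/events/'
""")

# Consumers now query by table name; the catalog tracks schema and location.
spark.sql(
    "SELECT event_type, COUNT(*) AS n FROM curated_events GROUP BY event_type"
).show()
```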
During the initial phase of a project, a data lake may seem like nothing more than storage into which data of different formats arrives from different sources at different frequencies. But if it is not monitored properly, it can pull us in too deep.
There are a few checkpoints that can be followed to avoid these pitfalls, regardless of which cloud platform is being used.
This step will be needed for every new type of information we collect. Before collecting the information, we need to understand it first. Ask questions like:
Combine data in more meaningful ways to serve upstream reporting/dashboard queries.
The most important thing to note is that, from the same data lake, different data “marts” can be positioned to serve a variety of upstream use cases, as sketched below.
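A hypothetical sketch of this pattern: two small marts carved out of the same curated orders table, each aggregated for a different dashboard. The table names, paths, and columns are assumptions.

```python
# Sketch of building two "marts" from one curated lake table.
# All names and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-mart-sketch").getOrCreate()

orders = spark.read.parquet("s3://example-lake/curated/orders/")

# Mart 1: daily revenue for a finance dashboard.
daily_revenue = (orders
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_revenue")))
daily_revenue.write.mode("overwrite").parquet("s3://example-lake/marts/daily_revenue/")

# Mart 2: order counts per customer for a marketing dashboard.
customer_orders = (orders
    .groupBy("customer_id")
    .agg(F.count("order_id").alias("order_count")))
customer_orders.write.mode("overwrite").parquet("s3://example-lake/marts/customer_orders/")
```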
Once the data is staged, it can be accessed in various ways by multiple front-end business intelligence (BI) tools.