Build Data Lake for Enterprise
Comparison: Data Warehouse and Data Lake
The comparison between a data warehouse and a data lake is as follows:
Figure: The process of data collection, storage, and reporting in Big Data.
| | Data Warehouse | Data Lake |
| --- | --- | --- |
| Data | Structured and transformed | Structured/semi-structured, processed or raw |
| Working | Structure-ingest-analyze | Ingest-analyze-understand |
| Processing | Schema-on-write | Schema-on-read |
| Storage | Expensive when data volumes are high | Built for low-cost storage |
| Flexibility | Fixed configuration; not very flexible | No fixed structure; configure and reconfigure as requirements change |
| Cost/Efficiency | Uses CPU/IO efficiently | Uses storage and processing capabilities efficiently at very low cost |
Data Lake Functions and Tools
The four basic functions of data lakes are ingestion, storage/retention, processing, and access.
| Architecture | Ingestion | Storage/Retention | Processing | Access |
| --- | --- | --- | --- | --- |
| Description | Scalable and extensible to capture both streaming and batch data. Provides the ability to apply business logic, filters, validation, data quality checks, routing, and other business requirements. | Depending on requirements, data is placed in Hadoop HDFS, Hive, HBase, Elasticsearch, or in-memory stores. Metadata management and policy-based data retention are provided. | Processing is provided for both batch and near-real-time use cases. Workflows are provisioned for repeatable data processing, and late-arriving data is handled. | Dashboards and applications that provide valuable business insights. Data is made available to consumers through APIs, MQ feeds, and DB access. |
| Tools (Technology Stack) | Apache Flume, Apache Kafka, Apache Storm, Apache Sqoop, NFS Gateway | HDFS, Hive tables, HBase/MapR DB, Elasticsearch | MapReduce, Hive, Spark, Storm, Drill | Qlik/Tableau/Spotfire, REST APIs, Apache Kafka, JDBC |
Reference: "Architecting Data Lakes" by Alice LaPlante and Ben Sharma, O'Reilly, 2016
Data Lake Elements
So what makes a data lake? A data lake is the combination of four elements:
Design and Performance Considerations
A data lake harnesses more data from various sources quickly, enabling users (data scientists and data analysts) to collaborate and analyze data from various perspectives for faster and better decision making. However, the design and implementation of a data lake require organizations to make infrastructural and process decisions to fully use its key capabilities and act upon the opportunities it provides to generate business value.
To help users overcome the challenges of a data lake and get the most from it, the following key design and performance strategies need to be considered:
Let us look at each of these strategies in detail.
Understand the Need for a Data Lake
To get started with a data lake, an organization must understand more than simply the technology behind it. The main priority must be to determine whether there are adequate use cases to make the implementation yield impactful results.
The core tenet of a data lake is to store data now and analyze later when needed. So it is important for an organization to determine the type of data that they are going to deal with and the type of analytics that need to be performed to make the resources and time involved in implementing and designing a data lake pay off.
To summarize, a data lake may not be a necessity for every organization; there are specific cases and patterns where an organization may choose to opt for one.
When To Use a Data Lake?
Some of the reasons why organizations incorporate a data lake solution include:
Example: Consider a scenario where a healthcare organization generates a huge amount of unstructured data from various sources, and this data drives the organization's ability to deliver successful treatments. The organization hosts applications that are used by multiple clients for various purposes, and these applications generate data at various points in time. It also has standard datasets that must satisfy government regulations and compliance requirements.
In such scenarios, a data lake would be a good option for the following reasons:
When Not To Use a Data Lake?
Some situations where using a data lake is not a good option include:
Example: Consider another scenario where a marketing organization generates a huge amount of structured data from well-defined sources like transactional applications and CRM. They do not have the skillset and modern tools to explore the data. This data is used to generate simple reports such as to determine the number of products purchased or total sales over a period of time.
In this scenario, a data warehouse would be a good fit; this type of use case does not necessitate a data lake, because the data is mostly structured and there is no need for advanced analytics.
Compute
The two main components of a data lake are compute and storage. Both can be located either on-premises or in the cloud. It is up to the organization to decide whether to go with on-premises or cloud deployments, including multicloud and hybrid cloud. Each of these options has its own pros and cons, and the organization needs to evaluate them based on its needs.
The compute system ingests data into the data lake and hosts the applications that process the data to derive insights from it. In a Hadoop-based data lake implementation, processing capacity is provided by relatively inexpensive commodity hardware, which allows parallel processing to access and process data quickly.
The compute capacity required depends on the size of the data and the speed of analysis the organization needs. Analyzing petabytes of data at low cost with fewer resources takes a significant amount of processing time, while analyzing large amounts of data in near real time requires substantial compute capacity and is quite expensive.
Separate compute clusters can be used for different types of processing, such as ingestion, extraction, and analysis, for optimum efficiency. Example: After the data is ingested successfully, the ingestion cluster can be freed. Further, compute capacity requirements may increase during complex analysis and decrease when those analyses are completed. The design should therefore allow compute capacity to be scaled based on demand, as in the sketch below.
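One common way to get this elasticity on a Hadoop/Spark data lake is Spark dynamic allocation, which grows and shrinks the number of executors with the workload. The sketch below assumes a YARN-managed cluster with the external shuffle service enabled; the application name and executor limits are illustrative values, not recommendations.

```python
# Sketch: a Spark session configured for elastic compute, assuming a YARN-based
# Hadoop cluster with the external shuffle service enabled. Values are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("datalake-analysis")
    .config("spark.dynamicAllocation.enabled", "true")      # scale executors with demand
    .config("spark.dynamicAllocation.minExecutors", "2")     # keep a small baseline
    .config("spark.dynamicAllocation.maxExecutors", "50")    # cap cost during heavy analysis
    .config("spark.shuffle.service.enabled", "true")         # required for dynamic allocation on YARN
    .getOrCreate()
)
```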
Storage
Data lakes act as a centralized repository to store both structured data (from sources like transactional applications and operational databases) and unstructured data (from sources like IoT devices, social media, logs, web, and mobile applications). The basic component of any data lake design is physical storage.
Example: Dell EMC PowerScale is a scale-out NAS storage platform that is used for data lakes. It provides an efficient way to store, manage, protect, and analyze growing unstructured data assets. It enables organizations to scale their storage needs separately from their compute needs.
Scalability
Storage should be easily scalable without service interruption to accommodate data that can grow to any size. It should provide the ability to add storage capacity in modular blocks and should be transparent to applications. It is recommended to decouple storage from compute resources to enable independent scaling.
Cost Efficiency
As a data lake involves storing "everything" and all types of data, it is important for organizations to have a low-cost storage solution that meets their requirements and supports storing all types of data in a single repository, allowing faster access for data exploration. It should also support data efficiency features like compression and deduplication at low cost.
Reliability
It is important that data lake storage be robust, with high-availability designs, as it is the primary repository for all data, including an organization's critical data. It should support accessing data through various storage protocols like SQL, JSON, block, file, and object. Example: If an application generates data through NFS, the same data can be accessed for analysis using analytics tools like Hadoop.
Independent of Fixed Schema
As a data lake enables applying a schema on read to data that needs to be analyzed, the underlying storage should also support this approach and should not dictate a fixed schema.
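A minimal sketch of schema-on-read with PySpark follows: raw JSON files sit in the lake as-is, and the consumer applies a schema only at read time. The path and field names are assumptions carried over from the earlier ingestion example, not fixed conventions.

```python
# Sketch of schema-on-read with PySpark: raw JSON files are stored as-is in the
# lake, and a schema is applied only when the data is read for analysis.
# The path and field names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is defined by the consumer at read time, not by the storage layer.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("ts", StringType()),
])

readings = spark.read.schema(schema).json("/datalake/raw/sensor_events/")
readings.createOrReplaceTempView("sensor_events")
spark.sql("SELECT device_id, avg(temperature) FROM sensor_events GROUP BY device_id").show()
```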
Security
Data lake security requirements start with a basic understanding of who should have access to the data lake, to which components of the data lake, and to which portions of the data. To be successful, all of the security mechanisms and strategies for the data lake should be implemented and managed within the framework of the organization's overall security infrastructure and controls. Ensuring the security of the data lake involves four areas.
Access Control
This mechanism focuses on authentication and authorization, such as determining which users have access to a specific data lake component and what roles those users can perform.
Example: When a user performs a search operation, this mechanism determines what data is available for the user to view in the search results. The mechanism also decides whether the user can execute the operation on a specific component and whether the user has permission to modify or delete the data.
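The toy sketch below illustrates the idea of role-based checks on operations. It is purely hypothetical; real data lake deployments typically enforce this through a dedicated authorization layer (for example, Apache Ranger) rather than application code like this.

```python
# Hypothetical, minimal sketch of role-based access checks for data lake operations.
ROLE_PERMISSIONS = {
    "analyst":  {"search": True,  "modify": False, "delete": False},
    "engineer": {"search": True,  "modify": True,  "delete": False},
    "admin":    {"search": True,  "modify": True,  "delete": True},
}

def is_allowed(role: str, operation: str) -> bool:
    """Return True if the given role may perform the operation."""
    return ROLE_PERMISSIONS.get(role, {}).get(operation, False)

print(is_allowed("analyst", "search"))   # True
print(is_allowed("analyst", "delete"))   # False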
Document-level Security
This mechanism is implemented because the data lake contains critical data generated from multiple sources, and only authorized users should be allowed to view or retrieve documents through search results or through analytics.
This type of access control can be applied to all the data or only to confidential data, based on the access permissions associated with the original data at the source. Two important considerations for document-level security include:
Note: Document-level security ensures that users see only those documents to which they have read permission.
Encryption
Since a data lake stores content retrieved from various sources, it is important to protect the data using encryption. It is recommended to apply encryption at the storage level, where the data is at rest, and also over the network, to protect data in transit.
It is also important to consider key management for secure key generation, storage, and distribution of encryption keys.
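As a small illustration of encryption at rest, the sketch below uses the Python "cryptography" package to encrypt a record before it is written to the lake. The payload is made up for illustration, and generating the key inline is only a placeholder for the key management the paragraph above calls for.

```python
# Sketch of symmetric encryption for data at rest using the "cryptography"
# package (pip install cryptography). In practice the key would come from a
# key management service, not be generated inline as it is here.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # placeholder: fetch from a KMS/HSM in production
cipher = Fernet(key)

record = b'{"patient_id": "p-001", "diagnosis": "..."}'    # illustrative payload
encrypted = cipher.encrypt(record)                           # store this in the lake
decrypted = cipher.decrypt(encrypted)                        # authorized read path
assert decrypted == record
```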
Network-level Security
The network layer needs to be secured to prevent access to the data lake from an inappropriate access path. This becomes a critical requirement for cloud-based deployments.
Consider implementing network isolation techniques such as a virtual private network, along with other mechanisms such as firewalls and a defense-in-depth strategy.
File Formats
A data lake offers control over exactly how the data is stored. Data lakes involve an array of elements such as file size, storage type (row or columnar), degree of compression, indexing, schemas, and block sizes. These elements determine how efficiently the data stored in the data lake can be accessed.
File formats supported by a data lake include CSV, JSON, Avro, ORC, and Parquet. A prominent file format designed for Hadoop workloads is Optimized Row Columnar (ORC), which mixes row and columnar layouts: data is organized into groups of rows, and within each group the values are stored in columnar format. This arrangement makes it possible to read, decompress, and process only the values that a query requires, and the data can also be split for parallel operations.
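The sketch below shows one way this plays out in PySpark: raw CSV landed in the lake is rewritten as ORC and Parquet so that later queries touch only the columns they need. The paths, the compression codec, and the column names ("product", "amount") are assumptions for illustration.

```python
# Sketch: converting raw CSV landed in the lake into ORC and Parquet so that
# analytical queries read only the columns they need. Paths and columns are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-formats").getOrCreate()

raw = spark.read.option("header", "true").csv("/datalake/raw/sales.csv")

# Columnar formats with compression for efficient scans.
raw.write.mode("overwrite").orc("/datalake/curated/sales_orc")
raw.write.mode("overwrite").option("compression", "snappy").parquet("/datalake/curated/sales_parquet")

# Only the referenced columns are read and decompressed.
spark.read.orc("/datalake/curated/sales_orc").select("product", "amount").show(5)
```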
Governance
Data governance deals with the management of an organization's data, including the availability, usability, integrity, and security of that data. The main goal of data governance is to ensure that high-quality data is available throughout the data life cycle. It provides systematic structure and management of data in the data lake, making the data meaningful and improving its accessibility.
Governance practices for a data lake include having a set of controls to enforce requirements such as:
Another important consideration that falls under data lake governance is enabling metadata cataloguing.
Metadata Cataloguing
Because a data lake stores a huge amount of raw data with no oversight of the contents, it needs a defined mechanism to catalogue, search, and secure data in order to be effectively usable. The business should know why the data is present in the data lake, who created it, who is using it, and what it contains. Without this, a data lake becomes a data swamp, where the stored data is so disorganized that it cannot be analyzed to gain insights.
Metadata, which is information about data, helps to search, identify, and learn about the datasets stored in a data lake. It provides the description and context of the data. Examples of metadata include author, file size, date created, and so on. A data lake design should incorporate a separate storage layer to store cataloguing metadata.
Some of the principles that can be used to ensure that appropriate metadata is created and maintained are:
The data catalogue provides an interface for users to search all the data assets stored in the data lake. It acts as a single source of truth and provides a single point of access to those assets. The catalogue enables users to find assets quickly and presents the data along with its metadata, helping them determine whether the data they have found is useful.
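To make the catalogue idea concrete, here is a hypothetical, minimal sketch of a catalogue entry and a search over it. Real data lakes would use a dedicated catalogue service; the dataset name, location, and fields are invented for illustration.

```python
# Hypothetical, minimal sketch of a metadata catalogue entry and search.
from datetime import date

catalogue = [
    {
        "dataset": "sensor_events",
        "location": "/datalake/raw/sensor_events/",
        "owner": "iot-team",
        "created": date(2016, 1, 1),
        "description": "Raw device telemetry ingested from Kafka",
        "tags": ["iot", "raw", "json"],
    },
]

def search_catalogue(term: str):
    """Return datasets whose description or tags mention the search term."""
    term = term.lower()
    return [
        entry for entry in catalogue
        if term in entry["description"].lower() or term in entry["tags"]
    ]

print(search_catalogue("iot"))
```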
Data Lineage
Another key part of data governance is maintaining a record of where data originated and what happens to it throughout its life cycle, to give a better understanding of the data during analytics. Since a data lake involves continuously adding new datasets and modifying existing ones, it is important to provide traceability of the data and an audit trail of when, where, and why a change was made. This process is referred to as data lineage.
This includes documenting and versioning the internal and external tools that access and modify the data.
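The sketch below is a hypothetical illustration of what a lineage record could capture (source, transformation, tool, reason, timestamp); production deployments commonly rely on dedicated lineage tools such as Apache Atlas rather than hand-rolled logs like this.

```python
# Hypothetical sketch of a lineage record capturing when, where, and why a dataset changed.
from datetime import datetime, timezone

lineage_log = []

def record_lineage(dataset, source, transformation, tool, reason):
    """Append an audit-trail entry describing how a dataset was produced."""
    lineage_log.append({
        "dataset": dataset,
        "derived_from": source,
        "transformation": transformation,
        "tool": tool,
        "reason": reason,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

record_lineage(
    dataset="/datalake/curated/sales_parquet",
    source="/datalake/raw/sales.csv",
    transformation="CSV converted to Parquet with snappy compression",
    tool="Spark batch job",
    reason="Optimize columnar reads for reporting",
)
print(lineage_log[-1])
```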
Concepts In Practice—Data Lake with Dell EMC PowerScale
The Dell EMC PowerScale OneFS operating system provides the intelligence behind the PowerScale scale-out Network Attached Storage (NAS) solutions.
Reference: www.dellemc.com/en-us/storage/isilon/onefs-operating-system.htm
Scenario 1
A bank is looking to build a reporting repository in a structured format that stores historical information about bank customers and the products that the customers have with the bank. This repository must be designed to simplify the reporting tasks and provide accessibility for business intelligence tools for complex analysis. This repository consists of a data set with the following attributes:
The bank is looking to expand its loan business using this structured data and needs your support in analyzing the data to provide useful insights. These insights would help the bank perform complex analysis, such as determining the potential customers who have a greater probability of applying for a loan.
Scenario 1: Tasks
In this scenario, you are asked to perform the following tasks:
Scenario 2
An online media service provider handles a huge amount of data about the movies listed on its online platform. This data is unstructured, and the provider does not have complete information about the movies. The information needs to be updated as and when it becomes available, and the provider should also be able to add new movies when they are released.
The service provider is looking for a solution that can store self-contained units of data and provide faster data retrieval for its customers. It is also looking for an availability solution to protect its data from corruption and loss. The known information about the movies includes:
New attributes can be added as and when the provider obtains more information. It is not necessary to have all the attributes mentioned above.
Scenario 2: Tasks
In this scenario, you are asked to perform the following tasks: