Build Data Lake for Enterprise
Comparison: Data Warehouse and Data Lake
The comparison between a data warehouse and a data lake is as follows:
Figure: The process of data collection, storage, and reporting in Big Data.
| | Data Warehouse | Data Lake |
| --- | --- | --- |
| Data | Structured and transformed | Structured/semi-structured, processed or raw |
| Working | Structure-ingest-analyze | Ingest-analyze-understand |
| Processing | Schema-on-write | Schema-on-read |
| Storage | Expensive when data volumes are high | Built for low-cost storage |
| Flexibility | Fixed configuration; not very flexible | No fixed structure; configure and reconfigure as requirements change |
| Cost/Efficiency | Uses CPU/IO efficiently | Uses storage and processing capabilities efficiently at very low cost |
Data Lake Functions and Tools
The four basic functions of data lakes are ingestion, storage/retention, processing, and access.
| Architecture | Ingestion | Storage/Retention | Processing | Access |
| --- | --- | --- | --- | --- |
| Description | Scalable and extensible to capture both streaming and batch data. Provides the ability to apply business logic, filters, validation, data quality checks, routing, and other business requirements. | Depending on requirements, data is placed in Hadoop HDFS, Hive, HBase, Elasticsearch, or in-memory stores. Metadata management and policy-based data retention are provided. | Processing is provided for both batch and near-real-time use cases. Workflows are provisioned for repeatable data processing, and late-arriving data is handled. | Dashboards and applications that provide valuable business insights. Data is made available to consumers through APIs, MQ feeds, and DB access. |
| Tools (Technology Stack) | Apache Flume, Apache Kafka, Apache Storm, Apache Sqoop, NFS Gateway | HDFS, Hive tables, HBase/MapR DB, Elasticsearch | MapReduce, Hive, Spark, Storm, Drill | Qlik/Tableau/Spotfire, REST APIs, Apache Kafka, JDBC |
Reference: "Architecting Data Lakes" by Alice LaPlante and Ben Sharma, O'Reilly, 2016
Data Lake Elements
So what makes a data lake? A data lake is the combination of four elements:
Design and Performance Considerations
A data lake harnesses more data from various sources quickly, enabling users (data scientists and data analysts) to collaborate and analyze data from various perspectives for faster and better decision making. However, the design and implementation of a data lake require organizations to make infrastructural and process decisions to fully use its key capabilities and act upon the opportunities it provides to generate business value.
To help users overcome the challenges of a data lake and get the most from it, the following key design and performance strategies need to be considered:
Let us look at each of these strategies in detail.
Understand the Need for a Data Lake
To get started with a data lake, an organization must understand more than simply the technology behind it. The main priority must be to determine whether there are adequate use cases to make the implementation yield impactful results.
The core tenet of a data lake is to store data now and analyze later when needed. So it is important for an organization to determine the type of data that they are going to deal with and the type of analytics that need to be performed to make the resources and time involved in implementing and designing a data lake pay off.
To summarize, a data lake may not be a necessity for every organization; there are specific cases and patterns where an organization may choose to opt for one.
When To Use a Data Lake?
Some of the reasons why organizations incorporate a data lake solution include:
Example: Consider a scenario where a healthcare organization generates a huge amount of unstructured data from various sources, and this data drives the organization's ability to deliver successful treatments. The organization hosts applications that are used by multiple clients for various purposes, and these applications generate data at various points in time. It also has standard datasets that must satisfy government regulations and compliance requirements.
In such scenarios, a data lake would be a good option for the following reasons:
When Not To Use a Data Lake?
Some situations where using a data lake is not a good option include:
Example: Consider another scenario where a marketing organization generates a huge amount of structured data from well-defined sources like transactional applications and CRM. They do not have the skillset and modern tools to explore the data. This data is used to generate simple reports such as to determine the number of products purchased or total sales over a period of time.
In this scenario, a data warehouse would be a good fit; this type of use case does not necessitate a data lake, because the data is mostly structured and there is no need for advanced analytics.
Compute
The two main components of a data lake are compute and storage. Both can be located either on-premises or in the cloud. It is up to the organization to decide whether to go with on-premises or cloud deployments, including multicloud and hybrid cloud. Each of these options has its own pros and cons, and the organization needs to evaluate them based on its needs.
The compute system ingests data into the data lake and hosts the applications that process the data to derive insights from it. In a Hadoop-based data lake implementation, processing capacity is provided by relatively inexpensive commodity hardware, which allows parallel processing to access and process data quickly.
The compute capacity required depends on the size of the data and the speed of analysis the organization needs. Analyzing petabytes of data at low cost with fewer resources takes a significant amount of processing time, while analyzing large amounts of data in near real time requires substantial compute capacity and is quite expensive.
Separate compute clusters can be used for different types of processing, such as ingestion, extraction, and analysis, for optimum efficiency. Example: After the data is ingested successfully, the ingestion cluster can be freed. Further, compute capacity requirements may increase during complex analysis and decrease when those analyses are completed. The design should therefore allow compute capacity to be scaled based on demand, as in the sketch below.
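One common way to get this elasticity on a Hadoop/Spark data lake is Spark dynamic allocation, which grows and shrinks the number of executors with the workload. The sketch below assumes a YARN-managed cluster with the external shuffle service enabled; the application name and executor limits are illustrative values, not recommendations.

```python
# Sketch: a Spark session configured for elastic compute, assuming a YARN-based
# Hadoop cluster with the external shuffle service enabled. Values are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("datalake-analysis")
    .config("spark.dynamicAllocation.enabled", "true")      # scale executors with demand
    .config("spark.dynamicAllocation.minExecutors", "2")     # keep a small baseline
    .config("spark.dynamicAllocation.maxExecutors", "50")    # cap cost during heavy analysis
    .config("spark.shuffle.service.enabled", "true")         # required for dynamic allocation on YARN
    .getOrCreate()
)
```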
Storage
Data lakes act as a centralized repository to store both structured data (from sources like transactional applications and operational databases) and unstructured data (from sources like IoT devices, social media, logs, web, and mobile applications). The basic component of any data lake design is physical storage.
Example: Dell EMC PowerScale is a scale-out NAS storage platform that is used for data lakes. It provides an efficient way to store, manage, protect, and analyze growing unstructured data assets. It enables organizations to scale their storage needs separately from their compute needs.
Scalability
Storage should be easily scalable without service interruption to accommodate data that can grow to any size. It should provide the ability to add storage capacity in modular blocks and should be transparent to applications. It is recommended to decouple storage from compute resources to enable independent scaling.
Cost Efficiency
As a data lake involves storing "everything" and all types of data, it is important for organizations to have a low-cost storage solution that meets their requirements and supports storing all types of data in a single repository, allowing faster access for data exploration. It should also support data efficiency features like compression and deduplication at low cost.
Reliability
It is important that data lake storage be robust, with high-availability designs, as it is the primary repository for all data, including an organization's critical data. It should support accessing data through various storage protocols like SQL, JSON, block, file, and object. Example: If an application generates data through NFS, the same data can be accessed for analysis using analytics tools like Hadoop.
Independent of Fixed Schema
As a data lake enables applying a schema on read to data that needs to be analyzed, the underlying storage should also support this approach and should not dictate a fixed schema.
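A minimal sketch of schema-on-read with PySpark follows: raw JSON files sit in the lake as-is, and the consumer applies a schema only at read time. The path and field names are assumptions carried over from the earlier ingestion example, not fixed conventions.

```python
# Sketch of schema-on-read with PySpark: raw JSON files are stored as-is in the
# lake, and a schema is applied only when the data is read for analysis.
# The path and field names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is defined by the consumer at read time, not by the storage layer.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("ts", StringType()),
])

readings = spark.read.schema(schema).json("/datalake/raw/sensor_events/")
readings.createOrReplaceTempView("sensor_events")
spark.sql("SELECT device_id, avg(temperature) FROM sensor_events GROUP BY device_id").show()
```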
Security
Data lake security requirements start with a basic understanding of who should have access to the data lake, to which components of the data lake, and to which portions of the data. To be successful, all of the security mechanisms and strategies for the data lake should be implemented and managed within the framework of the organization's overall security infrastructure and controls. Ensuring the security of the data lake involves four areas.
Access Control
This mechanism focuses on authentication and authorization, such as determining which users have access to a specific data lake component and what roles those users can perform.
Example: When a user performs a search operation, this mechanism determines what data is available for the user to view in the search results. The mechanism also decides whether the user can execute the operation on a specific component and whether the user has permission to modify or delete the data.
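The toy sketch below illustrates the idea of role-based checks on operations. It is purely hypothetical; real data lake deployments typically enforce this through a dedicated authorization layer (for example, Apache Ranger) rather than application code like this.

```python
# Hypothetical, minimal sketch of role-based access checks for data lake operations.
ROLE_PERMISSIONS = {
    "analyst":  {"search": True,  "modify": False, "delete": False},
    "engineer": {"search": True,  "modify": True,  "delete": False},
    "admin":    {"search": True,  "modify": True,  "delete": True},
}

def is_allowed(role: str, operation: str) -> bool:
    """Return True if the given role may perform the operation."""
    return ROLE_PERMISSIONS.get(role, {}).get(operation, False)

print(is_allowed("analyst", "search"))   # True
print(is_allowed("analyst", "delete"))   # False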
Document-level Security
This mechanism is implemented because the data lake contains critical data generated from multiple sources, and only authorized users should be allowed to view or retrieve documents through search results or through analytics.
This type of access control can be applied to all the data or only to confidential data, based on the access permissions associated with the original data at the source. Two important considerations for document-level security include:
Note: Document-level security ensures that users see only those documents to which they have read permission.
Encryption
Since a data lake stores content retrieved from various sources, it is important to protect the data using encryption. It is recommended to apply encryption at the storage level, where the data is at rest, and also over the network, to protect data in transit.
It is also important to consider key management for secure key generation, storage, and distribution of encryption keys.
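As a small illustration of encryption at rest, the sketch below uses the Python "cryptography" package to encrypt a record before it is written to the lake. The payload is made up for illustration, and generating the key inline is only a placeholder for the key management the paragraph above calls for.

```python
# Sketch of symmetric encryption for data at rest using the "cryptography"
# package (pip install cryptography). In practice the key would come from a
# key management service, not be generated inline as it is here.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # placeholder: fetch from a KMS/HSM in production
cipher = Fernet(key)

record = b'{"patient_id": "p-001", "diagnosis": "..."}'    # illustrative payload
encrypted = cipher.encrypt(record)                           # store this in the lake
decrypted = cipher.decrypt(encrypted)                        # authorized read path
assert decrypted == record
```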
Network-level Security
The network layer needs to be secured to prevent access to the data lake from an inappropriate access path. This becomes a critical requirement for cloud-based deployments.
Consider implementing network isolation techniques such as a virtual private network, along with other mechanisms such as firewalls and a defense-in-depth strategy.
File Formats
A data lake offers control over exactly how the data is stored. Data lakes involve an array of elements such as file size, storage type (row or columnar), degree of compression, indexing, schemas, and block sizes. These elements determine how efficiently the data stored in the data lake can be accessed.
File formats supported by a data lake include CSV, JSON, Avro, ORC, and Parquet. A prominent file format designed for Hadoop workloads is Optimized Row Columnar (ORC), which mixes row and columnar layouts: data is organized into groups of rows, and within each group the values are stored in columnar format. This arrangement makes it possible to read, decompress, and process only the values that a query requires, and the data can also be split for parallel operations.
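The sketch below shows one way this plays out in PySpark: raw CSV landed in the lake is rewritten as ORC and Parquet so that later queries touch only the columns they need. The paths, the compression codec, and the column names ("product", "amount") are assumptions for illustration.

```python
# Sketch: converting raw CSV landed in the lake into ORC and Parquet so that
# analytical queries read only the columns they need. Paths and columns are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-formats").getOrCreate()

raw = spark.read.option("header", "true").csv("/datalake/raw/sales.csv")

# Columnar formats with compression for efficient scans.
raw.write.mode("overwrite").orc("/datalake/curated/sales_orc")
raw.write.mode("overwrite").option("compression", "snappy").parquet("/datalake/curated/sales_parquet")

# Only the referenced columns are read and decompressed.
spark.read.orc("/datalake/curated/sales_orc").select("product", "amount").show(5)
```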
Governance
Data governance deals with the management of an organization's data, including the availability, usability, integrity, and security of that data. The main goal of data governance is to ensure that high-quality data is available throughout the data life cycle. It provides systematic structure and management of data in the data lake, making the data meaningful and improving its accessibility.
Governance practices for a data lake include having a set of controls to enforce requirements such as:
Another important consideration that falls under data lake governance is enabling metadata cataloguing.
Metadata Cataloguing
Because a data lake stores a huge amount of raw data with no oversight of the contents, it needs a defined mechanism to catalogue, search, and secure data in order to be effectively usable. The business should know why the data is present in the data lake, who created it, who is using it, and what it contains. Without this, a data lake becomes a data swamp, where the stored data is so disorganized that it cannot be analyzed to gain insights.
Metadata, which is information about data, helps to search, identify, and learn about the datasets stored in a data lake. It provides the description and context of the data. Examples of metadata include author, file size, date created, and so on. A data lake design should incorporate a separate storage layer to store cataloguing metadata.
Some of the principles that can be used to ensure that appropriate metadata is created and maintained are:
The data catalogue provides an interface for users to search all the data assets stored in the data lake. It acts as a single source of truth and provides a single point of access to those assets. The catalogue enables users to find assets quickly and presents the data along with its metadata, helping them determine whether the data they have found is useful.
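To make the catalogue idea concrete, here is a hypothetical, minimal sketch of a catalogue entry and a search over it. Real data lakes would use a dedicated catalogue service; the dataset name, location, and fields are invented for illustration.

```python
# Hypothetical, minimal sketch of a metadata catalogue entry and search.
from datetime import date

catalogue = [
    {
        "dataset": "sensor_events",
        "location": "/datalake/raw/sensor_events/",
        "owner": "iot-team",
        "created": date(2016, 1, 1),
        "description": "Raw device telemetry ingested from Kafka",
        "tags": ["iot", "raw", "json"],
    },
]

def search_catalogue(term: str):
    """Return datasets whose description or tags mention the search term."""
    term = term.lower()
    return [
        entry for entry in catalogue
        if term in entry["description"].lower() or term in entry["tags"]
    ]

print(search_catalogue("iot"))
```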
Data Lineage
Another key part of data governance is maintaining a record of where data originated and what happens to it throughout its life cycle, to give a better understanding of the data during analytics. Since a data lake involves continuously adding new datasets and modifying existing ones, it is important to provide traceability of the data and an audit trail of when, where, and why a change was made. This process is referred to as data lineage.
This includes documenting and versioning the internal and external tools that access and modify the data.
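The sketch below is a hypothetical illustration of what a lineage record could capture (source, transformation, tool, reason, timestamp); production deployments commonly rely on dedicated lineage tools such as Apache Atlas rather than hand-rolled logs like this.

```python
# Hypothetical sketch of a lineage record capturing when, where, and why a dataset changed.
from datetime import datetime, timezone

lineage_log = []

def record_lineage(dataset, source, transformation, tool, reason):
    """Append an audit-trail entry describing how a dataset was produced."""
    lineage_log.append({
        "dataset": dataset,
        "derived_from": source,
        "transformation": transformation,
        "tool": tool,
        "reason": reason,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

record_lineage(
    dataset="/datalake/curated/sales_parquet",
    source="/datalake/raw/sales.csv",
    transformation="CSV converted to Parquet with snappy compression",
    tool="Spark batch job",
    reason="Optimize columnar reads for reporting",
)
print(lineage_log[-1])
```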
Concepts In Practice—Data Lake with Dell EMC PowerScale
The Dell EMC PowerScale OneFS operating system provides the intelligence behind the PowerScale scale-out Network Attached Storage (NAS) solutions.
Reference: www.dellemc.com/en-us/storage/isilon/onefs-operating-system.htm
Scenario 1
A bank is looking to build a reporting repository in a structured format that stores historical information about bank customers and the products that the customers have with the bank. This repository must be designed to simplify the reporting tasks and provide accessibility for business intelligence tools for complex analysis. This repository consists of a data set with the following attributes:
The bank is looking to expand its loan business using this structured data and needs your support in analyzing the data to provide useful insights. These insights would help the bank perform complex analysis, such as determining the potential customers who have a greater probability of applying for a loan.
Scenario 1: Tasks
In this scenario, you are asked to perform the following tasks:
Scenario 2
An online media service provider handles a huge amount of data about the movies listed on its online platform. This data is unstructured, and the provider does not have complete information about the movies. The information needs to be updated as and when it becomes available, and the provider should also be able to add new movies when they are released.
The service provider is looking for a solution that can store self-contained units of data and provide faster data retrieval for its customers. It is also looking for an availability solution to protect its data from corruption and loss. The known information about the movies includes:
New attributes can be added as and when the provider obtains more information. It is not necessary to have all the attributes mentioned above.
Scenario 2: Tasks
In this scenario, you are asked to perform the following tasks: