Building a Data Lake for the Enterprise

 

Comparison: Data Warehouse and Data Lake

[Figure: Data Lake architecture vs. Data Warehouse architecture. In the data lake architecture, structured and unstructured data flows from the data sources into the data lake as raw data, is processed, and feeds analytics and reporting. In the data warehouse architecture, structured data from the data sources passes through ETL (Extract, Transform, Load) into the data warehouse (metadata + summary + raw data), which feeds reporting and analytics.]

The figure above represents the process of data collection, storage, and reporting in Big Data. The comparison between a data warehouse and a data lake is as follows:

| Aspect          | Data Warehouse                          | Data Lake                                                                    |
|-----------------|-----------------------------------------|------------------------------------------------------------------------------|
| Data            | Structured and transformed              | Structured/semi-structured, processed/raw                                     |
| Working         | Structure-ingest-analyze                | Ingest-analyze-understand                                                     |
| Processing      | Schema-on-write                         | Schema-on-read                                                                |
| Storage         | Expensive when volumes are high         | Built for low-cost storage                                                    |
| Flexibility     | Fixed configuration, not very flexible  | No particular structure; configure and reconfigure as per your requirements   |
| Cost/Efficiency | Efficiently uses CPU/IO                 | Efficiently uses storage and processing capabilities at very low cost         |
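
To make the schema-on-write versus schema-on-read distinction concrete, here is a minimal PySpark sketch (the paths and field names are hypothetical): the lake stores events exactly as they arrived, and each consumer applies its own schema only at read time, whereas a warehouse would require the schema up front, at load time.

```python
# Minimal schema-on-read sketch using PySpark. The HDFS paths and
# field names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The lake stored the events exactly as they arrived; no schema was
# enforced at write time.
raw = spark.read.json("hdfs:///lake/raw/events/")  # schema inferred on read

# A consumer can instead impose its own schema at read time, keeping
# only the fields this particular analysis cares about.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])
typed = spark.read.schema(schema).json("hdfs:///lake/raw/events/")
typed.groupBy("event_id").sum("amount").show()
```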

Data Lake Functions and Tools

The four basic functions of a data lake are ingestion, storage/retention, processing, and access.

Ingestion
  • Scalable, extensible architecture that captures both streaming and batch data.
  • Applies business logic, filters, validation, data-quality checks, routing, and other business requirements.
  • Tools (technology stack): Apache Flume, Apache Kafka, Apache Storm, Apache Sqoop, NFS Gateway

Storage/Retention
  • Depending on the requirements, data is placed into Hadoop HDFS, Hive, HBase, Elasticsearch, or in-memory stores.
  • Provides metadata management and policy-based data retention.
  • Tools (technology stack): HDFS, Hive tables, HBase/MapR-DB, Elasticsearch

Processing
  • Supports both batch and near-real-time use cases.
  • Provisions workflows for repeatable data processing and handles late-arriving data.
  • Tools (technology stack): MapReduce, Hive, Spark, Storm, Drill

Access
  • Dashboards and applications that provide valuable business insights.
  • Data is made available to consumers via APIs, MQ feeds, and DB access.
  • Tools (technology stack): Qlik/Tableau/Spotfire, REST APIs, Apache Kafka, JDBC

Reference: "Architecting Data Lakes" by Alice LaPlante and Ben Sharma (O'Reilly, 2016).

Data Lake Elements

[Figure: The data lake as a scale-out single repository with multi-protocol/workload tiers (SMB, NFS, FTP, NDMP, HTTP, HDFS, REST, SWIFT, S3), in-place analytics, the ability to manage PBs of data, and enterprise features.]

So what makes a Data Lake? Well, a data lake is the combination of four elements:
  • Consolidation without compromise
    • Consolidates all unstructured data, reducing the multiple management points of individual data silos.
    • Simplifies storage from many file servers to a single file system or single volume of storage.
    • Offers industry-leading storage efficiencies.
  • Enterprise Features
    • All the benefits of having a single volume of data with security and protection features.
    • Supports RBAC, access zones, auditing, encryption, snapshots, and deduplication.
  • Data Collaboration
    • Offers native access to both traditional protocols (SMB, NFS, FTP, and NDMP) and next-generation protocols (HTTP, HDFS, OpenStack Swift, and Amazon S3), since all the unstructured data is collected into a single repository.
    • The other side of data collaboration is managing multiple workloads with a proper data life cycle. The data lake can automatically tier your data to the most appropriate storage tier: whether you need high performance, dense capacity, or something in between, different node types let you separate capacity from performance. A storage architecture that accepts not only home directories and file shares but also video-surveillance capacity and HPC workloads, and that can cross-correlate all of that data with in-place analytics, saves on both OPEX and CAPEX.
  • Analytics
    • What makes the data lake even more valuable is the ability to analyze all your data. Whether you are seeking a competitive advantage by discovering a new revenue stream, improving the customer experience, or enabling proactive maintenance, you can now use just about every major Hadoop distribution, including Cloudera, Hortonworks, and IBM.
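
As a small illustration of such in-place analytics, here is a hedged PySpark sketch (the path and column names are hypothetical) that runs an aggregation directly against data stored in the lake, without copying it into a separate analytics silo:

```python
# Minimal in-place analytics sketch using PySpark. The path and
# column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("in-place-analytics").getOrCreate()

# Analyze the data where it lives in the lake -- no copy into a
# separate analytics silo.
orders = spark.read.parquet("hdfs:///lake/curated/orders/")
(orders
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("lifetime_value"))
    .orderBy(F.desc("lifetime_value"))
    .show(10))
```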



Design and Performance Considerations

A data lake harnesses more data from various sources quickly, enabling users (data scientists and data analysts) to collaborate and analyze data from various perspectives for faster and better decision making. However, the design and implementation of a data lake requires organizations to make infrastructure and process decisions to fully use its key capabilities and act on the opportunities it provides to generate business value.
To help users overcome the challenges of a data lake and get the most from it, here are some of the key design and performance strategies to consider:
  • Understand the need for a data lake
  • Compute
  • Storage
  • Security
  • File formats (see the sketch after this list)
  • Governance
    • Metadata cataloguing
    • Data lineage
Let us look at each of these strategies in detail.
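
Before going through them, here is a minimal sketch of the file-formats consideration (PySpark; the paths, columns, and partitioning scheme are hypothetical): rewriting raw CSV as partitioned, columnar Parquet makes downstream scans cheaper, since readers skip irrelevant partitions and read only the columns they need.

```python
# Minimal file-format sketch using PySpark. Paths, columns, and the
# partitioning scheme are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-format-demo").getOrCreate()

# Raw CSV exactly as it landed in the lake.
raw = spark.read.option("header", True).csv("hdfs:///lake/raw/sales/")

# Rewriting the same data as partitioned, columnar Parquet makes
# downstream scans cheaper: readers skip irrelevant partitions and
# read only the columns they need.
(raw.write
    .mode("overwrite")
    .partitionBy("sale_date")
    .parquet("hdfs:///lake/curated/sales/"))
```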

Understand the Need for a Data Lake

To get started with a data lake, an organization must understand more than simply the technology behind it. The main priority must be to determine whether there are adequate use cases to make the implementation yield impactful results.
The core tenet of a data lake is to store data now and analyze later when needed. So it is important for an organization to determine the type of data that they are going to deal with and the type of analytics that need to be performed to make the resources and time involved in implementing and designing a data lake pay off.
To summarize, a data lake may not be a necessity for every organization; there are, however, specific cases and patterns in which an organization may choose to adopt one.
