Training: Scala, Spark and Solr – 5 days
Prerequisites:
- Knowledge of the Hadoop ecosystem
- Knowledge of Scala
(It is mandatory to meet the training prerequisites before nominating for the session.)
The Apache Spark & Scala course is a 4-day (32-hour) course covering Big Data concepts, challenges in Big Data processing, approaches to Big Data problems using Apache Spark, and Spark specifics such as its components, installation steps, RDDs, transformations, actions, lazy execution, and integration with HDFS.
After the completion of this course, you will be able to:
- Learn Scala programming
- Understand Big Data and its associated challenges
- Find an approach to Big Data problems with Apache Spark
- Implement Apache Spark concepts
- Apply Scala for Spark
- Understand the DataFrame concept and how to run SQL queries using Spark SQL
- Follow the latest emerging Spark-based trends such as MLlib and GraphX
Hardware/Software requirements:
- Windows machine with 8 GB RAM
- Internet connection for setting up SBT/Maven project
- Virtualization feature on the machine should be enabled
Course Outline:
Day 1:
- Hello World
- Primitive Types
- Type inference
- Vars vs Vals
- Lazy Vals
- Methods
- Pass By Name
- No parens/Brackets
- Default Arguments
- Named Arguments
- Introduction
- Inheritance
- Main/Additional Constructors
- Private Constructors
- Uniform Access
- Case Classes
- Objects
- Traits
- Lists
- Collection Manipulation
- Simple Methods
- Methods With Functions
- Use Cases With Common Methods
- Tuples
- Type parameterization
- Covariance
- Contravariance
- Type Upper Bounds
- 'Nothing' Type
- Option Implementation
- Like Lists
- Practice Application
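The type-parameterization and Option topics above can be sketched together as a simplified Option-like type. This is only an illustration, not the standard library's implementation; the names Maybe, Just, Empty and parseDigits are hypothetical.

```scala
object MaybeDemo {
  // +A makes Maybe covariant: Maybe[Nothing] is a subtype of Maybe[A] for any A.
  sealed trait Maybe[+A] {
    def map[B](f: A => B): Maybe[B] = this match {
      case Just(a) => Just(f(a))
      case Empty   => Empty
    }
    def flatMap[B](f: A => Maybe[B]): Maybe[B] = this match {
      case Just(a) => f(a)
      case Empty   => Empty
    }
    // B >: A is a lower bound; a parameter typed as plain A would not
    // compile because A is covariant. `default` is passed by name.
    def getOrElse[B >: A](default: => B): B = this match {
      case Just(a) => a
      case Empty   => default
    }
  }
  final case class Just[+A](value: A) extends Maybe[A]
  // Nothing is a subtype of every type, so one Empty serves all Maybe[A].
  case object Empty extends Maybe[Nothing]

  // Hypothetical helper: parses a string of decimal digits.
  def parseDigits(s: String): Maybe[Int] =
    if (s.nonEmpty && s.forall(_.isDigit)) Just(s.toInt) else Empty

  def main(args: Array[String]): Unit = {
    println(parseDigits("21").map(_ * 2).getOrElse(0))
    println(parseDigits("abc").map(_ * 2).getOrElse(0))
  }
}
```

Like a list of at most one element, the type supports map and flatMap, so it can also be used in for-expressions.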
Anonymous Classes:
- Introduction
- Structural Typing
- Anonymous Classes With Structural Typing
Special Methods:
- Apply
- Update
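A minimal sketch of the two special methods, assuming a hypothetical Scores class: the compiler rewrites call syntax to apply and indexed assignment to update.

```scala
object SpecialMethodsDemo {
  // Hypothetical class used only for illustration.
  class Scores {
    private val data = scala.collection.mutable.Map.empty[String, Int]

    // scores("ann") is rewritten to scores.apply("ann")
    def apply(name: String): Int = data.getOrElse(name, 0)

    // scores("ann") = 5 is rewritten to scores.update("ann", 5)
    def update(name: String, value: Int): Unit = {
      data(name) = value
    }
  }

  object Scores {
    // Companion apply lets callers write Scores() instead of new Scores
    def apply(): Scores = new Scores
  }

  def main(args: Array[String]): Unit = {
    val scores = Scores()
    scores("ann") = 5
    println(scores("ann"))
    println(scores("bob"))
  }
}
```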
Closures and Functions:
- Introduction
- Applications
- Implicit Values/Parameters
- Implicit Conversions
- With Anonymous Classes
- Implicit Classes
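The implicit topics above might look like the following in Scala 2 syntax. This is a sketch only; Separator, join and RichGreeting are hypothetical names.

```scala
object ImplicitsDemo {
  final case class Separator(value: String)

  // Implicit value: picked up wherever an implicit Separator is required.
  implicit val defaultSeparator: Separator = Separator(", ")

  // Implicit parameter: `sep` is filled in by the compiler when omitted.
  def join(items: Seq[String])(implicit sep: Separator): String =
    items.mkString(sep.value)

  // Implicit class: wraps a String so that `shout` appears as a String method.
  implicit class RichGreeting(s: String) {
    def shout: String = s.toUpperCase + "!"
  }

  def demoJoin: String = join(Seq("a", "b"))                      // uses defaultSeparator
  def demoExplicit: String = join(Seq("a", "b"))(Separator("-"))  // explicit argument wins
  def demoShout: String = "ok".shout

  def main(args: Array[String]): Unit = {
    println(demoJoin)
    println(demoExplicit)
    println(demoShout)
  }
}
```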
For Loops:
- Introduction
- Coding Style
- With Options
- And flatMap
- Guards
- Definitions
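As a sketch of the for-expression topics above (Options, flatMap, guards, definitions), with hypothetical helper names:

```scala
object ForDemo {
  // Hypothetical helper: parses a string of decimal digits.
  def toInt(s: String): Option[Int] =
    if (s.nonEmpty && s.forall(_.isDigit)) Some(s.toInt) else None

  // With Options: yields Some only if every generator produces a value.
  def add(a: String, b: String): Option[Int] =
    for {
      x <- toInt(a)
      y <- toInt(b)
    } yield x + y

  // The same logic with flatMap/map, which is what the for-expression expands to.
  def addDesugared(a: String, b: String): Option[Int] =
    toInt(a).flatMap(x => toInt(b).map(y => x + y))

  // A guard (if ...) and a definition (sq = ...) inside a for-expression.
  def evenSquares(limit: Int): List[Int] =
    for {
      n <- (1 to limit).toList
      if n % 2 == 0
      sq = n * n
    } yield sq

  def main(args: Array[String]): Unit = {
    println(add("2", "3"))
    println(add("2", "x"))
    println(evenSquares(6))
  }
}
```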
Var Args:
- Introduction
- Ascribing the _* type
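A short sketch of variable-length arguments, assuming Scala 2 syntax for the `_*` ascription; the total method is hypothetical.

```scala
object VarArgsDemo {
  // nums is seen as a Seq[Int] inside the method body.
  def total(nums: Int*): Int = nums.sum

  // Arguments can be passed one by one...
  val fromLiterals: Int = total(1, 2, 3)

  // ...or an existing sequence can be expanded with the `: _*` ascription.
  val xs: List[Int] = List(4, 5, 6)
  val fromList: Int = total(xs: _*)

  def main(args: Array[String]): Unit = {
    println(fromLiterals)
    println(fromList)
  }
}
```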
Partial Functions:
- Introduction
- Match
- Match Values/Constants
- Match Types
- Extractors
- If Conditions
- Or
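The matching topics above can be illustrated in one sketch. The Email extractor, describe and half are hypothetical names chosen for the example.

```scala
object MatchDemo {
  // Extractor: defining unapply lets Email(user, domain) be used as a pattern.
  object Email {
    def unapply(s: String): Option[(String, String)] = s.split("@") match {
      case Array(user, domain) if user.nonEmpty && domain.nonEmpty =>
        Some((user, domain))
      case _ => None
    }
  }

  def describe(x: Any): String = x match {
    case 0 | 1               => "zero or one"            // constants combined with "or"
    case n: Int if n < 0     => "negative int"           // type pattern with an if condition
    case n: Int              => "int " + n               // type pattern
    case Email(user, domain) => user + " at " + domain   // extractor
    case s: String           => "string " + s
    case _                   => "something else"
  }

  // A partial function is defined only for inputs matched by its cases.
  val half: PartialFunction[Int, Int] = {
    case n if n % 2 == 0 => n / 2
  }

  def main(args: Array[String]): Unit = {
    println(describe("ann@example.com"))
    println(half.isDefinedAt(3))
    println(List(1, 2, 3, 4).collect(half))
  }
}
```

Using collect with a partial function keeps only the elements for which it is defined.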
Working with XML & JSON
Day 2:
Introduction of Spark
Evolution of distributed systems
Why do we need a new generation of distributed systems?
Limitations of MapReduce in Hadoop
Understanding the need for Batch vs. Real-Time Analytics
Batch Analytics - Hadoop Ecosystem Overview, Real-Time Analytics Options
Introduction to stream and in-memory analysis
What is Spark?
A Brief History: Spark
- Installing Spark and sbt
- Integrating Spark in Eclipse
- Running Spark in Eclipse and Spark Standalone cluster
Using Scala to create a Spark application
Invoking Spark Shell
Creating the SparkContext
Loading a File in Shell
Performing Some Basic Operations on Files in Spark Shell
Building a Spark Project with sbt
Running a Spark Project with sbt
Caching Overview
Distributed Persistence
Spark Streaming Overview
Example: Streaming Word Count
Testing Tips in Scala
Performance Tuning Tips in Spark
Shared Variables: Broadcast Variables
Shared Variables: Accumulators
Day 3:
Running SQL queries using Spark SQL
Starting Point: SQLContext
Creating DataFrames
DataFrame Operations
Running SQL Queries Programmatically
Interoperating with RDDs
Inferring the Schema Using Reflection
Programmatically Specifying the Schema
Data Sources
Generic Load/Save Functions
Save Modes
Saving to Persistent Tables
Parquet Files
Loading Data Programmatically
Partition Discovery
Schema Merging
JSON Datasets
Hive Tables
JDBC To Other Databases
Performance Tuning
Caching Data In Memory
Compatibility with Apache Hive
Unsupported Hive Functionality
- Running SQL Queries with MySQL
- Running Hive queries
- Reading JSON file and storing it as a Parquet format
Spark Streaming
Micro batch
Discretized Streams (DStreams)
Input DStreams and Receivers
DStream to RDD
Basic Sources
Advanced Sources
Transformations on DStreams
Output Operations on DStreams
Design Patterns for using foreachRDD
DataFrame and SQL Operations
Socket stream
File Stream
Stateful operations
How do stateful operations work?
Window Operations
Join Operations
- Network-wordcount with Spark Streaming
- Processing Flume data with Spark Streaming
- Processing Kafka data with Spark Streaming
- Processing Twitter data with Spark Streaming
Day 4:
Spark ML Programming
Main Concepts
ML Dataset
ML Algorithms
Model Selection via Cross-Validation
- Clustering with K-means
- Classification examples
- Linear regression techniques
Tuning Spark
Data Serialization
Memory Tuning
Determining Memory Consumption
Tuning Data Structures
Serialized RDD Storage
Garbage Collection Tuning
Other Considerations
Level of Parallelism
Memory Usage of Reduce Tasks
Broadcasting Large Variables
Data Locality
Job Scheduling and Monitoring
Scheduling Across Applications
Dynamic Resource Allocation
Configuration and Setup
Resource Allocation Policy
Request Policy
Remove Policy
Graceful Decommission of Executors
Scheduling Within an Application
Fair Scheduler Pools
Default Behavior of Pools
Configuring Pool Properties
Day 5:
Apache Solr
The Fundamentals
- About Solr
- Installing and running Solr
- Adding content to Solr
- Reading a Solr XML response
- Changing parameters in the URL
- Using the browse interface
- Sorting results
- Query parsers
- More queries
- Hardwiring request parameters
- Adding fields to default search
- Faceting
- Result grouping
- Adding your own content to Solr
- Deleting data from Solr
- Building a bookstore search
- Adding book data
- Exploring the book data
- Dedupe updateprocessor
Updating your schema
- Adding fields to the schema
- Analyzing text
- Field weighting
- Phrase queries
- Function queries
- Fuzzier search
- Sounds-like
- Introduction
- How SolrCloud works
- Commit strategies
- ZooKeeper
- Managing Solr config files
