Training: Scala, Spark and Solr – 5 days

Prerequisites:
  1. Knowledge of Hadoop Eco-system
  2. Knowledge of Scala
 (It is mandatory to meet the training prerequisites before nominating for the session.)

COURSE OVERVIEW
The Apache Spark & Scala course is a 4-day (32-hour) course covering Big Data concepts, the challenges of Big Data processing, approaches to Big Data problems using Apache Spark, and Spark specifics such as its components, installation steps, RDDs, transformations, actions, lazy execution, and integration with HDFS.

Objective:
After the completion of this course, you will be able to:
  • Learn Scala programming
  • Understand Big Data and its associated challenges
  • Find an approach to Big Data problems with Apache Spark
  • Implement Apache Spark concepts
  • Apply Scala for Spark
  • Understand the DataFrame concept and how to run SQL queries using Spark SQL
  • Follow emerging trends such as MLlib and GraphX, which are built on Spark

Hardware/Software requirements:
  1. Windows machine with 8 GB RAM
  2. Internet connection for setting up SBT/Maven project
  3. Virtualization feature on the machine should be enabled

Course Outline:
Day 1:
Basics:
  • Hello World
  • Primitive Types
  • Type inference
  • Vars vs Vals
  • Lazy Vals
  • Methods
  • Pass By Name
  • No parens/Brackets
  • Default Arguments
  • Named Arguments
Classes:
  • Introduction
  • Inheritance
  • Main/Additional Constructors
  • Private Constructors
  • Uniform Access
  • Case Classes
  • Objects
  • Traits
Collections:
  • Lists
  • Collection Manipulation
  • Simple Methods
  • Methods With Functions
  • Use Cases With Common Methods
  • Tuples
Types:
  • Type parameterization
  • Covariance
  • Contravariance
  • Type Upper Bounds
  • 'Nothing' Type
Options:
  • Option Implementation
  • Like Lists
  • Practice Application
Anonymous Classes:
  • Introduction
  • Structural Typing
  • Anonymous Classes With Structural Typing
Special Methods:
  • Apply
  • Update
Closures and Functions
Currying:
  • Introduction
  • Applications
Implicits:
  • Implicit Values/Parameters
  • Implicit Conversions
  • With Anonymous Classes
  • Implicit Classes
For Loops:
  • Introduction
  • Coding Style
  • With Options
  • And flatMap
  • Guards
  • Definitions
Var Args:
  • Introduction
  • Ascribing the _* type
Partial Functions:
  • Introduction
  • Match
  • Match Values/Constants
  • Match Types
  • Extractors
  • If Conditions
  • Or
Working with XML & JSON

Day 2:
Introduction of Spark
Evolution of distributed systems
Why do we need a new generation of distributed systems?
Limitations of MapReduce in Hadoop
Understanding the need for batch vs. real-time analytics
Batch Analytics - Hadoop Ecosystem Overview, Real-Time Analytics Options
Introduction to stream and in-memory analysis
What is Spark?
A Brief History: Spark
Hands-On
  1. Installing Spark and sbt
  2. Integrating Spark in Eclipse
  3. Running Spark in Eclipse and on a Spark Standalone cluster

Using Scala for creating Spark Application
Invoking Spark Shell
Creating the SparkContext
Loading a File in Shell
Performing Some Basic Operations on Files in Spark Shell
Building a Spark Project with sbt
Running a Spark Project with sbt
Caching Overview
Distributed Persistence
Spark Streaming Overview
Example: Streaming Word Count
Testing Tips in Scala
Performance Tuning Tips in Spark
Shared Variables: Broadcast Variables
Shared Variables: Accumulators

Day 3:

Running SQL queries using Spark SQL

Starting Point: SQLContext

Creating DataFrames

DataFrame Operations

Running SQL Queries Programmatically

Interoperating with RDDs

Inferring the Schema Using Reflection

Programmatically Specifying the Schema

Data Sources

Generic Load/Save Functions

Save Modes

Saving to Persistent Tables

Parquet Files

Loading Data Programmatically

Partition Discovery

Schema Merging

JSON Datasets

Hive Tables

JDBC To Other Databases

Troubleshooting

Performance Tuning

Caching Data In Memory

Compatibility with Apache Hive

Unsupported Hive Functionality

Hands-On
  1. Running SQL queries with MySQL
  2. Running Hive queries
  3. Reading a JSON file and storing it in Parquet format

Spark Streaming

Micro batch

Discretized Streams (DStreams)

Input DStreams and Receivers

DStream to RDD

Basic Sources

Advanced Sources

Transformations on DStreams


Output Operations on DStreams

Design Patterns for using foreachRDD

DataFrame and SQL Operations

Checkpointing

Socket stream

File Stream

Stateful operations

How do stateful operations work?

Window Operations

Join Operations


Hands-On
  1. Network word count with Spark Streaming
  2. Processing Flume data with Spark Streaming
  3. Processing Kafka data with Spark Streaming
  4. Processing Twitter data with Spark Streaming

Day 4:

Spark ML Programming 

Main Concepts
ML Dataset
ML Algorithms
Model Selection via Cross-Validation
Hands-On
  1. Clustering with K-means
  2. Classification examples
  3. Linear regression techniques

Tuning Spark
Data Serialization
Memory Tuning
Determining Memory Consumption
Tuning Data Structures
Serialized RDD Storage
Garbage Collection Tuning
Other Considerations
Level of Parallelism
Memory Usage of Reduce Tasks
Broadcasting Large Variables
Data Locality
Summary

Job Scheduling and Monitoring
Overview
Scheduling Across Applications
Dynamic Resource Allocation
Configuration and Setup
Resource Allocation Policy
Request Policy
Remove Policy
Graceful Decommission of Executors
Scheduling Within an Application
Fair Scheduler Pools
Default Behavior of Pools
Configuring Pool Properties



Day 5:
Apache Solr
The Fundamentals
  • About Solr
  • Installing and running Solr
  • Adding content to Solr
  • Reading a Solr XML response
  • Changing parameters in the URL
  • Using the browse interface
Searching
  • Sorting results
  • Query parsers
  • More queries
  • Hardwiring request parameters
  • Adding fields to default search
  • Faceting
  • Result grouping
Indexing
  • Adding your own content to Solr
  • Deleting data from Solr
  • Building a bookstore search
  • Adding book data
  • Exploring the book data
  • Dedupe updateprocessor
Updating your schema
  • Adding fields to the schema
  • Analyzing text
Relevance
  • Field weighting
  • Phrase queries
  • Function queries
  • Fuzzier search
  • Sounds-like
SolrCloud
  • Introduction
  • How SolrCloud works
  • Commit strategies
  • ZooKeeper
  • Managing Solr config files
