Lecture-1 Hadoop Installation and Setup
· The architecture of Hadoop cluster
· What is High Availability and Federation?
· How to set up a production cluster?
· Various shell commands in Hadoop
· Understanding configuration files in Hadoop
· Installing a single node cluster with Cloudera Manager
· Understanding Spark, Scala, Sqoop, Pig, and Flume
· Practical Exercise
Lecture-2 Introduction to Big Data Hadoop and Understanding HDFS and MapReduce
· Introducing Big Data and Hadoop
· What is Big Data and where does Hadoop fit in?
· Two important Hadoop ecosystem components, namely, MapReduce and HDFS
· In-depth Hadoop Distributed File System (replication, block size, Secondary NameNode, High Availability) and in-depth YARN (ResourceManager and NodeManager)
· HDFS working mechanism
· Data replication process
· How to determine the size of the block?
· Understanding the DataNode and the NameNode
· Practical Exercise
Lecture-3 Deep Dive in MapReduce
· Learning the working mechanism of MapReduce
· Understanding the mapping and reducing stages in MR
· Various terminologies in MR like Input Format, Output Format, Partitioners, Combiners, Shuffle, and Sort
· How to write a WordCount program in MapReduce?
· How to write a Custom Partitioner?
· What is a MapReduce Combiner?
· How to run a job in a local job runner
· Deploying a unit test
· What is a map side join and reduce side join?
· What is a tool runner?
· How to use counters, dataset joining with map side, and reduce side joins?
· Practical Exercise
Lecture-4 Introduction to Hive
· Introducing Hadoop Hive
· Detailed architecture of Hive
· Comparing Hive with Pig and RDBMS
· Working with Hive Query Language
· Creation of a database, table, group by and other clauses
· Various types of Hive tables, HCatalog
· Storing the Hive Results, Hive partitioning, and Buckets
· Database creation in Hive
· Dropping a database
· Hive table creation
· How to change the database?
· Data loading
· Dropping and altering table
· Pulling data by writing Hive queries with filter conditions
· Table partitioning in Hive
· What is a group by clause?
· Practical Exercise
Lecture-5 Advanced Hive and Impala
· Indexing in Hive
· The Map Side Join in Hive
· Working with complex data types
· The Hive user-defined functions
· Introduction to Impala
· Comparing Hive with Impala
· The detailed architecture of Impala
· How to work with Hive queries?
· The process of joining the table and writing indexes
· External table and sequence table deployment
· Data storage in a different table
· Practical Exercise
Lecture-6 Introduction to Pig
· Apache Pig introduction and its various features
· Various data types and schema in Pig
· The available functions in Pig; Pig Bags, Tuples, and Fields
· Working with Pig in MapReduce and local mode
· Loading of data
· Limiting data to 4 rows
· Storing the data into files and working with Group By, Filter By, Distinct, Cross, and Split in Pig
· Practical Exercise
Lecture-7 Flume, Sqoop and HBase
· Apache Sqoop introduction
· Importing and exporting data
· Performance improvement with Sqoop
· Sqoop limitations
· Introduction to Flume and understanding the architecture of Flume
· What is HBase and the CAP theorem?
· Working with Flume to generate Sequence Number and consume it
· Using the Flume Agent to consume the Twitter data
· Using AVRO to create Hive Table
· AVRO with Pig
· Creating Table in HBase
· Deploying Disable, Scan, and Enable Table
· Practical Exercise
Lecture-8 Writing Spark Applications Using Scala
· Using Scala for writing Apache Spark applications
· Detailed study of Scala
· The need for Scala
· The concept of object-oriented programming
· Executing the Scala code
· Class concepts in Scala such as getters, setters, constructors, abstract classes, extending objects, and overriding methods
· The Java and Scala interoperability
· The concept of functional programming and anonymous functions
· Bobsrockets package and comparing the mutable and immutable collections
· Scala REPL, Lazy Values, Control Structures in Scala, Directed Acyclic Graph (DAG), first Spark application using SBT/Eclipse, Spark Web UI, Spark in Hadoop ecosystem
· Practical Exercise
Lecture-9 Spark framework
· Detailed Apache Spark and its various features
· Comparing with Hadoop
· Various Spark components
· Combining HDFS with Spark and Scalding
· Introduction to Scala
· Importance of Scala and RDD
· Practical Exercise
Lecture-10 RDD in Spark
· Understanding the Spark RDD operations
· Comparison of Spark with MapReduce
· What is a Spark transformation?
· Loading data in Spark
· Types of RDD operations viz. transformation and action
· What is a Key/Value pair?
· Practical Exercise
Lecture-11 Data Frames and Spark SQL
· The detailed Spark SQL
· The significance of SQL in Spark for working with structured data processing
· Spark SQL JSON support
· Working with XML data and parquet files
· Creating Hive Context
· Writing Data Frame to Hive
· How to read data from a JDBC source?
· Significance of a Spark data frame
· How to create a data frame?
· What is manual schema inference?
· Working with CSV files, JDBC table reading, data conversion from Data Frame to JDBC, Spark SQL user-defined functions, shared variables, and accumulators
· How to query and transform data in Data Frames?
· How does a data frame provide the benefits of both Spark RDD and Spark SQL?
· Deploying Hive on Spark as the execution engine
· Practical Exercise
Lecture-12 Machine Learning Using Spark (MLlib)
· Introduction to Spark MLlib
· Understanding various algorithms
· What is a Spark iterative algorithm?
· Spark graph processing analysis
· Introducing Machine Learning
· K-Means clustering
· Spark variables like shared and broadcast variables
· What are accumulators?
· Various ML algorithms supported by MLlib
· Linear regression, logistic regression, decision tree, random forest, and K-means clustering techniques
· Practical Exercise
Lecture-13 Integrating Apache Flume and Apache Kafka
· Why Kafka?
· What is Kafka?
· Kafka architecture
· Kafka workflow
· Configuring Kafka cluster
· Basic operations
· Kafka monitoring tools
· Integrating Apache Flume and Apache Kafka
· Practical Exercise
Lecture-14 Spark Streaming
· Introduction to Spark streaming
· The architecture of Spark streaming
· Working with the Spark streaming program
· Processing data using Spark streaming
· Requesting count and DStream
· Multi-batch and sliding window operations
· Working with advanced data sources
· Features of Spark streaming
· Spark Streaming workflow
· Initializing StreamingContext
· Discretized Streams (DStreams)
· Input DStreams and Receivers
· Transformations on DStreams
· Output Operations on DStreams
· Windowed operators and their uses
· Important Windowed operators and Stateful operators
· Practical Exercise
Lecture-15 Hadoop Administration – Multi-node Cluster Setup Using Amazon EC2
· Create a 4-node Hadoop cluster setup
· Running the MapReduce Jobs on the Hadoop cluster
· Successfully running the MapReduce code
· Working with the Cloudera Manager setup
· Practical Exercise
Lecture-16 Hadoop Administration – Cluster Configuration
· Overview of Hadoop configuration
· The importance of Hadoop configuration files
· The various configuration parameters and their values
· The HDFS parameters and MapReduce parameters
· Setting up the Hadoop environment
· The Include and Exclude configuration files
· The administration and maintenance of NameNode and DataNode directory structures and files
· What is a File system image?
· Understanding Edit log
· Practical Exercise
Lecture-17 Hadoop Administration
· How to ensure MapReduce file system recovery in different scenarios
· JMX monitoring of the Hadoop cluster
· How to use the logs and stack traces for monitoring and troubleshooting
· Using the Job Scheduler for scheduling jobs in the same cluster
· Getting the MapReduce job submission flow
· The FIFO scheduler
· Getting to know the Fair Scheduler and its configuration
· Practical Exercise
Lecture-18 ETL Connectivity with Hadoop Ecosystem (Self-Paced)
· How do ETL tools work in the Big Data industry?
· Introduction to ETL and data warehousing
· Working with prominent use cases of Big Data in the ETL industry
· End-to-end ETL PoC showing Big Data integration with an ETL tool
· Connecting to HDFS from an ETL tool
· Moving data from the local system to HDFS
· Moving data from a DBMS to HDFS
· Working with Hive from an ETL tool
· Creating a MapReduce job in an ETL tool
· Practical Exercise