Big Data

Hadoop Course Content                                            

  1. Understanding Big Data.
    1. What is Big Data?
    2. Big Data characteristics.

 

  1. Hadoop Distributions:
    1. Cloudera
    2. MapR
    3. Hortonworks
    4. Amazon
  2. Introduction to Apache Hadoop.
    1. Flavors of Hadoop: BigInsights, Google Query, etc.
  3. Hadoop Ecosystem components:
  4. Understanding Hadoop Cluster
  5. Hadoop Core-Components.
    1. NameNode.
    2. ResourceManager / JobTracker.
    3. NodeManager / TaskTracker.
    4. DataNode.
    5. SecondaryNameNode.
  6. HDFS Architecture
    1. Why 64 MB?
    2. Why Block?
    3. Why replication factor 3?
  7. Rack Awareness.
    1. Network Topology.
    2. Assignment of Blocks to Racks and Nodes.
    3. Block Reports
    4. Heart Beat
    5. Block Management Service.
  8. Anatomy of File Write.
  9. Anatomy of File Read.
  10. Hadoop Federation and High Availability
  11. Map Reduce Overview
  12. Cluster Configuration overview
    1. core-default.xml
    2. hdfs-default.xml
    3. mapred-default.xml
    4. yarn-site.xml
    5. hadoop-env.sh
    6. slaves
    7. masters
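The "Why 64 MB?" and "Why replication factor 3?" questions above come down to simple arithmetic; a minimal Python sketch, assuming the classic HDFS defaults:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # classic HDFS default block size (64 MB)
REPLICATION = 3                  # default HDFS replication factor

def hdfs_footprint(file_size_bytes):
    """Return (block count, raw bytes stored cluster-wide) for one file."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)   # last block may be partial
    raw_bytes = file_size_bytes * REPLICATION          # every block is stored 3 times
    return blocks, raw_bytes

# A 1 GB file splits into 16 blocks and consumes 3 GB of raw disk.
blocks, raw = hdfs_footprint(1024 ** 3)
```

A large block keeps the NameNode's per-block metadata manageable and amortizes disk seeks over long sequential reads; three replicas let a file survive the loss of both a node and a rack.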

 

  1. Why Map Reduce?
  2. Use cases for Map Reduce.
  3. Parts of Map Reduce
  4. Shuffle, Sort and Merge phases
  5. HDFS Practicals (HDFS Commands)
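The Map, Shuffle/Sort, and Reduce phases listed above can be sketched in plain Python with the classic word-count example (an illustration of the data flow only, not Hadoop's Java API):

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle: group all values by key; Sort: order the keys.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reducer: collapse each key's value list into a single count.
    return {key: sum(values) for key, values in grouped}

lines = ["big data big hadoop", "hadoop big"]
counts = reduce_phase(shuffle_and_sort(map_phase(lines)))
# counts == {'big': 3, 'data': 1, 'hadoop': 2}
```

In a real job the same three stages run distributed: mappers on input splits, the framework's shuffle moving keys between nodes, and reducers on the grouped output.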

 

  1. Cloudera Distribution of Hadoop (CDH) – VM Setup
  2. Map Reduce Failure Scenarios
  3. Speculative Execution
  4. Input File Formats
  5. Output File Formats

Map Reduce Advanced Concepts:

  1. Joins
  2. Multi outputs
  3. Counters
  4. Distributed Cache

Hadoop 2.X(YARN):

  1. YARN Architecture
  2. Hadoop Classic vs YARN

 

Sqoop:

  1. Sqoop Architecture
  2. Import and Export
  3. Sqoop Hive/HBase Import
  4. Sqoop Practicals

 

Hive:

  1. Hive Background.
  2. What is Hive?
  3. Pig vs Hive
  4. Where to Use Hive?
  5. Hive Architecture
  6. Metastore
  7. Hive execution modes.
  8. External, Managed, Native and Non-native tables.
  9. Hive Partitions:
    1. Dynamic Partitions
    2. Static Partitions
  10. Hive Data Model
  11. Hive Data Types
    1. Primitive
    2. Complex
  12. Queries:
    1. Create Managed Table
    2. Load Data
    3. Insert overwrite table
    4. Insert into Local directory.
    5. Insert Overwrite table select.
  13. Joins
    1. Inner Joins
    2. Outer Joins
    3. Skew Joins
  14. Multi-table Inserts
  15. Multiple files, directories, table inserts.
  16. Serde.
  17. UDF
  18. Hive Practicals
  19. Hive Optimization Techniques and Best Practices
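Hive's join types follow standard SQL semantics, so a quick sqlite3 session (standing in for Hive here; the table names are made up for illustration) shows what the inner vs. outer distinction means:

```python
import sqlite3

# sqlite3 stands in for Hive: HiveQL joins follow the same SQL semantics.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
cur.execute("CREATE TABLE payments (order_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "ann"), (2, "bob")])
cur.executemany("INSERT INTO payments VALUES (?, ?)", [(1, 9.5)])

# Inner join: only orders that have a matching payment survive.
inner = cur.execute(
    "SELECT o.customer, p.amount FROM orders o "
    "JOIN payments p ON o.id = p.order_id").fetchall()

# Left outer join: unmatched orders still appear, with a NULL (None) amount.
outer = cur.execute(
    "SELECT o.customer, p.amount FROM orders o "
    "LEFT OUTER JOIN payments p ON o.id = p.order_id").fetchall()
```

Skew joins (item 13.3) address the distributed side of this: when one join key dominates, Hive can process the heavy key separately so a single reducer is not overwhelmed.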

 

Pig:

  1. The need for Pig
  2. Why was Pig created?
  3. Why use Pig when Map Reduce already exists?
  4. Pig use cases.
  5. Pig built in operators
  6. Operators:
    1. Load
    2. Store
    3. Dump
    4. Filter.
    5. Distinct
    6. Group
    7. CoGroup
    8. Join
    9. Foreach Generate
    10. Limit
    11. ORDER
    12. CROSS
    13. UNION
    14. SPLIT
  7. Dump Vs Store
  8. Data Types
    1. Complex
      1. Bag
      2. Tuple
      3. Atom
      4. Map
    2. Primitives.
      1. Integers
      2. Float
      3. Chararray
      4. Bytearray
      5. Double
  9. Diagnostic Operators
    1. Describe
    2. Explain
    3. Illustrate
  10. UDFs.
    1. Filter Function
    2. Eval Function
    3. Macros
    4. Demo
  11. Storage Handlers.
  12. Pig Practicals and Usecases.
  13. Pig Debugging using Explain and Illustrate commands
  14. Pig Stats.
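Of the operators above, GROUP and COGROUP tend to cause the most confusion; a plain-Python sketch of their semantics (not Pig itself — relations are modeled as lists of tuples, bags as lists):

```python
from collections import defaultdict

def group(relation, key_index):
    """Pig's GROUP: for each key, a bag of the whole tuples sharing it."""
    out = defaultdict(list)
    for t in relation:
        out[t[key_index]].append(t)
    return dict(out)

def cogroup(rel_a, rel_b, key_index=0):
    """Pig's COGROUP: for each key, one bag per input relation (bags may be empty)."""
    keys = {t[key_index] for t in rel_a} | {t[key_index] for t in rel_b}
    return {k: ([t for t in rel_a if t[key_index] == k],
                [t for t in rel_b if t[key_index] == k]) for k in keys}

users  = [("ann", 21), ("bob", 30), ("ann", 22)]
clicks = [("ann", "home"), ("cal", "cart")]
g  = group(users, 0)         # 'ann' maps to a bag holding both of her tuples
cg = cogroup(users, clicks)  # 'cal' appears with an empty users bag
```

This is also why COGROUP differs from JOIN: a join flattens the matched pairs, while COGROUP keeps the two bags side by side, including keys present in only one relation.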

 

HBase:

  1. Introduction to NoSQL Databases.
  2. NoSQL Landscapes
  3. Introduction to HBase
  4. HBase vs RDBMS
  5. Create a Table in HBase using the HBase shell
  6. Write Files to HBase.
  7. Major Components of HBase:
    1. HBase Master.
    2. HRegionServer.
    3. HBase Client.
    4. Zookeeper.
    5. Region.
  8. HBase Practicals
  9. Row Key Design
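Row-key design often combines a salt prefix with a reversed timestamp, so that writes spread across regions and the newest row for an entity sorts first. A hedged Python sketch of that common pattern (the bucket count and key layout are illustrative choices, not an HBase API):

```python
import hashlib

SALT_BUCKETS = 8  # hypothetical bucket count; in practice tuned to region count

def salted_row_key(user_id, timestamp):
    """Build a row key: salt prefix + entity id + reversed timestamp.

    The salt spreads sequential writes across regions (avoids region
    hotspotting); the reversed timestamp makes the newest cell for a
    given user sort lexicographically first.
    """
    salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % SALT_BUCKETS
    reversed_ts = 2 ** 63 - 1 - timestamp    # bigger timestamp -> smaller key
    return f"{salt:02d}|{user_id}|{reversed_ts:020d}"

key = salted_row_key("user42", 1_700_000_000)
```

The trade-off: salting defeats simple range scans (a scan must now fan out over all buckets), which is why the bucket count is kept small and fixed.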

 

Spark:

History of Big Data & Apache Spark

Introduction to the Spark Shell and the training environment

Intro to Spark DataFrames and Spark SQL

Introduction to RDDs

Lazy Evaluation

Transformations and Actions
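Lazy evaluation and the transformation/action split can be previewed without Spark at all: Python generators behave analogously (an analogy only, not Spark's API):

```python
# Transformations build a plan lazily; only an action forces computation.
# Plain-Python analogy using generators.

log = []

def numbers():
    for n in range(5):
        log.append(n)          # record when an element is actually produced
        yield n

# "Transformations": nothing runs yet; the pipeline is just composed generators.
doubled = (n * 2 for n in numbers())
evens   = (n for n in doubled if n % 4 == 0)
assert log == []               # still lazy: no element has been computed

# "Action": consuming the pipeline finally triggers the whole computation.
result = list(evens)           # result == [0, 4, 8]
```

Spark's RDD transformations (`map`, `filter`, ...) likewise only record lineage; work happens when an action such as `collect` or `count` pulls results back to the driver.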


Data Sources: reading from Parquet, HDFS, and your local file system

Spark's Architecture

Programming with Accumulators and Broadcast variables

Debugging and tuning Spark jobs using Spark's admin UIs

Memory & Persistence

Advanced programming with RDDs (understanding the shuffle phase, partitioning, etc.)

 
