Hadoop

Layer         Description
Purpose       Distributed processing of large data sets
Capabilities  Store + process massive data across clusters
Mechanisms    Batch parallelism (MapReduce), distributed storage (HDFS)
Architecture  Master/worker model: NameNode + DataNodes + YARN
Components    HDFS, MapReduce, YARN, Hive, Pig, Spark (extended ecosystem)

Overview

Sam

Hadoop

  • definition: a distributed data platform (HDFS + YARN) that hosts processing engines (e.g. Spark, Hive)

  • capabilities:

    • reliable data storage

    • parallel data analysis

  • mechanisms:

    • horizontal scaling: adds more servers (i.e. nodes) rather than bigger ones

    • fault tolerance: built-in data redundancy (three copies of each block by default) in case a node fails

    • resource isolation: YARN containers (CPU/RAM) allocated to engines

  • components:

    • HDFS (storage)

    • YARN (cluster resource manager)

    • common services: cataloging / security / coordination
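
The storage mechanics above (128 MB blocks, ×3 replication) can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not HDFS's actual rack-aware placement algorithm; the round-robin placement is my own simplification:

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

def plan_storage(file_size_mb, datanodes):
    """Split a file into blocks and assign each block to REPLICATION
    distinct DataNodes (toy round-robin placement)."""
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for b in range(n_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

plan = plan_storage(500, ["dn1", "dn2", "dn3", "dn4"])
# a 500 MB file needs ceil(500/128) = 4 blocks, each held on 3 distinct DataNodes
```

Losing any single node still leaves two replicas of every block, which is the fault-tolerance point above.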

Sequential process

Sam

Hadoop flow: Ingest ⟶ Store ⟶ Govern ⟶ Coordinate ⟶ Process/Analyze/Serve ⟶ Persist

  • ingest with Sqoop/Kafka/Flume

  • store with HDFS

  • govern with Ranger

  • coordinate with YARN

  • process/analyze/serve with MR/Spark/Hive

  • persist with HDFS/HBase
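
The six stages read naturally as one chained pipeline. A toy sketch with in-memory stand-ins for each stage (a dict plays the role of HDFS; the real steps are the tools named above):

```python
def ingest(events):                      # stand-in for Sqoop/Kafka/Flume
    return [e.strip().lower() for e in events]

def store(records, storage):             # stand-in for an HDFS write
    storage["raw"] = records
    return storage["raw"]

def process(records):                    # stand-in for an MR/Spark job
    counts = {}
    for r in records:
        counts[r] = counts.get(r, 0) + 1
    return counts

def persist(result, storage):            # write aggregates back (HDFS/HBase)
    storage["aggregates"] = result
    return storage

hdfs = {}
events = ["Click ", "view", "click", "VIEW", "click"]
final = persist(process(store(ingest(events), hdfs)), hdfs)
# final["aggregates"] == {"click": 3, "view": 2}
```

Govern and Coordinate are omitted here because they wrap the whole pipeline (security and resource scheduling) rather than transforming the data.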

Sam

Hadoop flow in more detail:

  1. Ingest

    • Batch: Sqoop from RDBMS.

    • Real-time: Kafka (or Flume) events.

  2. Store

    • In blocks: HDFS (blocks + replication)

    • In tables: Kudu (mutable columnar)

    • In wide-rows: HBase (NoSQL)

  3. Govern

    • auth/security: Ranger

    • catalogs schemas: Hive Metastore

  4. Coordinate

    • provides cluster resources: YARN

  5. Process/Analyze/Serve

    • Batch: Spark or MR jobs.

    • Stream: Spark Structured Streaming

    • SQL: Hive

    • Search: Solr (text indexing/query)

  6. Persist

    • Write back to storage: HDFS/Kudu/HBase

    • publish aggregates

    • serve: via BI/ML/search
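
The batch-processing step is easiest to see as classic MapReduce. A minimal word count in plain Python, with the map/shuffle/reduce phases simulated in one process (a real MR or Spark job distributes each phase across the cluster's nodes):

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(ones) for word, ones in groups.items()}

lines = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"big": 3, "data": 2, "cluster": 1}
```

Spark expresses the same pattern (map ⟶ groupBy ⟶ aggregate) but keeps intermediate data in memory, which is why it typically outruns classic MR.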

Sam

Modern stacks typically:

  • favor Spark over classic MR

  • use Ranger (Cloudera) rather than the now-retired Sentry

  • lean on Parquet/ORC or open table formats for manageability


Ontology

(Core Entities & Relations)

Sam

  • HDFS

    • stores Files ⟶ broken_into Blocks (≈128 MB)

    • replicated_on DataNodes (×3 by default)

    • managed_by NameNode (namespace + block metadata)

  • YARN

    • schedules / allocates_resources_for MR, Spark

  • Spark

    • reads/writes HDFS/Kudu/HBase

    • managed_by YARN

    • provides SQL, Streaming, MLlib, GraphX

  • Hive/Drill

    • execute_SQL_over HDFS/HBase/Kudu (varying latency and source support)

  • Sqoop

    • moves_between RDBMS ⟷ HDFS (imports/exports; parallel by key; incremental options)
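
Sqoop's "parallel by key" import works by splitting the key range of a table across mappers. A simplified sketch of that split logic (my own toy version, assuming an integer split column; `num_mappers` mirrors Sqoop's `-m` option):

```python
def split_ranges(min_key, max_key, num_mappers):
    """Divide [min_key, max_key] into contiguous ranges, one per mapper,
    roughly as Sqoop partitions a numeric --split-by column."""
    span = max_key - min_key + 1
    base, extra = divmod(span, num_mappers)
    ranges, lo = [], min_key
    for i in range(num_mappers):
        hi = lo + base + (1 if i < extra else 0) - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

split_ranges(1, 1000, 4)
# → [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

Each mapper then runs its own `SELECT ... WHERE key BETWEEN lo AND hi` against the RDBMS, so the import parallelizes cleanly as long as the split column is evenly distributed.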