Ch.1 - What is Apache Spark?
20 Nov 2024

Spark’s toolkit (from high level to low level):
- Top layer: Structured Streaming | Advanced Analytics | Libraries & Ecosystem
- Structured APIs: Datasets, DataFrames, SQL
- Low-level APIs: RDDs, Distributed Variables (the sketch below contrasts the two API layers)
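A minimal sketch of the layering, assuming a local PySpark installation: the same sum is written once with the Structured API (a DataFrame) and once with the low-level RDD API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toolkit-layers").getOrCreate()

# Structured API: a DataFrame with a named column and an optimized plan.
df = spark.range(1000)                          # column "id": 0..999
total_df = df.selectExpr("sum(id)").collect()[0][0]

# Low-level API: the same sum expressed directly on an RDD.
rdd = spark.sparkContext.parallelize(range(1000))
total_rdd = rdd.reduce(lambda a, b: a + b)

print(total_df, total_rdd)                      # both print 499500
spark.stop()
```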
Unified aspect of Spark:
- Consistent, composable APIs (see the sketch after this list)
- Unified engine for parallel data processing
- “Structured APIs” (Datasets, DataFrames, SQL)
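A sketch of what composability looks like in practice, assuming a local PySpark session; the "people" data and column names are illustrative. DataFrame transformations and SQL queries compose against the same engine and optimizer.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("composable-apis").getOrCreate()

people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 29), ("Cara", 41)], ["name", "age"]
)

# DataFrame style: transformations compose into one logical plan.
adults_df = people.where(col("age") >= 30).orderBy(col("age").desc())

# SQL style: register the same data as a view and query it.
people.createOrReplaceTempView("people")
adults_sql = spark.sql(
    "SELECT name, age FROM people WHERE age >= 30 ORDER BY age DESC"
)

# Both produce the same rows through the same engine.
adults_df.show()
adults_sql.show()
spark.stop()
```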
Spark is a computing engine, not a storage system; it works with storage such as:
- Azure Storage and Amazon S3
- Distributed file systems (e.g. Apache Hadoop)
- Key-value stores (e.g. Apache Cassandra)
- Message buses (e.g. Apache Kafka)
Spark focuses on performing computations over the data, no matter where it resides (see the read sketch below).
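A hedged sketch of storage-agnostic reads: the paths, hostnames, and topic name below are placeholders, and the S3, HDFS, and Kafka reads assume the matching connector packages are on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-agnostic").getOrCreate()

# The same read API targets different storage systems; only the URI changes.
local_df = spark.read.csv("file:///tmp/events.csv", header=True)        # local file
hdfs_df = spark.read.parquet("hdfs://namenode:8020/warehouse/events")   # HDFS
s3_df = spark.read.json("s3a://my-bucket/events/")                      # Amazon S3

# A message bus such as Kafka is read through a data source format:
kafka_df = (spark.read.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load())
```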
Hadoop:
- Storage system: the Hadoop file system (HDFS), designed for low-cost storage over clusters of commodity servers
- Computing system: MapReduce
There are environments where the Hadoop architecture cannot work, such as the public cloud or streaming applications -> Spark works there too.
Libraries (see the sketch after this list):
- Structured data and SQL (Spark SQL)
- Machine learning (MLlib)
- Stream processing (Spark Streaming and the newer Structured Streaming)
- Graph analysis (GraphX)
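A small sketch using two of these libraries from one SparkSession; the data and the linear-regression model choice are illustrative, not from the book.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("libraries").getOrCreate()

df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)], ["x", "y"])

# Spark SQL on the data:
df.createOrReplaceTempView("points")
spark.sql("SELECT avg(y) FROM points").show()

# MLlib: fit a linear regression using "x" as the feature.
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients)

# Structured Streaming uses the same DataFrame API via spark.readStream
# (not run here, since it needs a live source).
spark.stop()
```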
Context: the big data problem
- Hardware advancement: before 2005, computers became faster every year through processor clock-speed increases
- That trend stopped around 2005 due to hard limits in heat dissipation
- Hardware developers switched to adding more parallel CPU cores, all running at the same speed
- The cost to store 1 TB of data continues to drop by roughly 2x every 14 months
- Collecting data is extremely inexpensive
- Processing huge amounts of data requires large, parallel computation, often on clusters of machines
Running Spark
- Spark can be used from Python, Java, Scala, R, or SQL
- Spark is written in Scala and runs on the Java Virtual Machine (JVM)
- To run Spark, you must first install Java (see the smoke test below)
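A minimal smoke test, assuming Java is installed and PySpark is available (e.g. via `pip install pyspark`):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")          # run locally, using all available cores
         .appName("hello-spark")
         .getOrCreate())

print(spark.range(100).count())      # should print 100
spark.stop()
```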