Ch.1 - What is Apache Spark?
10 Jan 2025
Spark’s toolkit
Layers, from top to bottom:
- Structured Streaming | Advanced Analytics | Libraries & Ecosystem
- Structured APIs: Datasets, DataFrames, SQL
- Low-level APIs: RDDs, Distributed Variables
Unified aspects of Spark:
- Consistent, composable APIs
- Unified engine for parallel data processing
- “Structured APIs” (Datasets, DataFrames, SQL)
Computing engine:
- Spark focuses on performing computations over the data, no matter where it resides.
Storage systems Spark can work with:
- Cloud storage (e.g. Azure Storage and Amazon S3)
- Distributed file systems (e.g. Apache Hadoop)
- Key-value stores (e.g. Apache Cassandra)
- Message buses (e.g. Apache Kafka)
Hadoop:
- Storage system: the Hadoop Distributed File System (HDFS), designed for low-cost storage over clusters of commodity servers
- Computing system: MapReduce
- Spark also works in environments where the Hadoop architecture does not fit well, such as the public cloud and streaming applications
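For reference, the MapReduce computing model mentioned above can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce steps on a word count, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are made up for these notes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))


counts = word_count(["spark runs on hadoop", "spark runs in the cloud"])
print(counts["spark"])   # 2
print(counts["hadoop"])  # 1
```

In a real cluster the map and reduce steps run on many machines and the shuffle moves data over the network; here everything runs in one process just to show the shape of the model.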
Libraries:
- SQL and structured data (Spark SQL)
- Machine learning (MLlib)
- Stream processing (Spark Streaming and the newer Structured Streaming)
- Graph analysis (GraphX)
Context: the big data problem
- Before 2005: computers became faster every year through increases in processor speed
- This trend stopped around 2005 due to hard limits in heat dissipation
- Hardware developers switched to adding more parallel CPU cores, all running at the same speed
- The cost to store 1 TB of data continues to drop by roughly a factor of two every 14 months
- As a result, collecting data is extremely inexpensive
- Processing huge amounts of data requires large, parallel computations, often on clusters of machines
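To make the storage-cost trend concrete, here is a small worked calculation. It assumes the cost halves smoothly every 14 months (the notes only say "roughly"), and the starting cost of 100 is a hypothetical figure in arbitrary units, not a real price.

```python
def projected_cost(initial_cost, months, halving_period_months=14.0):
    """Cost after `months`, assuming it halves every `halving_period_months`."""
    return initial_cost * 0.5 ** (months / halving_period_months)


# Hypothetical starting cost of 100 (arbitrary units) for 1 TB.
for months in (0, 14, 28, 42):
    print(months, projected_cost(100.0, months))
# 0 100.0
# 14 50.0
# 28 25.0
# 42 12.5
```

Under this assumption the cost falls to one eighth in three and a half years, which is why keeping data becomes cheap long before processing it does.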
Running Spark
- Spark can be used from Python, Java, Scala, R, or SQL
- Spark is written in Scala, and runs on the Java Virtual Machine (JVM)
- To run Spark, one has to install Java
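A quick way to check the Java prerequisite before installing Spark is a small script like the one below. It is only a sketch using the Python standard library: the function name `check_spark_prerequisites` is made up for these notes, and the check for the `pyspark` package is an extra assumption that only matters when Spark is used from Python.

```python
import importlib.util
import shutil
import subprocess


def check_spark_prerequisites():
    """Report whether a `java` executable and the `pyspark` package are found."""
    java_path = shutil.which("java")  # None when java is not on the PATH
    java_version = None
    if java_path is not None:
        try:
            # `java -version` writes its report to stderr, not stdout.
            result = subprocess.run(
                [java_path, "-version"],
                capture_output=True, text=True, timeout=30,
            )
            lines = result.stderr.strip().splitlines()
            if result.returncode == 0 and lines:
                java_version = lines[0]
        except (OSError, subprocess.TimeoutExpired):
            java_version = None  # java was found but could not be run
    return {
        "java_path": java_path,
        "java_version": java_version,
        # Assumption: PySpark is the Python package one would use with Spark.
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


info = check_spark_prerequisites()
for name, value in info.items():
    print(f"{name}: {value}")
```

If `java_path` comes back as `None`, Java needs to be installed first; the script does not check which Java versions a given Spark release supports.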