
Ch.1 - What is Apache Spark?

Spark’s toolkit (the layer diagram from the book, top to bottom; a sketch of the two API layers follows):

  • Structured Streaming, Advanced Analytics, Libraries & Ecosystem
  • Structured APIs: Datasets, DataFrames, SQL
  • Low-level APIs: RDDs, Distributed Variables
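
A minimal sketch contrasting the two API layers, assuming a local PySpark installation (the data and app name are made up): the same per-key sum written against the low-level RDD API and the Structured APIs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toolkit-demo").getOrCreate()

# Low-level API: an RDD of (key, value) pairs, manipulated with plain functions.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(rdd.reduceByKey(lambda x, y: x + y).collect())  # e.g. [('a', 4), ('b', 2)]

# Structured API: a DataFrame with named columns, manipulated declaratively.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()
```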

Unified aspect of Spark:

  • Consistent, composable APIs
  • Unified engine for parallel data processing
  • “Structured APIs” (Datasets, DataFrames, SQL); the sketch below shows the same query in both DataFrame and SQL form
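
A hedged sketch of the composability point, reusing the `spark` session from the sketch above: the same query expressed as chained DataFrame transformations and as SQL, both executed by the same engine.

```python
df = spark.range(1000).toDF("number")

# Composable: each transformation returns a new DataFrame to build on.
evens = df.where("number % 2 = 0").selectExpr("number * 2 AS doubled")

# The identical query in SQL, via a temporary view.
df.createOrReplaceTempView("numbers")
evens_sql = spark.sql(
    "SELECT number * 2 AS doubled FROM numbers WHERE number % 2 = 0")

print(evens.count(), evens_sql.count())  # same count either way: 500 500
```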

Computing engine:

Spark is a computing engine, not a persistent storage system. It works with storage such as:

  • Azure Storage and Amazon S3
  • Distributed file systems (e.g. Apache Hadoop)
  • Key-value stores (e.g. Apache Cassandra)
  • Message buses (e.g. Apache Kafka)

Spark focuses on performing computations over the data, no matter where it resides (see the sketch below).
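
A hedged sketch of that storage-agnostic design, again reusing a `spark` session; the bucket, paths, topic, and broker address are hypothetical, and the matching connector packages (e.g. hadoop-aws for S3, spark-sql-kafka for Kafka) must be on the classpath.

```python
# Same read API, different storage systems (all names below are placeholders).
df_s3   = spark.read.csv("s3a://my-bucket/events.csv", header=True)       # Amazon S3
df_hdfs = spark.read.parquet("hdfs://namenode:8020/data/events.parquet")  # HDFS
kafka_stream = (spark.readStream                                          # Apache Kafka
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load())
```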

Hadoop:

  • Storage system: the Hadoop file system (HDFS), designed for low-cost storage over clusters of commodity servers
  • Computing system: MapReduce
  • Environments where the Hadoop architecture does not work well (e.g. public cloud, streaming applications) -> Spark can run there too

Libraries:

  • SQL and structured data (Spark SQL)
  • Machine learning (MLlib)
  • Stream processing (Spark Streaming and the newer Structured Streaming)
  • Graph analysis (GraphX); see the sketch below for where these live in PySpark
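
For orientation, a small sketch of where these libraries surface in PySpark (import paths only):

```python
from pyspark.sql import functions as F               # Spark SQL / structured data
from pyspark.ml.feature import VectorAssembler       # MLlib (DataFrame-based ML)
from pyspark.sql.streaming import DataStreamReader   # Structured Streaming
# GraphX itself is JVM-only; from Python one typically uses the separate
# GraphFrames package instead.
```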

Context: the big data problem

  • Hardware advancement: before ~2005, computers became faster every year through increases in processor clock speed
  • That hardware trend stopped around 2005 due to hard limits on heat dissipation
  • Hardware developers switched to adding more parallel CPU cores, all running at the same speed
  • Meanwhile, the cost of storing 1 TB of data continues to drop by roughly a factor of two every 14 months
  • Collecting data is therefore extremely inexpensive
  • Processing huge amounts of data requires large, parallel computation, often on clusters of machines

Running Spark

  • Spark can be used from Python, Java, Scala, R, or SQL
  • Spark is written in Scala, and runs on the Java Virtual Machine (JVM)
  • To run Spark, one has to install Java (a minimal session sketch follows)
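
A minimal sketch of a first session from Python, assuming Java is installed and PySpark was installed with `pip install pyspark`:

```python
# Minimal local SparkSession, assuming Java and PySpark are installed.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")       # run locally, using all available cores
         .appName("hello-spark")   # the app name is arbitrary
         .getOrCreate())
print(spark.version)               # confirm the session is up
spark.stop()
```

Interactively, an equivalent session is already available as the prebuilt `spark` variable in the pyspark and spark-shell consoles.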