Spark Review

Posted on Fri, Jul 22, 2022 · Tech Skill

Spark Architecture

cluster: a group of machines that pools their resources together, allowing us to use the cumulative resources as if they were one machine

cluster manager: controls the physical machines and allocates resources to Spark applications; maintains an understanding of the resources available

Spark is a tool for managing and coordinating the execution of tasks on data across a cluster of computers.

Spark application = Spark driver + Spark executors

Spark can run in both cluster mode and local mode.
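In local mode everything runs inside a single JVM, which is convenient for testing. A minimal sketch of building a local session (the app name is a hypothetical placeholder):

import org.apache.spark.sql.SparkSession

// "local[*]" runs the driver and executor threads in one JVM,
// one worker thread per available core
val spark = SparkSession.builder()
  .appName("spark-review") // hypothetical app name
  .master("local[*]")
  .getOrCreate()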

spark driver: computes the resource requirements of the application and requests those resources from the cluster manager; it is responsible for

  1. maintaining information about the spark application
  2. responding to a user's program or input
  3. analyzing, distributing and scheduling work across the executors

spark executor: executes the tasks assigned by the Spark driver and reports their status back to the driver

  1. executing code assigned to it by the driver
  2. reporting the state of the computation back to the driver node
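As a sketch of how resources are requested per executor, these settings can be passed when building the session; the master URL and all of the values here are arbitrary assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-review")                 // hypothetical app name
  .master("yarn")                          // assumes a YARN cluster manager
  .config("spark.executor.instances", "4") // how many executors to request
  .config("spark.executor.cores", "2")     // CPU cores per executor
  .config("spark.executor.memory", "4g")   // memory per executor
  .getOrCreate()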

partition: the rows held by one machine (executor) in the cluster

Spark Execution Hierarchy

job: the highest level of the hierarchy; each action triggers one job

stage: a group of tasks that can be executed together to compute the same transformation on multiple machines; stages depend on one another, with boundaries created by shuffles

task: a unit of computation applied to a single partition on a single executor

an example: an executor with 4 CPUs and 8 partitions can only have 4 tasks running in parallel
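Both numbers can be inspected at runtime; a small sketch, assuming an existing session spark and DataFrame df:

// number of partitions backing the DataFrame
df.rdd.getNumPartitions
// roughly the total cores available to the application (the parallelism ceiling)
spark.sparkContext.defaultParallelism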

spark job → action

— stage 1

  — task 1

  — task 2

  — task 3

— stage 2

— stage 3
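As a concrete sketch, the wide transformation below splits one job into two stages, which you can verify in the Spark UI after the action runs:

import org.apache.spark.sql.functions.col

// stage 1: generate the data and compute partial counts per partition
// stage 2: merge the partial counts for each key after the shuffle
val counts = spark.range(1000)
  .withColumn("key", col("id") % 10)
  .groupBy("key")
  .count()

counts.collect() // the action: triggers one job with two stages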

DataFrames: distributed and immutable; a Spark DataFrame can span thousands of computers.

Partition: a collection of rows that sit on one physical machine in our cluster. To allow every executor to perform work in parallel, Spark breaks the data up into chunks → partitions.
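The partitioning can also be changed explicitly; a sketch, assuming a DataFrame named customerDf as in the lineage example further down:

// repartition is a wide transformation: it shuffles rows into 8 fresh partitions
val repartitioned = customerDf.repartition(8)
// coalesce reduces the partition count without a full shuffle
val coalesced = repartitioned.coalesce(2)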

RDD Operations

Transformation

Transformations are the core of how you express your business logic using Spark.

examples of transformations (sketched below): map(), filter()
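A minimal sketch on an RDD, assuming a SparkContext named sc as in the spark-shell:

// both map() and filter() are transformations: nothing executes yet
val doubled = sc.parallelize(1 to 10).map(_ * 2)
val filtered = doubled.filter(_ > 10)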

Applying transformations builds up the RDD lineage: a logical execution plan, a DAG of all the parent RDDs of an RDD

example:

import org.apache.spark.sql.functions.lit

val df = customerDf
  .withColumnRenamed("email_address", "mail")
  .withColumn("developer_site", lit("test.com"))
  .drop("address_id")
  .filter("birth_date > 20")

// print the execution plan
df.explain("formatted")
// 1. scan: read the source (here a JSON file)
// 2. filter: keep only the rows matching the condition
// 3. project: drop columns, rename columns, generate new columns

DAG: rename column → add column → drop column → filter rows (acyclic: no cycles). Note that in the plan printed above the filter runs before the projection: the Catalyst optimizer pushes filters down toward the scan.

Transformations are lazy: they are not executed immediately, but only when we call an action.

Two types of transformations (sketched after the list):

  1. narrow transformation: all required input elements live in a single partition of the parent RDD
  2. wide transformation: the required elements live in many partitions of the parent RDD

    shuffle: Spark exchanges partitions across the cluster
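A sketch of both kinds on the customerDf from above; the country column is a hypothetical assumption:

// narrow: each output partition depends on exactly one input partition
val adults = customerDf.filter("birth_date > 20")
// wide: rows with the same key must be brought together, so Spark shuffles
val perCountry = customerDf.groupBy("country").count()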

Lazy evaluation

Spark will wait until the very last moment to execute the graph of computation instructions. Instead of modifying the data immediately when we express an operation, we build up a plan of transformations that we would like to apply to our source data.

The modification is not applied immediately; it waits until the computation is triggered.

trigger evaluation = action
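A sketch of this behaviour on the customerDf from above:

// nothing runs here: Spark only records the plan
val renamed = customerDf.withColumnRenamed("email_address", "mail")
// count() is an action, so the whole plan executes now
val n = renamed.count()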

Action

Actions trigger the computation of all the transformations applied to a DataFrame.

examples of actions: count(), collect(), take(n), top(), reduce(), aggregate()

Three kinds of actions (one sketch of each follows the list):

  1. actions to view data in the console
  2. actions to collect data to native objects in the respective language
  3. actions to write to output data sources
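// 1. view data in the console
df.show(5)
// 2. collect data to a native object: an Array[Row] on the driver
val rows = df.collect()
// 3. write to an output data source
df.write.parquet("/tmp/customers") // hypothetical output path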

Exam topics:

Spark Architecture β€” Conceptual

Spark Architecture β€” Applied

Spark DataFrame API