Spark Architecture
cluster: a group of machines that pools their resources together, allowing us to use all the cumulative resources as if they were a single computer
cluster manager: controls physical machines and allocates resources to spark applications, maintains an understanding of the resources available
spark is a tool for managing and coordinating the execution of tasks on data across a cluster of computers
Spark application = one spark driver + one or more spark executors
spark has both cluster mode (driver and executors run on a cluster managed by a cluster manager) and local mode (everything runs on a single machine); see the sketch below.
spark driver: computes the resource requirements of the application and requests resources from the cluster manager
- maintaining information about the spark application
- responding to a user's program or input
- analyzing, distributing and scheduling work across the executors
spark executor: executes tasks assigned by the spark driver and reports the status back to the driver
- executing code assigned to it by the driver
- reporting the state of the computation back to the driver node
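example (a minimal sketch of the entry point the driver program creates; the application name is illustrative, "local[*]" means local mode using all cores of this machine, and on a real cluster the master would point at the cluster manager instead):
import org.apache.spark.sql.SparkSession

// the driver process creates the SparkSession, the handle to the whole spark application
val spark = SparkSession.builder()
  .appName("spark-architecture-notes")  // illustrative application name
  .master("local[*]")                   // local mode: driver and executors share one JVM on this machine
  .getOrCreate()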
partition: a chunk of rows held by one machine (executor) in the cluster
Spark Execution Hierarchy
job: the highest level of the hierarchy; each action triggers one job
stage: a group of tasks that can be executed together to compute the same transformation on multiple machines; stages depend on each other, and a new stage starts at every shuffle boundary
task: a unit of computation applied to a single partition on a single executor
an example: an executor with 4 CPU cores working on 8 partitions can only have 4 tasks running in parallel; the remaining partitions are processed as cores free up
spark job (triggered by an action)
    → stage 1
        → task 1
        → task 2
        → task 3
    → stage 2
    → stage 3
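example (a hedged sketch of how one action maps onto jobs, stages and tasks; df and its "country" column are illustrative, not defined in these notes):
// wide transformation: a shuffle is needed, so a new stage starts here
val counts = df.groupBy("country").count()
// action: triggers one job, typically two stages (before/after the shuffle),
// each stage running one task per partition
counts.collect()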
DataFrames: distributed and immutable; a spark dataframe can span thousands of computers.
Partition: a collection of rows that sit on one physical machine in our cluster. In order to allow every executor to perform work in parallel, spark breaks up the data into chunks called partitions.
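example (a minimal sketch; assuming the customer data used below lives in a JSON file, with an illustrative path):
// the scan reads the JSON file and splits the data into partitions
val customerDf = spark.read.json("/data/customers.json")
// how many partitions (chunks) spark created for this DataFrame
customerDf.rdd.getNumPartitions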
RDD Operations
Transformation
Transformations are the core of how you will express your business logic using spark.
examples of transformation: map(), filter()
Applying transformations builds up the RDD lineage: a logical execution plan, i.e. a DAG of all the parent RDDs of the final RDD
example:
import org.apache.spark.sql.functions.lit

val df = customerDf
  .withColumnRenamed("email_address", "mail")
  .withColumn("developer_site", lit("test.com"))
  .drop("address_id")
  .filter("birth_date > 20")

// print the physical execution plan ("formatted" mode requires Spark 3.0+)
df.explain("formatted")
// 1. scan: read the customer JSON from disk
// 2. filter: keep only the rows matching the filter condition
// 3. projection: drop columns, rename columns, generate new columns
DAG: rename column → add column → drop column → filter rows (acyclic: no cycles)
Transformations are lazy: they are not executed immediately, they only get executed when we call an action.
Two types of transformations: narrow (each input partition contributes to at most one output partition, e.g. map(), filter()) and wide (an input partition can contribute to many output partitions, which requires a shuffle, e.g. groupBy(), join())
Lazy evaluation
spark will wait until the very last moment to execute the graph of computation instructions. Instead of modifying the data immediately when we express some operation, we build up a plan of transformations that we would like to apply to our source data.
the modification is not applied immediately; it waits until the computation is triggered
trigger evaluation = action
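example (a hedged sketch reusing customerDf from above: the transformation only extends the plan, the action triggers it):
// lazy: only the logical plan is built, nothing is read or filtered yet
val adults = customerDf.filter("birth_date > 20")
// action: now the whole plan runs (scan -> filter -> count)
val numAdults = adults.count()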
Action
Actions trigger the computation of all the transformations applied on a dataframe.
examples of action: count(), collect(), take(n), top(), reduce(), aggregate()
Three kinds of actions:
- actions to view data in the console
- actions to collect data to native objects in the respective language
- actions to write to output data sources
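example (a hedged sketch of the three kinds, reusing adults from above; the output path is illustrative):
adults.show(5)                                         // view data in the console
val firstTen = adults.take(10)                         // collect rows to native Scala objects on the driver
adults.write.mode("overwrite").parquet("/tmp/adults")  // write to an output data source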
Exam topics:
Spark Architecture β Conceptual
- Cluster architecture: nodes, drivers, workers, executors, slots, etc.
- Spark execution hierarchy: applications, jobs, stages, tasks, etc.
- Shuffling
- Partitioning
- Lazy evaluation
- Transformations vs Actions
- Narrow vs Wide transformations
Spark Architecture β Applied
- Execution deployment modes
- Stability
- Storage levels
- Repartitioning and coalescing
- repartition() is used to increase or decrease the number of partitions of an RDD/DataFrame/Dataset, whereas coalesce() can only decrease the number of partitions, and does so more efficiently (see the sketch after this list).
- Spark repartition() is a very expensive operation because it shuffles the data across all partitions; coalesce() avoids a full shuffle by merging existing partitions, but both move data around, so try to minimize repartitioning as much as possible.
- Broadcasting
- In Spark RDD and DataFrame code, broadcast variables are read-only shared variables that are cached and available on all nodes in the cluster so that tasks can access and use them.
- Instead of sending this data along with every task, Spark distributes broadcast variables to each machine using efficient broadcast algorithms to reduce communication costs (see the sketch after this list).
- DataFrames
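example (a hedged sketch for the repartitioning/coalescing and broadcasting bullets above; the partition counts, the lookup map, the "country" column and smallLookupDf are illustrative):
import org.apache.spark.sql.functions.broadcast

// repartition(): full shuffle, can increase or decrease the number of partitions
val redistributed = customerDf.repartition(200)
// coalesce(): merges existing partitions without a full shuffle, can only decrease
val merged = redistributed.coalesce(10)

// broadcast variable: read-only value cached once on every node, usable by all tasks
val countryNames = spark.sparkContext.broadcast(Map("DE" -> "Germany", "FR" -> "France"))
countryNames.value("DE")

// broadcast join hint: ship the small DataFrame to every executor instead of shuffling the large one
val joined = customerDf.join(broadcast(smallLookupDf), "country")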
Spark DataFrame API
- Subsetting DataFrames (select, filter, etc.)
- Column manipulation (casting, creating columns, manipulating existing columns, complex column types)
- String manipulation (Splitting strings, regex)
- Performance-based operations (repartitioning, shuffle partitions, caching)
- Combining DataFrames (joins, broadcasting, unions, etc)
- Reading/writing DataFrames (schemas, overwriting)
- Working with dates (extraction, formatting, etc)
- Aggregations
- Miscellaneous (sorting, missing values, typed UDFs, value extraction, sampling)
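example (a hedged sketch touching several of these API areas; the column names mail and signup_date are illustrative, not taken from the notes above):
import org.apache.spark.sql.functions._

customerDf
  .select("mail", "signup_date")                            // subsetting
  .withColumn("domain", split(col("mail"), "@")(1))         // string manipulation: splitting
  .withColumn("signup_year", year(col("signup_date")))      // working with dates: extraction
  .groupBy("domain").agg(count("*").alias("customers"))     // aggregation
  .orderBy(desc("customers"))                                // sorting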