- It was designed to solve what MapReduce failed to address: performance suffers because there is no way to reuse data between computations. Two workloads hit this hardest:
- Iterative jobs (popular in Machine Learning algorithms)
- Interactive analytics (ad hoc exploratory queries)
- Resilient distributed dataset (RDD): a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. RDDs can be cached and reused in multiple parallel operations.
Fault tolerance is achieved through lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to rebuild just that partition.
- A handle to an RDD contains enough information to compute the RDD starting from data in reliable storage.
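A minimal sketch of the lineage idea (my own toy model, not Spark's code): each derived dataset records only its parent and the function that produced it, so a lost partition can be recomputed on demand.

```scala
object LineageSketch {
  trait Dataset[T] { def compute(partition: Int): Seq[T] }

  // Base data lives in reliable storage (an in-memory stand-in here).
  class StoredDataset[T](parts: Seq[Seq[T]]) extends Dataset[T] {
    def compute(partition: Int): Seq[T] = parts(partition)
  }

  // A derived dataset holds only its parent handle and the function f;
  // "rebuilding" a lost partition is just calling compute() again.
  class MappedDataset[T, U](parent: Dataset[T], f: T => U) extends Dataset[U] {
    def compute(partition: Int): Seq[U] = parent.compute(partition).map(f)
  }

  def main(args: Array[String]): Unit = {
    val base    = new StoredDataset(Seq(Seq(1, 2), Seq(3, 4)))
    val doubled = new MappedDataset(base, (x: Int) => x * 2)
    println(doubled.compute(1)) // recomputes partition 1 => List(6, 8)
  }
}
```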
Constructing RDDs
- From a file in HDFS
- Parallelizing a Scala collection
- Transforming an existing RDD
- Change the persistence of an RDD
- Cache: lazy; keeps the dataset in memory after it is first computed. A hint only, not enforced if there is not enough space.
- Save: writes the dataset to a filesystem such as HDFS.
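The four constructors look roughly like this in the paper's Scala API (a sketch using the paper's `spark` context handle; exact method names changed in later Spark versions):

```scala
val file   = spark.textFile("hdfs://...")      // from a file in HDFS
val nums   = spark.parallelize(1 to 1000)      // from a Scala collection
val errs   = file.filter(_.contains("ERROR"))  // transforming an existing RDD
val cached = errs.cache()                      // persistence hint: keep in memory
// errs.save("hdfs://...")                     // or persist to the filesystem
```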
Parallel Operations
- Reduce: combines dataset elements using an associative function to produce a result at the driver program; reduce results are collected at only one process.
- Collect: sends all elements of the dataset to the driver program
- Foreach: passes each element through a user provided function
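For example, continuing the `errs` RDD from the earlier sketch (same caveats about the exact API surface):

```scala
val count = errs.map(_ => 1).reduce(_ + _) // reduce: results combined at the driver
val local = errs.collect()                 // collect: ship every element to the driver
errs.foreach(line => println(line))        // foreach: run a function on each element
```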
Shared Variables
- Broadcast variables: distribute a large piece of read-only data to all workers once, rather than packaging it with every closure.
- Accumulators: workers can only add to it; only the driver can read it.
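A sketch of both kinds in use, continuing the `file` RDD from above (the `spark` handle follows the paper; `loadLookupTable` and `keyOf` are hypothetical helpers, and the creation methods are assumed from the paper's description):

```scala
val lookup = spark.broadcast(loadLookupTable()) // hypothetical loader; shipped once per worker
val misses = spark.accumulator(0)               // workers may only add to it

file.foreach { line =>
  if (!lookup.value.contains(keyOf(line)))      // keyOf is a hypothetical helper
    misses += 1
}
println(misses.value)                           // readable only at the driver
```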
Implementation
- What is Mesos?!
- Spark is built on top of Mesos [16, 15], a “cluster operating system” that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster
RDD Implementation
- Internally, each RDD object implements the same simple interface, which consists of three operations:
- getPartitions: returns a list of partition IDs.
- getIterator(partition): iterates over a partition.
- getPreferredLocations(partition): used for task scheduling to achieve data locality.
- Delay scheduling: send each task to one of its preferred locations.
- If a node fails, its partitions are re-read from their parent datasets and eventually cached on other nodes.
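In Scala terms, that internal interface might look like the trait below (signatures are my guesses from the paper's description, not Spark's actual source):

```scala
trait RDD[T] {
  def getPartitions: Seq[Int]                             // list of partition IDs
  def getIterator(partition: Int): Iterator[T]            // compute or read one partition
  def getPreferredLocations(partition: Int): Seq[String]  // nodes to prefer for locality
}
```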
Shared Variables Implementation
- Broadcast variables and accumulators are implemented using classes with custom serialization formats
- A broadcast variable's value is saved to a shared filesystem; each worker fetches it once and caches it locally.
- An accumulator's serialized form carries a unique ID and a zero value; each worker updates its own local copy starting from zero and sends it back to the driver for the global update.
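A sketch of the custom-serialization trick for broadcast variables: serializing the wrapper writes only an ID, and deserializing on a worker fetches and caches the payload instead of reading it from the stream. `BroadcastStore` is a hypothetical stand-in for the shared filesystem.

```scala
import java.io.{IOException, ObjectInputStream}

// Hypothetical stand-in for the shared filesystem.
object BroadcastStore {
  private val store  = scala.collection.concurrent.TrieMap[Long, Any]()
  private val nextId = new java.util.concurrent.atomic.AtomicLong(0)
  def put(v: Any): Long = { val id = nextId.getAndIncrement(); store(id) = v; id }
  def fetchAndCache[T](id: Long): T = store(id).asInstanceOf[T]
}

class Broadcast[T](@transient private var payload: T) extends Serializable {
  val id: Long = BroadcastStore.put(payload) // save the value once, at creation
  def value: T = payload

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()                   // reads only `id`; payload is @transient
    payload = BroadcastStore.fetchAndCache[T](id)
  }
}
```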
Interpreter Integration
- Scala compiles a class for each line typed by the user, including a singleton object that contains the variables and functions defined on that line.
- Previous lines are referenced via Class.getInstance.
- Spark changed this to output the compiled classes to a shared filesystem and to reference the singleton objects directly.
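Roughly the scheme described (names like `Line1` are illustrative, not what scalac actually emits):

```scala
// Each typed line becomes a compiled class with a singleton holding
// that line's definitions; later lines reference earlier singletons.
object Line1 { var x = 5 }                         // scala> var x = 5
object Line2 { def run() = println(Line1.x + 2) }  // scala> println(x + 2)
// Spark writes these compiled classes to a shared filesystem so that
// worker nodes can load them and run closures that mention `x`.
```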
Performance benchmarks
- Logistic regression runs ~10x faster than Hadoop MapReduce, since the working set stays cached after the first iteration.
- Interactive queries are much faster after the first query, e.g. 35 s → 0.5 s.
Related Work
- Distributed Shared Memory
- Fault tolerance: checkpointing vs. lineage; lineage is the better fit for RDDs:
- Lineage: only the lost partitions need to be recomputed, and that can be done in parallel on different nodes, without requiring the program to revert to a checkpoint. No overhead if no nodes fail.
- Language Integration
- Unlike DryadLINQ, Spark allows RDDs to persist in memory across parallel operations. [DryadLINQ compiles language-integrated queries into Dryad dataflow jobs, but does not keep datasets in memory between them.]
- In addition, Spark enriches the language integration model by supporting shared variables (broadcast variables and accumulators), implemented using classes with custom serialized forms.
Future work — was this achieved?
- Formally characterize the properties of RDDs and Spark’s other abstractions, and their suitability for various classes of applications and workloads.
- **Enhance the RDD abstraction to allow programmers to trade off between storage cost and reconstruction cost.**
- Define new operations to transform RDDs, including a “shuffle” operation that repartitions an RDD by a given key. Such an operation would allow us to implement group-bys and joins (see the sketch after this list).
- Provide higher-level interactive interfaces on top of the Spark interpreter, such as SQL and R [4] shells.
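For the shuffle idea above, a local sketch of hash repartitioning (illustrative only; this predates the `groupByKey`/`join` operators Spark later added):

```scala
// Route key-value pairs so that equal keys land in the same output
// partition; that co-location is what makes group-bys and joins possible.
def shuffle[K, V](parts: Seq[Seq[(K, V)]], numParts: Int): Seq[Seq[(K, V)]] = {
  val out = Array.fill(numParts)(Seq.newBuilder[(K, V)])
  for (part <- parts; kv <- part)
    out(((kv._1.hashCode % numParts) + numParts) % numParts) += kv
  out.toSeq.map(_.result())
}
```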